Algorithm Wildcard Searching with *

#	User	Rating
1	tourist	3985
2	jiangly	3885
3	jqdai0815	3682
4	Benq	3580
5	orzdevinwang	3526
6	ksun48	3506
7	ecnerwala	3505
8	Radewoosh	3457
9	Kevin114514	3377
10	gamegame	3374

#	User	Contrib.
1	cry	170
2	-is-this-fft-	162
2	Um_nik	162
4	atcoder_official	160
5	djm03178	157
6	Dominater069	156
7	adamant	153
8	luogu_official	152
8	awoo	152
10	maomao90	148

I am thinking about an effient algorithm for wildcard searching with * representing any characters with any length.

aa*c, she*he, *she*he

Example:

caa find aa*c
return "DOES NOT EXIST"
hesherheshe find she*he
return 2
Because sherheshe begins at index 2
sherheshe find she*he
return 0
Because the whole string matches

When I am supposed to return, say, the beginning index of the first matching instance.

Say the pattern is of length M and the document is of length N, and the pattern has K '*' signs. I can think of a solution that first uses AC Automation to find all occurences of each chunk in O(N + M), with bitmask.

While converting the bitmask to indexes takes O(N * K)

Then binary search for the last possible beginning positions for each chunk. This could take O (K log N)

So the overall time complexity is still O (N * K), any way to do better?

References

https://codeforces.net/blog/entry/111380

https://codeforces.net/problemset/problem/1023/A

https://codeforces.net/blog/entry/127169

Comments (5)

Write comment?

piggy123

12 days ago, # |

I'm not acquainted with regex expressions, so the following might be a misinterpretation of your problem, but it seems that what you are asking is essentially to find a series of non-intersecting occurrences of several strings. We should now seek a way to find all occurrences of a certain string, and find the first one past a certain threshold (determined by the last string's position). This can be accomplished by the use of a Suffix Automaton, which turns the problem into online queries about the lower bound upon the values within a node's subtree (Since the fail tree/suffix tree's subtree values represent the endpos set for a string). This can be done by a persistent segment tree and binary search upon its structure, resulting in a time and memory complexity of O(nlogn).

→ Reply

12 days ago, # ^ |

Wait wait wait, if my interpretation is correct, there is no need to be so complex. Each time you just need to find the first occurrence of each string between the '*'s, so you can just run KMP upon each of them (taking O(n) time), and scan thru the document, it should only take O(n+m) time.

I'm now convinced I've misunderstood. Please give some more explanation :(

cardcounter

← Rev. 3 →

I see what you are saying. red coders are genius!

eg. seg1*seg2*seg3, I spent weeks thinking about AC automation.

KMP and incrementally build next array for strings.

Auto comment: topic has been updated by cardcounter (previous revision, new revision, compare).

cardcounter's blog