I am thinking about an effient algorithm for wildcard searching with * representing any characters with any length.
aa*c, she*he, *she*he
Example:
caa find aa*c
return "DOES NOT EXIST"
hesherheshe find she*he
return 2
Because sherheshe begins at index 2
sherheshe find she*he
return 0
Because the whole string matches
When I am supposed to return, say, the beginning index of the first matching instance.
Say the pattern is of length M and the document is of length N, and the pattern has K '*' signs. I can think of a solution that first uses AC Automation to find all occurences of each chunk in O(N + M), with bitmask.
While converting the bitmask to indexes takes O(N * K)
Then binary search for the last possible beginning positions for each chunk. This could take O (K log N)
So the overall time complexity is still O (N * K), any way to do better?
References
https://codeforces.net/blog/entry/111380
I'm not acquainted with regex expressions, so the following might be a misinterpretation of your problem, but it seems that what you are asking is essentially to find a series of non-intersecting occurrences of several strings. We should now seek a way to find all occurrences of a certain string, and find the first one past a certain threshold (determined by the last string's position). This can be accomplished by the use of a Suffix Automaton, which turns the problem into online queries about the lower bound upon the values within a node's subtree (Since the fail tree/suffix tree's subtree values represent the endpos set for a string). This can be done by a persistent segment tree and binary search upon its structure, resulting in a time and memory complexity of O(nlogn).
Wait wait wait, if my interpretation is correct, there is no need to be so complex. Each time you just need to find the first occurrence of each string between the '*'s, so you can just run KMP upon each of them (taking O(n) time), and scan thru the document, it should only take O(n+m) time.
I'm now convinced I've misunderstood. Please give some more explanation :(
I see what you are saying. red coders are genius!
eg. seg1*seg2*seg3, I spent weeks thinking about AC automation.
KMP and incrementally build next array for strings.
Auto comment: topic has been updated by cardcounter (previous revision, new revision, compare).
Auto comment: topic has been updated by cardcounter (previous revision, new revision, compare).