cardcounter's blog

By cardcounter, history, 12 days ago, In English

I am thinking about an effient algorithm for wildcard searching with * representing any characters with any length.

aa*c, she*he, *she*he

Example:

  1. caa find aa*c

    return "DOES NOT EXIST"

  2. hesherheshe find she*he

    return 2

    Because sherheshe begins at index 2

  3. sherheshe find she*he

    return 0

    Because the whole string matches

When I am supposed to return, say, the beginning index of the first matching instance.

Say the pattern is of length M and the document is of length N, and the pattern has K '*' signs. I can think of a solution that first uses AC Automation to find all occurences of each chunk in O(N + M), with bitmask.

While converting the bitmask to indexes takes O(N * K)

Then binary search for the last possible beginning positions for each chunk. This could take O (K log N)

So the overall time complexity is still O (N * K), any way to do better?

References

https://codeforces.net/blog/entry/111380

https://codeforces.net/problemset/problem/1023/A

https://codeforces.net/blog/entry/127169

  • Vote: I like it
  • 0
  • Vote: I do not like it

»
12 days ago, # |
  Vote: I like it +8 Vote: I do not like it

I'm not acquainted with regex expressions, so the following might be a misinterpretation of your problem, but it seems that what you are asking is essentially to find a series of non-intersecting occurrences of several strings. We should now seek a way to find all occurrences of a certain string, and find the first one past a certain threshold (determined by the last string's position). This can be accomplished by the use of a Suffix Automaton, which turns the problem into online queries about the lower bound upon the values within a node's subtree (Since the fail tree/suffix tree's subtree values represent the endpos set for a string). This can be done by a persistent segment tree and binary search upon its structure, resulting in a time and memory complexity of O(nlogn).

  • »
    »
    12 days ago, # ^ |
      Vote: I like it +8 Vote: I do not like it

    Wait wait wait, if my interpretation is correct, there is no need to be so complex. Each time you just need to find the first occurrence of each string between the '*'s, so you can just run KMP upon each of them (taking O(n) time), and scan thru the document, it should only take O(n+m) time.

    I'm now convinced I've misunderstood. Please give some more explanation :(

    • »
      »
      »
      12 days ago, # ^ |
      Rev. 3   Vote: I like it 0 Vote: I do not like it

      I see what you are saying. red coders are genius!

      eg. seg1*seg2*seg3, I spent weeks thinking about AC automation.

      KMP and incrementally build next array for strings.

»
12 days ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by cardcounter (previous revision, new revision, compare).

»
12 days ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by cardcounter (previous revision, new revision, compare).