Please help me with a string search problem

→ Pay attention

Before contest
Codeforces Round 1006 (Div. 3)
2 days
Register now »

→ Top rated

#	User	Rating
1	tourist	3856
2	jiangly	3747
3	orzdevinwang	3706
4	jqdai0815	3682
5	ksun48	3591
6	gamegame	3477
7	Benq	3468
8	Radewoosh	3462
9	ecnerwala	3451
10	heuristica	3431

Countries | Cities | Organizations

View all →

→ Top contributors

#	User	Contrib.
1	cry	167
2	-is-this-fft-	162
3	Dominater069	160
4	Um_nik	158
5	atcoder_official	157
6	Qingyu	156
7	adamant	151
7	djm03178	151
7	luogu_official	151
10	awoo	146

View all →

→ Find user

→ Recent actions

Detailed →

Koyote's blog

Please help me with a string search problem

By Koyote, history, 20 months ago, In English

I encountered this problem at a past local competition. Here is the brief statement:

Given two string $$$s$$$ of length $$$n$$$ and t of length $$$m$$$ ($$$n$$$, $$$m$$$ $$$\le$$$ $$$10^{5}$$$) find the longest, lexicographically smallest substring of $$$s$$$ that appears in string $$$t$$$ at least $$$k$$$ times. The input is given in a way that there is at least one substring satisfies the constraints.

I have tried to find answers on Google and the best I could find is this problem from HackerEarth . The suggested solutions were either hashing with binary search, or suffix tree. And I am not knowledgeable enough to transform the problem into the original problem. So far the best I could have done is to iterate every substring of $$$s$$$ in $$$O(n^2)$$$, then use string search algorithm (KMP/Z-function) to count number of occurrences, yielding total complexity of $$$O(n^2*(n+m))$$$. Please give hints to an approach for this problem.

string, hard, suffix tree, hashing

Koyote
20 months ago
9

Comments (9)

Write comment?

Koyote

20 months ago, # |

I forgot. The first line of input contains 3 positive integers $$$n$$$, $$$m$$$, and $$$k$$$. The second line contains the string $$$s$$$ and the third line contains the string $$$t$$$. Here is an example test case:

Input:

7 19 3
GATTACA
TACATTACGCATTACACAT

Output:

ACA

Explanation: TAC and ACA are the two longest substring of $$$s$$$ that satisfy, and ACA is lexicographically smaller than TAC.

→ Reply

tminh_hk20

16 months ago, # ^ |

this is my code about this problem:

#include<bits/stdc++.h>
using namespace std;
#define int long long
#define ii pair<int,int>
#define fi first
#define se second
const int N=1e5+10,MOD=1e9+7,base=311;
int n,m,k,has1[N],has2[N],p[N];
int l,r,mi;
string s,t;
ii res;

int get1(int i, int j){
    return (has1[j]-has1[i-1]*p[j-i+1]+MOD*MOD)%MOD;
}

int get2(int i, int j){
    return (has2[j]-has2[i-1]*p[j-i+1]+MOD*MOD)%MOD;
}

bool compound(int a, int b, int n){
    if (b==-1) return true;
    int l =0,r=n-1,res=-1;

    while(l<=r){

        int m = (l+r)/2;

        if (get2(a,a+m)==get2(b,b+m)){
            l = m+1;
            res=m;
        }else{
            r =m-1;
        }

    }

    if (res==n-1) return true;
    return (t[a+res+1]<t[b+res+1]);
}

signed main(){

    ios_base::sync_with_stdio(0);
    cin.tie(0); cout.tie(0);

    freopen("GATTACA.INP","r",stdin);
    freopen("GATTACA.OUT","w",stdout);

    cin>>n>>m>>k;
    cin>>s>>t;

    s =" "+s;
    t =" "+t;

    p[0]=1;
    for (int i=1;i<=max(n,m)+1;i++){
        p[i]=(p[i-1]*base)%MOD;
    }

    has1[0]=0;
    for (int i=1;i<=n;i++){
        has1[i]=(has1[i-1]*base+s[i])%MOD;
    }

    has2[0]=0;
    for (int i=1;i<=m;i++){
        has2[i]=(has2[i-1]*base+t[i])%MOD;
    }

    l=1;
    r=1;
    bool div=0;
    res={-1,-1};

    while(l<=r){
        //cout <<l<<" "<<r<<endl;
        if (div)
            mi = (l+r)/2;
        else
            mi = r;

        bool check=0;
        unordered_map<int,int> mp,ap;

        for (int i=1;i<=n-mi+1;i++){
            ap[get1(i,i+mi-1)]++;
        }

        for (int i=1;i<=m-mi+1;i++){
            int tmp=get2(i,i+mi-1);
            mp[tmp]++;

            if (ap[tmp] && mp[tmp]>=k){
                check=1;
                if (mi<res.se-res.fi+1) continue;

                if (mi>res.se-res.fi+1){
                    res={i,i+mi-1};
                    continue;
                }

                if (compound(i,res.fi,mi)){
                    res = {i,i+mi-1};
                }
            }

        }

        if (div){
            if (check)
                l = mi+1;
            else
                r = mi-1;
        }else{
            if (check){
                l =r;
            }else{
                div=1;
            }
            r*=2;
        }


    }

    for (int i=res.fi;i<=res.se;i++)
        cout <<t[i];



}

→ Reply

ayhan37

20 months ago, # |

← Rev. 2 →

I have found O((M+N)*LOG_M*LOG_N) solution using binary search on the length of substring + suffix array for sorting substrings of length X. We also need to use hashing for counting occurences.

I don't know if there is any better solution than this one.

→ Reply

Koyote

20 months ago, # ^ |

That is good enough, please explain more.

→ Reply

ayhan37

20 months ago, # ^ |

We can use binary search on the answer technique. We fix the length of substring. Let's call the mid value X. We can use binary search because suppose that the answer is A(length of the substring). Then, an answer with length (A-1) must exist because you can pop the last character of the valid substring.

We use hashing to calculate the number of occurences (on string t with substring length X). This part is O(MlogM) (due to std::map). Finally , to find the lexicographically minimum substring we need to use suffix array to sort substring of length X. You can learn suffix array and it's applications from CP-ALGO. Also sorry for my poor english.

→ Reply

Koyote

20 months ago, # ^ |

No problem. Thank you for your idea, I'll try to implement it.

→ Reply

ayhan37

20 months ago, # ^ |

Your welcome :)

→ Reply

bicsi

20 months ago, # |

← Rev. 2 →

Translation: find a string of maximal length that has at least one occurrence in $$$s$$$ and at least $$$k$$$ occurrences in $$$t$$$. One way to do that is:

Build a suffix automaton of $$$s\$t$$$
For each node in automaton compute the number of occurrences in $$$s$$$ and in $$$t$$$
Out of all the valid nodes, choose the min lex one

A similar solution can be done with suffix arrays and two pointers technique:

Sort the suffixes of $$$s\$t$$$
For each $$$i \leq n + m$$$ find the smallest $$$j$$$ such that there is at least one index from $$$s$$$ and $$$k$$$ indices from $$$t$$$ in the range $$$sa[i..j]$$$ (via two pointers)
Update the solution if $$$lcp[i..j]$$$ is strictly better than the current best. This makes sure to pick the lex min solution.

→ Reply

Koyote

20 months ago, # ^ |

Thank you so much. That is so clever.

→ Reply