Need help in Hashing with overflow technique

helpme's blog

By helpme, 9 years ago, In English

Hello everyone,

Recently I read a tutorial on hashing and learned that, after some preprocessing, we can find the hash of a substring in O(1) using a modular inverse. But I was surprised to see that some people do not use modular arithmetic at all! For example, see this submission: 845264. How is it computing the hash? Is it relying on integer overflow? Moreover, it does not use any modular inverse to find the hash of a substring, e.g.

inline int hh(int x,int y)
{
	return a[y]-a[x-1]*p[y-x+1];
}

I can't understand how it actually works. It would be great if any of you could explain this or suggest some online resources about this.

Thanks in advance!

Comments (7)

klamathix

9 years ago, # |

Yes, that code seems to be using integer overflow. However, this is generally not a good idea, as illustrated here: http://codeforces.net/blog/entry/4898

yeputons

9 years ago, # |

I'm going to answer your first question first.

Your guess is correct: it uses integer overflow. Basically, it's just computing hashes modulo 2³², if we're talking about int. The first problem with that was already pointed out: it's easy to construct a test that breaks all polynomial hashes modulo 2ⁿ. You're unlikely to hit that "in real life", though, or even in some specific problems (say, if you're hashing a graph, I doubt the authors will try to break each and every way of hashing the graph).
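To make the "easy to construct a test" claim concrete, here is a minimal sketch of one well-known construction: a Thue-Morse string and its letter-swapped copy. For a string of length 2^k, the difference of the two hashes factors into (1 - p)(1 - p²)(1 - p⁴)..., and for an odd base p every factor is even, so for large enough k the difference is divisible by 2⁶⁴. The names wrap_hash and thue_morse are mine, and I'm only assuming this is the kind of test meant above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Polynomial hash, rightmost character least significant.
// std::uint64_t arithmetic wraps around, so this is a hash modulo 2^64.
std::uint64_t wrap_hash(const std::string& s, std::uint64_t p) {
    std::uint64_t h = 0;
    for (unsigned char c : s) h = h * p + c;
    return h;
}

// Thue-Morse string of length 2^k over {'a', 'b'}: position i holds 'b'
// when i has an odd number of set bits. flip = true swaps the two letters.
std::string thue_morse(int k, bool flip) {
    assert(k >= 0 && k < 30);
    std::string s(std::size_t(1) << k, 'a');
    for (std::size_t i = 0; i < s.size(); ++i) {
        bool odd = false;
        for (std::size_t x = i; x != 0; x >>= 1) {
            if (x & 1) odd = !odd;  // parity of the set bits of i
        }
        if (odd != flip) s[i] = 'b';
    }
    return s;
}
```

With k = 11 (length 2048), thue_morse(11, false) and thue_morse(11, true) are different strings, yet wrap_hash gives the same value for both with any odd base. Even bases are weak for a different reason: the leading characters of a long string stop affecting the hash at all.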

Another problem is that signed integer overflow is undefined behaviour in C++. That is, if you calculate int a = 1e5, b = 1e6, c = a * b;, the compiler is free to do anything it wants. If you're lucky, it will behave like wrap-around overflow. In other cases (on other compilers, platforms and dates) the program can theoretically wipe your HDD and install spyware (although I can't imagine any compiler really doing this). So one should not rely on overflow in signed types. Unsigned types are fine: in int c = (unsigned int)a * (unsigned int)b we convert the values to an unsigned type first, multiply them (and get the expected wrap-around), then convert back by assigning to c. I would just typedef unsigned int hash; and calculate everything I need in the newly declared type.
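A minimal sketch of that advice, assuming a 32-bit unsigned type. I named the typedef Hash and pinned it to std::uint32_t rather than plain unsigned int so the modulus is exactly 2³²; mul_wrap and poly_hash are illustrative names, not taken from the submission.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// All hash arithmetic happens in an unsigned type, where overflow is
// defined to wrap modulo 2^32 (assuming int is at most 32 bits wide,
// so no promotion to a signed type takes place).
typedef std::uint32_t Hash;

// Multiply two ints modulo 2^32 without ever overflowing a signed type.
Hash mul_wrap(int a, int b) {
    return static_cast<Hash>(a) * static_cast<Hash>(b);
}

// Polynomial hash of s with base p, computed entirely in Hash.
Hash poly_hash(const std::string& s, Hash p) {
    Hash h = 0;
    for (unsigned char c : s) h = h * p + c;
    return h;
}
```

For example, mul_wrap(100000, 1000000) is 10¹¹ reduced modulo 2³², which is 1215752192, and poly_hash("ab", 31) is 97·31 + 98 = 3105.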

yeputons

9 years ago, # |

About modular inverse: you don't need it at all if you write polynomial hashes wisely, even if the modulus is prime, etc.

I assume that you've seen the following pattern: if we have a string $a_0 a_1 \dots a_{n-1}$, then we consider $a_0 + a_1 p + a_2 p^2 + \dots + a_{n-1} p^{n-1}$ as its polynomial hash. Afterwards we can compute hashes for all prefixes of a given string (let's say the hash of $a_0 a_1 \dots a_k$ is $S_k$), and if we ever need to calculate the hash of $a_l a_{l+1} \dots a_r$, we take $(S_r - S_{l-1}) / p^l$ as the answer (division here is basically modular inverse).
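For comparison, here is a rough sketch of that inverse-based pattern. The modulus 10⁹ + 7, the base 31 and the names power_mod and InverseHasher are my assumptions for illustration. The inverse of p comes from Fermat's little theorem, which needs the modulus to be prime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

const std::uint64_t MOD = 1000000007ULL;  // assumed prime modulus
const std::uint64_t P = 31;               // assumed base

// b^e modulo MOD by binary exponentiation.
std::uint64_t power_mod(std::uint64_t b, std::uint64_t e) {
    std::uint64_t r = 1;
    b %= MOD;
    while (e > 0) {
        if (e & 1) r = r * b % MOD;
        b = b * b % MOD;
        e >>= 1;
    }
    return r;
}

struct InverseHasher {
    std::vector<std::uint64_t> pref;     // pref[k] = S_k = a_0 + a_1 p + ... + a_k p^k
    std::vector<std::uint64_t> inv_pow;  // inv_pow[k] = p^(-k) modulo MOD

    explicit InverseHasher(const std::string& s) {
        const std::uint64_t inv_p = power_mod(P, MOD - 2);  // Fermat inverse
        std::uint64_t pw = 1, ipw = 1, h = 0;
        for (unsigned char c : s) {
            h = (h + c * pw) % MOD;
            pref.push_back(h);
            inv_pow.push_back(ipw);
            pw = pw * P % MOD;
            ipw = ipw * inv_p % MOD;
        }
    }

    // Hash of a_l ... a_r (inclusive): (S_r - S_{l-1}) / p^l.
    std::uint64_t substring(std::size_t l, std::size_t r) const {
        assert(l <= r && r < pref.size());
        const std::uint64_t left = (l == 0) ? 0 : pref[l - 1];
        return (pref[r] + MOD - left) % MOD * inv_pow[l] % MOD;
    }
};
```

On "abcabc", substring(0, 2) and substring(3, 5) both give the hash of "abc", at the cost of precomputing inverse powers.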

However, there is another approach, which I like much more. We make the rightmost characters of the string least significant, that is, the hash of $a_0 a_1 \dots a_{n-1}$ will now be $a_0 p^{n-1} + a_1 p^{n-2} + \dots + a_{n-2} p + a_{n-1}$. Appending a character to a string $S$ is still easy; moreover, we don't need to know the length of $S$: $\mathrm{Hash}(Sa) = \mathrm{Hash}(S) \cdot p + a$.

Things get even better when we're talking about hashing substrings. If $S_k$ is the hash of the prefix $a_0 a_1 \dots a_k$ (and $S_{-1} = 0$), then the hash of $a_L a_{L+1} \dots a_R$ is $S_R - S_{L-1} \cdot p^{R-L+1}$. That is, we take the hash of the prefix ending at $a_R$, then "cut out" the leftmost (that is, most significant) characters (which we know to be the first $L$ of the initial string), and we have the answer immediately without modular inversion.
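A minimal sketch of the inverse-free approach, using unsigned 64-bit wrap-around as the modulus. The name PrefixHasher and the base parameter are mine. To avoid a negative index, the array is shifted by one: pref[k] holds the hash of the first k characters, so pref[0] = 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct PrefixHasher {
    std::vector<std::uint64_t> pref;  // pref[k] = hash of the first k characters
    std::vector<std::uint64_t> pw;    // pw[k] = p^k, wrapping modulo 2^64

    PrefixHasher(const std::string& s, std::uint64_t p)
        : pref(s.size() + 1, 0), pw(s.size() + 1, 1) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            // Hash(Sa) = Hash(S) * p + a
            pref[i + 1] = pref[i] * p + static_cast<unsigned char>(s[i]);
            pw[i + 1] = pw[i] * p;
        }
    }

    // Hash of s[l..r] (inclusive, 0-based): take the prefix ending at r and
    // cut out the first l characters, shifted left by the substring length.
    std::uint64_t substring(std::size_t l, std::size_t r) const {
        assert(l <= r && r + 1 < pref.size());
        return pref[r + 1] - pref[l] * pw[r - l + 1];
    }
};
```

This has the same shape as a[y]-a[x-1]*p[y-x+1] from the question, just with 0-based substring indices. The same code works with a prime modulus if you reduce after every operation, which avoids the weakness of hashing modulo a power of two.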

helpme

9 years ago, # ^ |

You really helped me a lot, Thank you! :)

Btw, in this line: "if we ever need to calculate hash of $a_l a_{l+1} \dots a_r$, we take " -- I think it should be $(S_r - S_{l-1}) / p^l$, shouldn't it?

yeputons

9 years ago, # ^ |

That's correct. Fixed.

apadeh

9 years ago, # ^ |

Hi, could you explain the part

Things get even better when we're talking about hashing substrings. If you have hashes for prefixes of lengths L and R, then answer is $S_R - S_L \cdot p^{R-L}$

As I understand it, if we want to hash the substring $a_L a_{L+1} a_{L+2}$, we can just take $(S_{L+2} - S_L) / p^L$, where $S_L$ is the hash of the prefix of length $L$. I don't understand why you multiply by $p^{R-L}$.

yeputons

9 years ago, # ^ |

Let's break it down. Assume that we have the string a₀a₁a₂a₃a₄a₅ (that is, n = 6). Then:

S₀ = a₀
S₁ = a₀p + a₁
S₂ = a₀p² + a₁p + a₂
S₃ = a₀p³ + a₁p² + a₂p + a₃
S₄ = a₀p⁴ + a₁p³ + a₂p² + a₃p + a₄
S₅ = a₀p⁵ + a₁p⁴ + a₂p³ + a₃p² + a₄p + a₅

We want to get the hash of a₂a₃a₄, that is, L = 2, R = 4. It should have the form a₂p² + a₃p + a₄. First, we take S₄ as a starting point:

S₄ = a₀p⁴ + a₁p³ + a₂p² + a₃p + a₄

We see that it's almost what we need, with a few extra terms: a₀p⁴ + a₁p³. These terms are the hash of the string a₀a₁000 (I just took the prefix of length five and replaced the last three letters with zeros). That is, it's the hash of a₀a₁ multiplied by p³, i.e. S₁p³ = (a₀p + a₁)p³. So, if we subtract the two:

S₄ - S₁p³ = a₀p⁴ + a₁p³ + a₂p² + a₃p + a₄ - a₀p⁴ - a₁p³
S₄ - S₁p³ = a₂p² + a₃p + a₄

We get exactly what we wanted.

Does that make more sense now? Please note that there are other approaches with different formulas.
