Hashing Issues in Python

#	User	Rating
1	tourist	3856
2	jiangly	3747
3	orzdevinwang	3706
4	jqdai0815	3682
5	ksun48	3591
6	gamegame	3477
7	Benq	3468
8	Radewoosh	3462
9	ecnerwala	3451
10	heuristica	3431

#	User	Contrib.
1	cry	167
2	-is-this-fft-	162
3	Dominater069	160
4	Um_nik	158
5	atcoder_official	157
6	Qingyu	156
7	djm03178	152
7	adamant	152
9	luogu_official	151
10	awoo	147

Recently, I was attempting the problem 1692H - Gambling. The details of the problem are not important for this post, it is just sufficient to know that I was making use of a dictionary to store some data.

I ran into a TLE verdict (2000ms), which I found strange, since my solution was linear as far as I could tell. After a ton of debugging, I was able to deduce that the following lines

slow code

current = max(1, d[i][1] - idx + d[i][2] + 2)
d[i] = [max(d[i][0], current), current, idx]

inside a for loop were being incredibly slow (again, what these lines do isn't important, only note that there are a few dictionary accesses). Around 20k iterations were taking about a second to execute.

I suspected that there might be too many hash collisions, so I tried multiplying i by 10^20, and later dividing it by 10^20, and this solution passed.

163665232 — TLE Submission

163664864 — AC Submission

Now in Python, for upto fairly large numbers (~10^18), hash(x) = x, which is why I was multiplying by 10^20. However, even for larger numbers, as far as I can tell, hash(x+1) = hash(x)+1. In this case, I thought the distribution of the hashes should be pretty similar with and without multiplying by 10^20.

However, this is speculation, and I could be wrong. Does anyone know a little more about this topic, and how I can avoid facing these issues in the future?

Comments (4)

Write comment?

freehandle

3 years ago, # |

← Rev. 2 →

I'd make a wrapper class of integer and implement your own hash and eq.

→ Reply

3 years ago, # ^ |

I don't know why double underbars are replaced.

smax

https://codeforces.net/blog/entry/101817

In short, Python hashing method is deterministic, and hackers specifically put in tests that force hash collisions for that problem. When you multiply by $$$10^{20}$$$, you change the malicious input values to different ones that don't cause a hash collision. Of course, the constant you use should be random, otherwise you don't address the underlying issue as a strong enough adversary could still figure out what values when multiplied by $$$10^{20}$$$ collide.

sahilrox

Ahh, I see. Thanks!

sahilrox's blog