I was trying to find some formal analysis on the error rate of Zobrist Hashing (used in 1977D - XORificator, and there's a tutorial in XOR Hashing [TUTORIAL]).
But I couldn't find any. (If you find any, feel free to post it here.)
Luckily, I managed to derive some upper bound on the error rate. I write it here and you can tell me if there's any problem in my derivation.
The problem statement:
You are given a set of $$$m$$$-bit integers $$$A=[a_1,a_2,...,a_N]$$$, and $$$n$$$ sets $$$[S_1,S_2,...,S_n]$$$ , each of which is a subset of $$$A$$$, and for every $$$1\le i < j \le n$$$, $$$S_i\ne S_j$$$. Denote $$$f(S_i)=\bigoplus\limits_{x\in S_i}x$$$, which is the xor sum of elements in $$$S_i$$$. Now, if I randomly assign each $$$a_1,a_2,...,a_N$$$ in $$$A$$$, what is the (estimated upper bound) probability that there exists some $$$1\le i < j \le n$$$ such that $$$f(S_i)=f(S_j)$$$?
The analysis:
One observation is that $$$f(S_i)=f(S_j)$$$ is true if and only if $$$f(S_i\cup S_j-S_i\cap S_j) = 0$$$. So let's denote $$$T_{ij}=S_i\cup S_j-S_i\cap S_j$$$. Now we want to estimate the probability that there's some $$$T_{ij}$$$, satisfying $$$f(T_{ij})=0$$$. For some specific $$$i$$$ and $$$j$$$, it is obvious that $$$P(f(T_{ij})=0)=\frac{1}{2^m}$$$ because elements in $$$A$$$ are randomly generated. However, there are $$$\frac{n(n-1)}{2}$$$ pairs of $$$(i,j)$$$, and all these $$$T_{ij}$$$ are actually not independent. For example, if $$$m=1$$$ (all integers in $$$A$$$ are 1-bit), and $$$A=[a_1,a_2],S_1=[a_1],S_2=[a_2],S_3=[a_1,a_2]$$$. Then no matter how you assign $$$a_1$$$ and $$$a_2$$$, for $$$f(T_{1,2})=a_1\oplus a_2,f(T_{2,3})=a_1,f(T_{1,3})=a_2$$$, there must be at least one $$$0$$$. So you can not treat the $$$f(T_{ij})$$$ as independent random variables for different $$$ij$$$.
The linearity property of expectation is used in my derivation. Let's first derive the expectation of how many pairs of $$$1\le i<j\le n$$$ satisfying $$$f(T_{ij})=0$$$ as
$$$\mathbf{E}\sum\limits_{i,j}[f(T_{ij})=0] = \sum\limits_{i,j}\mathbf{E}[f(T_{ij})=0] = \sum\limits_{i,j}P(f(T_{ij})=0) = \sum\limits_{i,j}\frac{1}{2^m} = \frac{n(n-1)}{2^{m+1}}$$$
The expectation of how many pairs of $$$1\le i<j\le n$$$ satisfying $$$f(T_{ij})=0$$$ can also be written as $$$P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{n(n-1)/2}$$$ , where $$$P_k$$$ is the probability that there're exactly $$$k$$$ pairs of $$$1\le i<j\le n$$$ satisfying $$$f(T_{ij})=0$$$.
So, $$$P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{\frac{n(n-1)}{2}}=\frac{n(n-1)}{2^{m+1}}$$$.
So, the probability that there exists some $$$1\le i < j \le n$$$ such that $$$f(S_i)=f(S_j)$$$ is
$$$P_1 + P_2 + ... + P_k + ... + P_{n(n-1)/2} < P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{n(n-1)/2} = \frac{n(n-1)}{2^{m+1}}$$$.
If $$$n=3\times 10^5$$$ and $$$m=64$$$, the above upper bound is $$$2.4\times 10^{-9}$$$.
I'm glad to know if you find something wrong. :)
Thanks ,good job how much time write the blog its tooo tall but thanks again
Awesome! I was wondering about this because I had to use 128 bits to make it pass during the contest (got WA with 64 bits).
Given this bound, I guess it was due to using a poor PRNG not close enough to true random (I get 64 bits from two calls to rand(), which is a linear congruential generator AFAIK)
In my 128 bit submission I used mt19337.
Doesn't rand() return a 16-bit number, maximally 32k ? 64 bits should have been enough if you used 4 calls to rand().
It does that on Codeforces? I didn't know. It returns a 32-bit number on my computer :(.
Blog post by neal: link
Thanks!
You can actually simplify the analysis quite a bit with the union bound "trick".
For any collection of events $$$A_1, A_2, A_3, \ldots, A_n$$$, we have the extremely simple upper bound on their union:
(Can be proven using induction and the two set version of PIE, but it is intuitively obvious)
Applying the union bound to the $$$T_{i,j}$$$'s gives you the result instantly.
Also in order to see that $$$\mathbb{P}(f(T_{i,j})=0) = \tfrac{1}{2^m}$$$, assume that $$$T_{i,j}$$$ is non-empty, let $$$a_k\in T_{i,j}$$$. Then, for any value that the xor of the other elements takes, there is exactly one value that $$$a_k$$$ can take which makes the whole hash 0. This argument can be formalized by using conditional probabilities.