Hi, I am trying to check if an array (consisting of the numbers from 1 to N, where N is the length of the array) is a permutation of N elements using hashing. My idea is to assign a randomized value as a hash for each value between 1 and N. The hash value of an array would be them sum of these individual hashes, and I compare this sum to the sum of the hashes for a permutation of length N.
I am wondering, what is the probability of collisions using this type of hashing? I couldn't find much online other than the probability for rolling hashes. Thank you!
Why not use xor-sum instead of normal sum? The probability is easier to bound this way. Assume that the hashes that you generate are
h[1..n]
. Suppose that your hashes retain $$$B$$$ bits of information and that they are randomly chosen from an uniform distribution:0 <= h[i] < 2**B
,P(h[i] = X | 0 <= X < 2**B) = 1 / 2**B
.Notice that this means that
P(the jth bit of h[i] = 0) = P(the jth bit of h[i] = 1) = 1/2
for any1 <= i <= n, 0 <= j < B
.Suppose that the array we are checking is
v[1..n]
and we know that1 <= v[i] <= n
.The correct answer for a permutation would be
h_ok = h[1] ^ h[2] ^ .. ^ h[n]
. Suppose thatv[1..n]
is not a permutation and we want to calculate the probability thath[v[1]] ^ h[v[2]] ^ .. ^ h[v[n]] = h_ok
.By moving all terms on one side, we would be left with an equality like
{0, 1} * h[1] ^ {0, 1} * h[2] ^ .. ^ {0, 1} * h[n] = 0
. (as inh[1]
either once or none ...)For example, if
v[1..7] = [3 4 5 3 3 2 7]
, we would have the equalityh[3] ^ h[3] ^ h[1] ^ h[6] = 0 <=> h[1] ^ h[6] = 0
.We will prove that the number of terms that we have on the left hand side doesn't matter and the probability of equality is $$$1/2^B$$$.
Let's prove first that the probability of the LSB to be equal to $$$0$$$ is $$$1/2$$$ no matter how many terms we have on the LHS.
If we have $$$1 \leq k \leq n$$$ terms on the left hand side, then the result is $$$0$$$ if either $$$0, 2, 4, ... $$$ out of the $$$k$$$ bits are $$$1$$$:
$$$P(LHS = 0) = \frac{{k \choose 0} + {k \choose 2} + {k \choose 4} + ...}{2^k} = \frac{2^{k-1}}{2^k} = \frac{1}{2}$$$.
Now, since the hashes have been uniformly chosen in $$$[0 .. 2^B)$$$, the bits' values are independent, so the probability for the LHS to be $$$0$$$ taking in consideration all of the bits is $$$(1/2) ^ B = \frac{1}{2 ^ B}$$$.
So if you use $$$B = 64$$$, the collision probability is $$$< 10^{-19}$$$.