purplesyringa's blog

By purplesyringa, history, 3 years ago, In English

I'd like to share this one unpopular but awesome algorithm that I only heard a mention of once. Kapun's algorithm finds a hash collision for moduli as large as $$$10^{18}$$$. I know, we have lots of algorithms that do much better than that (and I'm writing an article on that at the moment, keep tuned), but Kapun's algorithm is really surprising in its simplicity. That its correctness is so difficult to prove is of more surprise even.

Here we go.

A polynomial hash of a string $$$s$$$ is defined as

$$$ H(s) = s_0 b^0 + s_1 b^1 + s_2 b^2 + \dots + s_{n-1} b^{n-1} \pmod M. $$$

We want to find two strings with the same polynomial hash using only two characters: $$$0$$$ and $$$1$$$.

We can reformulate this problem in another way. Let $$$a_i = b^i \mod M$$$, then we want to find two distinct subsets of $$$a$$$ with the same sum modulo $$$M$$$. Now forget about the modulus: let's just find two subsets with the same sum.

Firstly, when is this possible? There are $$$2^n$$$ possible subsets and $$$n(M - 1) - n + 1 = n (M - 2) + 1$$$ distinct sums. If $$$2^n > n (M - 2) + 1$$$, that is, $$$\frac{2^n}{n} \gg M$$$, there are certainly two subsets with the same sum. Generating $$$M$$$ or even just $$$\sqrt{M}$$$ strings (if you use birthday paradox) is impossible for large $$$M$$$, but there's still a way through.

It turns out there is a deterministic algorithm that attempts to solve this problem that is very easy to implement!

Algorithm

Suppose we have a list of $$$n$$$ integers in range $$$[1; R]$$$ each: $$$[a_1, a_2, \dots, a_n]$$$. Kapun's algorithm attempts to find two subsets of these integers with equal sums.

The algorithm is as follows. Instead of finding two subsets, we will find an expression of kind $$$\pm a_{i_1} \pm a_{i_2} \dots \pm a_{i_k} = 0$$$ for some $$$i$$$ and some signs. It's easy to see that this is equivalent to finding two subsets with the same sum. Initially, we have $$$n$$$ trivial expressions $$$a_1, a_2, \dots, a_n$$$. The algorithm is iterative; at each step, we do the following:

  • If we have an expression with value $$$0$$$ at the moment, return this expression.
  • Otherwise, sort the expressions by their values.
  • Replace the current expressions $$$[b_1, b_2, \dots, b_k]$$$ by $$$[b_2 - b_1, b_4 - b_3, b_6 - b_5, \dots]$$$, that is, the expressions are divided into pairs and each pair is replaced by its difference. If $$$k$$$ is odd, $$$b_k$$$ is simply dropped.
  • Repeat from the beginning.

If we end up with $$$b = [x]$$$ and $$$x > 0$$$, the algorithm has failed.

Theorem

Kapun's algorithm necessarily succeeds if $$$R > \alpha + n^{\beta \ln n}$$$, where $$$\alpha, \beta$$$ are some small constants.

Proof. It's obvious that if there exists a counterexample for a particular $$$(n, R)$$$ tuple, it also exists for $$$(n, R + 1)$$$. Therefore we can define $$$f(n)$$$ as the smallest $$$R$$$ for which a counterexample exists for specific $$$n$$$.

Also, let's denote by operator $$$\sigma a$$$ the sorted array $$$a$$$, and by $$$\Delta a$$$ the neighbour-difference array of $$$\sigma a$$$. For instance, $$$\Delta [10, 5, 9, 7] = \Delta [5, 7, 9, 10] = [2, 1]$$$, and $$$\sigma [5, 1, 2] = [1, 2, 5]$$$.

For a particular $$$n$$$, let us consider the smallest countertest, the "smallest" meaning the one with the smallest $$$R$$$. Let's say it's some $$$[a_1, a_2, \dots, a_n]$$$. The algorithm sorts this array, so without loss of generality assume $$$a$$$ is sorted.

As long as $$$a_1 > 0$$$, the subsequent behavior of the algorithm only depends on $$$\Delta a$$$, right? So we can subtract $$$a_1 - 1$$$ from all elements of $$$a$$$, because this operation does not affect the difference array, $$$a_1$$$ stays positive and $$$R = a_n$$$ decreases. Therefore, in the optimal countertest $$$a_1 = 1$$$.

Notice that an array of kind $$$[1, a_2, a_3, a_4, \dots]$$$ has the same difference array as an array of kind $$$[1, a_2, a_2, a_4 - (a_3 - a_2), \dots]$$$, so in the optimal array $$$a_2 = a_3, a_4 = a_5$$$, and so on. Therefore the optimal countertest is of kind $$$[1, x, x, y, y, z, z, \dots]$$$.

Let us denote $$$a' = \Delta a$$$. Then $$$a = [1, 1 + a'_1, 1 + a'_1, 1 + a'_1 + a'_2, 1 + a'_1 + a'_2, \dots]$$$. In particular, $$$a_n = 1 + a'_1 + a'_2 + \dots$$$. Notice that the mapping between $$$a'$$$ and the optimal $$$a$$$ is a bijection! This means that instead of searching for $$$a$$$ of size $$$n$$$ with minimal $$$a_n = R$$$, we can forget about $$$a$$$ and search for $$$a'$$$ of size $$$n' = \lfloor \frac{n}{2} \rfloor$$$ with the minimal sum of $$$a'$$$.

What can we say about the optimal $$$a'$$$? We only care about its sum, so without loss of generality assume it's sorted. As long as $$$a'_1 > 0$$$, the behavior of the algorithm only depends on $$$\Delta a'$$$, so we can subtract $$$a'_1 - 1$$$ from all elements of $$$a'$$$ because the sum of the elements (non-strictly) decreases. Therefore $$$a'_1 = 1$$$. We can also assume the array $$$a'_2 = a'_3, a'_4 = a'_5$$$, and so on, because subtracting a constant from a suffix of the array decreases the sum. Therefore the optimal array is of kind $$$[1, x, x, y, y, z, z, \dots]$$$.

Let us denote $$$a{''} = \Delta a'$$$. Then $$$a' = [1, 1 + a{''}_1, 1 + a{''}_1, 1 + a{''}_1 + a{''}_2, 1 + a{''}_1 + a{''}_2, \dots]$$$. The sum of $$$a'$$$ is some weighted sum of $$$a{''}$$$. Notice that the mapping between $$$a{''}$$$ and the optimal $$$a'$$$ is a bijection, so we can search for $$$a{''}$$$ of size $$$n{''} = \lfloor \frac{n'}{2} \rfloor$$$ with a particular weighted sum of $$$a'$$$ minimized.

You get the drill. Every time we have some weight function $$$\mu(a) = \lambda_1 a_1 + \lambda_2 a_2 + \dots$$$, and we want to find $$$a$$$ with the minimal weight. $$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$$$ holds for all reductions except the first one (when $$$\lambda = [0, 0, \dots, 0, 1]$$$), so due to the rearrangement inequality the optimal $$$a$$$ is sorted. Subtracting $$$a_1 - 1$$$ from all elements of $$$a$$$ decreases $$$\mu(a)$$$ because $$$\lambda_i > 0$$$. Subtracting a constant from a suffix of $$$a$$$ reduces $$$\mu(a)$$$ for the same reason, so the optimal $$$a$$$ is of kind

$$$ [1, x, x, y, y, z, z, \dots] = [1, 1 + a'_1, 1 + a'_1, 1 + a'_1 + a'_2, 1 + a'_1 + a'_2, \dots]. $$$

Minimizing $$$\mu(a)$$$ is equivalent to minimizing $$$\mu'(a')$$$, where

$$$ \mu'(a') = \sum_{i=1}^n \left( \lambda_i \sum_{j=1}^{\lfloor i/2 \rfloor} a'_j \right) = \sum_{j=1}^{n'} \left( a'_j \cdot \sum_{i=2j}^n \lambda_i \right). $$$

It's easy to see from this formula that $$$\lambda'$$$ is always decreasing, so the argument can be continued by induction.

Now, what do we get at the end? We have $$$a = [x]$$$ and $$$\mu(a) = \lambda_1 a_1$$$; it's obvious that the optimal $$$a$$$ is $$$[1]$$$. Treating that as a neighbour-difference array of some $$$a'$$$ we get $$$a' = [1, 2]$$$ or $$$a' = [1, 2, 2]$$$ for the optimal $$$a'$$$, depending on the parity. Treating that as a neighbour-difference array of some $$$a{''}$$$ we get $$$a{''} = [1, 2, 2, 4]$$$ or $$$a{''} = [1, 2, 2, 4, 4]$$$ in the former case and $$$a{''} = [1, 2, 2, 4, 4, 6]$$$ or $$$a{''} = [1, 2, 2, 4, 4, 6, 6]$$$ in the latter case.

It takes a little while to notice the following fact.

Let us consider an infinite sequence $$$G$$$ with the following properties:

  • $$$G_1 = 1$$$,
  • $$$\Delta G = G$$$,
  • among such $$$G$$$ it is the lexicographically smallest one.

Let us prove that the optimal $$$a$$$ for a given $$$n$$$ (arbitrary) and $$$\mu$$$ (with $$$\lambda$$$ non-increasing) is the prefix of $$$G$$$ of length $$$n$$$, by induction. For $$$n = 1$$$ this is obvious. Induction step: say we have already proven the fact for all lesser $$$n$$$. We have also already derived that the optimal $$$a$$$ is of kind $$$[1, x, x, y, y, z, z, \dots]$$$. Due to induction $$$a' = \Delta a$$$ is optimal for $$$\mu'$$$ if $$$a'$$$ is a prefix of $$$G$$$ of length $$$n'$$$. Therefore, $$$a$$$ satisfies the following conditions:

  • $$$a_1 = 1$$$,
  • $$$\Delta a = [G_1, G_2, \dots, G_n]$$$,
  • among such $$$a$$$ it is the lexicographically smallest one.

It's easy to see that this implies that $$$a$$$ is the prefix of $$$G$$$ of length $$$n$$$. The proof by induction is complete.

We now know that $$$G$$$ gives the smallest counterexample to Kapun's algorithm, for lots of definitions of "smallest" (namely, for all $$$\mu$$$ with non-decreasing $$$\lambda$$$). Therefore $$$f(n) = G_n$$$.

The sequence $$$G$$$ is as follows:

$$$ G = [1, 2, 2, 4, 4, 6, 6, 10, 10, 14, 14, 20, 20, 26, 26, 36, 36, 46, 46, 60, 60, 74, 74, 94, 94, 114, \dots] $$$

This series is given by A018819 and is asymptotically

$$$ G_n \approx \exp \frac{\ln^2 n}{2} $$$

with high precision. Therefore Kapun's algorithm works for all

$$$ R \ll \exp \frac{\ln^2 n}{2} = n^{(\ln n) / 2}. $$$

QED.

Upper bound

The bound obtained above can be seen as a kind of "lower bound", meaning that, if you have an array and you want to find two subsets, it makes no sense to decrease $$$R$$$ any further than that bound.

It is reasonable to attempt to provide an upper bound in the same fashion, i.e. some $$$R$$$ such that there is little chance that Kapun's algorithm succeeds anymore.

This bound is, unfortunately, much less rigid than the lower bound. It is derived on assumption that Kapun's algorithm works best when $$$a$$$ is selected randomly uniformly from $$$[1; R]^n$$$, which is of course false. We can only hope that whatever bound we derive is of the right magnitude.

So let's forget about math and all and assume $$$a$$$ is selected randomly. Then, after sorting, $$$a_{i+1} - a_i \approx \frac{R}{n}$$$. So after running one iteration of Kapun's algorithm we have $$$R' \approx \frac{R}{n}$$$ and $$$n' \approx n$$$. After all $$$\log_2 n$$$ operations we have

$$$ \begin{align*} R' & \approx R \cdot \frac{1}{n} \cdot \frac{2}{n} \cdot \frac{4}{n} \cdot \dots \cdot \frac{n}{n} \approx \frac{R}{n^{\log_2 n}} \cdot \prod_{i=0}^{\log_2 n} 2^i \\ & \approx \frac{R}{n^{\log_2 n}} \cdot 2^{(\log_2^2 n) / 2} \approx \frac{R}{n^{(\log_2 n) / 2}}. \end{align*} $$$

We want $$$R' \ge 1$$$ or something, so $$$R > n^{(\log_2 n) / 2}$$$. This bound turns out to be really similar to the rigid lower bound, so maybe we weren't that much off after all.

As far as I am aware, this is what people used to call the "proof" of Kapun's algorithm, which is obviously wrong, but at least the correct proof explains away why it worked so well.

Application to hashing

In our use case, $$$R = M - 1$$$. After rearranging the inequality of the lower bound we get

$$$ n \gg \exp{\sqrt{2 \ln R}}, $$$

which translates to $$$n \approx 625$$$ for $$$M \approx 10^9$$$ and to $$$n \approx 8996$$$ for $$$M \approx 10^{18}$$$.

For the upper bound,

$$$ n \approx 2^{\sqrt{2 \log_2 R}}, $$$

which translates to $$$n \approx 212$$$ for $$$M \approx 10^9$$$ and to $$$n \approx 1958$$$ for $$$M \approx 10^{18}$$$.

In reality, the truth is somewhere in the middle. For $$$M \approx 10^{18}$$$, the 50-th percentile is $$$n \approx 3500$$$ and the 95-th percentile is $$$n \approx 4000$$$.

Implementation

Licensed under GPLv3 or any later version at your option for any use cases, and under MIT for use in competitive programming tournaments.

Code
  • Vote: I like it
  • +246
  • Vote: I do not like it

| Write comment?
»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by purplesyringa (previous revision, new revision, compare).

»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Why do you call this algorithm "Kapun's algorithm"? Is there a paper describing this algorithm?

»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

In before (forgot to add a note, sorry): there are some trivial improvements for this algorithm, like checking that there are no equal elements before computing $$$\Delta a$$$, but they only slightly reduce the required $$$n$$$. If anyone knows how to reduce it significantly without a new approach such as "multi-tree attack", do tell.

»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

I posted a sort of a follow-up post here: https://codeforces.net/blog/entry/100027.

»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

purplesyringa thank you for the great article! I read it through but I didn't get the last part in the Application to hashing. From the proof part you have $$$n \gg \exp \sqrt{2 \ln R}$$$. From the upper bound analysis, $$$n < 2^{\sqrt{2 \log_{2}{R}}}$$$. So, for $$$M \approx 10^9$$$, we have: $$$ 625 \ll n < 212 $$$. Could you elaborate on that?

  • »
    »
    3 years ago, # ^ |
    Rev. 2   Vote: I like it +5 Vote: I do not like it

    Yeah, the names "upper" and "lower" bound are sort of counterintuitive here. The signs should be $$$\gg$$$ in both inequalities. What I meant is that if $$$n \gg \exp \sqrt{2 \ln R}$$$, Kapun's algorithm works for certain, and if $$$n \gg 2^{\sqrt{2 \log_2 R}}$$$, it works with high probability. So in practice the correct $$$n$$$ is somewhere between the two numbers.

»
3 years ago, # |
  Vote: I like it +18 Vote: I do not like it

I just found out that this algorithm is relatively well-known outside Competitive Programming under the name largest differencing method, or Karmarkar-Karp's algorithm. Instead of replacing the array with an array of neighbor differences, the authors suggest to pop the largest two values from the heap and replace them with their difference. The complexity is approximately the same as what was described in this article, though, thanks to the fact that this algorithm deals with whole numbers, while Karmarkar-Karp handles real values just as well.