kfqg's blog

By kfqg, history, 7 years ago, In English

I was reading the tutorial of CF 763D, a problem about tree isomorphism. The solution involved computing polynomial hashes of all possible subtrees of a tree (max 10^5 vertices), and then comparing the hashes as a substitute for checking tree equality. It seems to me like the solution is vulnerable to hash collisions.

I'm relatively new to competitive programming. Is it normal for there to be problems that require solutions that are vulnerable to hash collisions? Shouldn't a good problem be one that is provably solvable for 100% of all possible inputs?

The problem is here: http://codeforces.net/problemset/problem/763/D

The tutorial is here: http://codeforces.net/blog/entry/50205

rng_58 also wrote a blog post on this: http://rng-58.blogspot.com/2017/02/hashing-and-probability-of-collision.html. The solution given there also has a nonzero hash collision probability.

  • Vote: I like it
  • +16
  • Vote: I do not like it

| Write comment?
»
7 years ago, # |
  Vote: I like it +8 Vote: I do not like it

Think about it like this: usually, there are specific tests that make hash collision a problem, as there's a small probability that it'll happen for randomly generated test. Still, one trick that I was taught was to use 2 hashes and so the probability of hash collision for both hashes is smaller(though it takes a bit more time and you can also go up to 3).

»
7 years ago, # |
  Vote: I like it +19 Vote: I do not like it

I've seen some people objecting to problems with nonzero collision chance, but I personally think that doesn't make much sense.

As mentioned, you can use double or triple hashing — and the chance of radiation from the sun changing some bits in your PC is most likely larger than a random triple hash collision.

The only sensible argument could be that if somebody with malicious intentions knows the hashing you are using, they could create an input on which your program behaves badly. However, this is a problem only in specific situations and you could use more sophisticated hashing functions than polynomial hash if that's an issue.

Hashing is quite practical in my opinion and it should be completely fine to set problems in which the intended solution has a nonzero collision chance, as long as that chance is small enough.

  • »
    »
    7 years ago, # ^ |
      Vote: I like it +16 Vote: I do not like it

    A simple way to counter the anti-hash "malicious intentions" would just be to randomly choose mods from a large array at the start. That way no one will be able to hack your solution.

    • »
      »
      »
      7 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Right, that should work. But why do we need a large array for this? Why not just pick a random prime every time?

      • »
        »
        »
        »
        7 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Wouldn't you need to take calculation time to find a prime then?

        • »
          »
          »
          »
          »
          7 years ago, # ^ |
            Vote: I like it +5 Vote: I do not like it

          Right, but even sqrt(n) loops to check for primality shouldn't take much time, assuming mods fit in integer range. Otherwise we can always use fermat's theorem or miller rabin :)

          • »
            »
            »
            »
            »
            »
            7 years ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            I still think pasting some hundred primes into an array is faster than writing prime generator both in amount of writing time and execution time.

»
7 years ago, # |
  Vote: I like it +3 Vote: I do not like it

Shouldn't a good problem be one that is provably solvable for 100% of all possible inputs?

Yeah, obviously. It's a condition that "non-wrong" problem should satisfy. But it is missing some details, so I will write that.

The problem is "not wrong" if it's 100% solvable for all possible inputs specified, assuming the random generators are completely random. Let's see some examples :

  1. Output a + b : Not wrong.

  2. Problem with a random solution, with provably low rate of being wrong or TL : Although there is a room for debate, but the general consensus of the CP community (including me) is to consider such solution as "correct" ones, because we believe that the modern PRNG's are sufficiently good — good enough to believe, that it's a compiler's fault if not random.

  3. Problem which explicitly says that it is a "random input, at the best of setter's knowledge" : This goes same with point 2.

  4. Problem with a random solution, which is unproved, but setter claims that is "sensible" : I've seen a lot of rookie setters who do that, and find a counterexample that breaks the solution regardless of PRNG. If that's the case, teach him to not make such mistakes again.

  5. A hashing problem : Okay, so we pick a random (sometimes even constant) prime, and do hashing.. Obviously I think it's sensible, but history shows that people are not sensible (we don't even need to talk WW2, Anti-hash test is evil enough to break the "sense"), so we need to prove here. I don't know why hashing has provably low rates of collision — but according to rng_58's article (namely Schwartz-Zippels lemma) it seems that there is an enough theory that illustrates why it works. So, now we can see this is same to case 2, so it's OK.

The solution is indeed vulnerable to hash collisions, but if we assume random primes, then we can prove that it does not collide — the result of Schwartz-Zippels lemma. Actually those proofs doesn't come from the setters, it's from rng_58's article — so I don't think it was a good problem, but a problem that was "saved" by rng_58.