Блог пользователя chromate00

Автор chromate00, 4 дня назад, По-английски

As the main problemsetter of Codeforces Round 1000 (Div. 2), I know that a lot of you were somewhat disappointed to see that it was not a Div.1 contest. Though I am not the one who caused a lack of Div.1 contests, I understand your feeling. So I present to you the great surprise: The Anti-LLM Evaluation Report for Codeforces Round 1000 (Div. 2), the first of its kind on a Div.2! In this blog we discuss about how the round combats against LLMs (especially focusing on OpenAI o1, the greatest of its kind while we prepared and tested the problemset), by looking at the timeline of how the problemset changed.


So, let us begin with the initial problemset we had when the testing began. (The task names are anonymized, unless they are released to the public before or in the round)

In the beginning, we had a problemset that looks like this:

  • A' — B — C0 — D' — E' — F0

(If you are confused with the meanings, (letter)' means it was not used in the final problemset but existed, while (letter)0 means that it appeared in the final problemset in a different form.)

So, you may see, at least half of the problemset has changed in the testing phase. How did o1 do against this revision of our problemset?

  • A': Not Solved.
  • B: Not Solved.
  • C0: Solved.
  • D': Not Solved.
  • E': Not Solved.
  • F0: Not Solved.

Okay, so it already was a bit strong against o1, but C0 being solved by LLMs while A' being not solved looked too odd to me. It felt like LLMs would get a too large unfair advantage if it was kept this way.

Now, the timeline of the problemset begins. (Note that the timeline is purely based on my memory and whatever information is left in the testing mashup)


  • The first task to get swapped was A'. A lot of testers felt A' a bit too hard for its position, and we had to find a new one. Luckily, we found A'', and swapped A' with it. A'' was solved by o1. But we believed that would be fine if we could find a replacement to C0 that isn't solved by LLMs. Back on that later.

  • The second task to get swapped was F0. F0 was a version of F that only asks for uniqueness. It was interactive, and we didn't really have a good way to force them to solve online. And there were unexpected solutions even in the online setting, so we made F that asks for counting, and split it to two subtasks. As you might expect, F was not solved by o1.

  • The third task to get swapped was C0. We swapped it out for C' initially, but the testers did not like it (probably it was too hard for its position), and we swapped it back to C0. C' was not used afterwards, and it was not evaluated against LLMs either.

It was very hard to find a replacement for C0. Until...


  • We found C. It was just a random idea I pulled while ranting about how hard it is to make tasks on the position. But somehow, very surprisingly, it was not solved by o1. I am still surprised about how o1 cannot solve it, don't ask us, go ask Sam Altman instead. Anyways, this replaced C0. Very lucky!

  • At this point, most people struggled on E'. We decided that it is not a good fit for the position. As a replacement, we found E and replaced E' with it. Thankfully, the testers pointed out that it's just the fine difficulty for the position. And it's also not solved by o1. Another good one. Later, E' became COUNTGOOD on CodeChef Starters 167.

  • Some time after that, some testers pointed out that D' is very similar to a task from GCJ. But worry not, we have discussed this out to balance the difficulty for a long time. Surprise: C0 became buffed to D. And it is now not solved by o1! Nice.

Around this time, I asked rewhile to test. He knew better about prompting LLMs, so I asked explicitly to test only using o1 (and he gladly accepted). Here was the result.

  • A'': AC
  • B: -11
  • C: -4
  • D: -2
  • E: -1
  • F (Easy): -2
  • F (Hard): -1

So yes, I believed that o1 will die horribly if it's the same way.


  • After this, I just changed A'' to A, which was much easier than A'' for humans. It is solved by o1, but it's fine now.

There are some omitted changes also (Such as a task proposed for Div2D but rejected immediately after I found that it's a Div2B), but these are all of the significant changes.


So the final result for o1 is as follows.

  • Expected Score: $$$498$$$.
  • Corresponding Rating: 911 (762 rating points less than the 1673, initially claimed by OpenAI)

The point we need to focus on is that, when these LLMs came out, people thought it's the end of the world for Div.2, but not yet! It might not be the end of the world! But it takes more effort to combat them. Here are some things I found, as a guidance for problemsetters who want to make your Div.2 contest strong against LLM.

  • You might have noticed, the round has significantly different problem styles compared to the usual Div.2; that might have helped combat them.
  • Maybe it could be time to change the meta again, after the last time it changed, which was probably when 1188B - Count Pairs appeared? I don't know. Up to the next Div.2 problemsetters to decide.
  • In terms of problem style, problems that require multiple small observations will be generally more robust against LLMs than those that require one or two big observations. Stronger LLMs usually get one or two first observations correct. Making more small observations will make them also suffer from limit on number of tokens. For example, on problem C, o1 got the immediate first observation, but got WA by choosing the first $$$k=\mathcal{O}(\sqrt{n})$$$ and bruteforcing $$$k \choose 2$$$ pairs.
  • It will require you significantly more effort to make easier tasks than to make hard tasks. Position C required us more effort than position F required us. For harder tasks you can care less if your task isn't extremely classic or already known, which I assume is usually not the case.

Maybe for a few things noted I might be not the closest to the ground truth. Tell me in comments if you need to point something out.


Also, for cheaters that tried to use o1:

I gave you a hint already. I hope you learn from negative delta, and become honest and diligent again. I hope you are a sane person. I, myself, truly improved only after moving on over bad behaviour. Yes, I had cheated back in the days, and now I became as honest as one could. I believe you can be better also.

Please take this opportunity as a lesson.

Thank you for reading.

  • Проголосовать: нравится
  • +263
  • Проголосовать: не нравится

»
6 часов назад, # |
  Проголосовать: нравится +1 Проголосовать: не нравится

Yann Lecun Vindicated

»
5 часов назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

This is the first contest I've seen have an Anti-LLM report explaining the process behind the contest design, and I really appreciate this and hope more codeforces contests do this in the future. I do have a question for rewhile about how he tested the LLMs; I find the issue with LLMs isn't just that they can solve singular observation problems but they can also help a user make observations they would not make otherwise, thus cheating for them by indirectly solving the problem. If it is possible, could you explain some of the general methods you used in testing against the o1?

»
5 часов назад, # |
  Проголосовать: нравится +2 Проголосовать: не нравится

It's awesome. though I think OpenAI should pay you for testing their model :)

»
5 часов назад, # |
  Проголосовать: нравится +7 Проголосовать: не нравится

how to make anti-gpt contest -> set problems until not solvable by gpt 😂

if any proposed problem is solvable then just reject immediately

»
5 часов назад, # |
  Проголосовать: нравится +1 Проголосовать: не нравится

Just a query, was there any special rating distribution for this round as there are participants who were lower than me in the contest and higher rated than me, but they finished with positive delta while I received negative?

  • »
    »
    5 часов назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    uhh, I think your rank is incorrectly applied for rating change. You could ask KAN for a quick answer, or probably wait until recalculation after plag check.

»
5 часов назад, # |
Rev. 2   Проголосовать: нравится +24 Проголосовать: не нравится

Your problemset is not as anti-LLM as you think. After seeing this post, I tested it out for myself, this is the result:

  • A: solved (one try)
  • B: not solved yet (two tries so far)
  • C: not solved yet (three tries so far)
  • D: solved (one try)
  • E: solved (one try)

as you can see from here, although B and C are indeed hard for o1 to solve, it managed to solve D and E in just one try. I suspect F1 might be solve-able too, but have not tried for myself.

It only takes around a performance of 1500 to solve ABC, but if you manage to solve ABCDE, your performance is now 2200+. Which means that a normal Specialist with GPT o1 can easily cheat their way to become a Master.

For anyone interested, check out my submissions.

  • »
    »
    5 часов назад, # ^ |
      Проголосовать: нравится +14 Проголосовать: не нравится

    Update: C is also solved

  • »
    »
    5 часов назад, # ^ |
      Проголосовать: нравится +4 Проголосовать: не нравится

    itsover

    I kneel chatgpt sama

  • »
    »
    5 часов назад, # ^ |
    Rev. 4   Проголосовать: нравится 0 Проголосовать: не нравится

    We checked multiple times that it could not solve the tasks, and there were rated participants almost matching the expected rating who seem to have used o1. Maybe it could be different for different prompts, like depending on if you can give it better ideas.

    I also understand that the results can be very different for DeepSeek-r1. In fact we caught some users suspicious for using r1 already :skull:

    • »
      »
      »
      5 часов назад, # ^ |
      Rev. 2   Проголосовать: нравится 0 Проголосовать: не нравится

      No, all of this was achieved using o1, not r1, no ideas were given from my end either.

      • »
        »
        »
        »
        4 часа назад, # ^ |
          Проголосовать: нравится 0 Проголосовать: не нравится

        true, I didn't mean to say you didn't. Probably the experiment was done in different conditions considering that our tester didn't have o1-pro.

        Maybe the round is not $$$100$$$ percent LLM-proof (DeepSeek-r1 was about that level), but it is still meaningful that we found out it is at least possible to fool it partially.

        • »
          »
          »
          »
          »
          4 часа назад, # ^ |
            Проголосовать: нравится +12 Проголосовать: не нравится

          I was not using o1-pro nor do I have GPT pro, everything had been conducted on the normal version of o1, the only difference would probably be in the prompting, I argue that this should not be considered as it is something easily editable. I appreciate your effort, but please do not make such bold claims.

          • »
            »
            »
            »
            »
            »
            4 часа назад, # ^ |
              Проголосовать: нравится 0 Проголосовать: не нравится

            Sorry, we were not sure if we are really seeing the same version of o1 as we tested against the problemset...

            Our testers are telling me that o1 has never thought for more than 2 minutes for them.

            • »
              »
              »
              »
              »
              »
              »
              4 часа назад, # ^ |
                Проголосовать: нравится 0 Проголосовать: не нравится

              I am quite sure this is due to a difference in prompting, if I just tell it to solve normally, I don't see it thinking for long either, you can check out my prompt in the video above

              • »
                »
                »
                »
                »
                »
                »
                »
                3 часа назад, # ^ |
                Rev. 3   Проголосовать: нравится +19 Проголосовать: не нравится

                I think you are using prompts that are mostly copied from (or at least very similar to the prompts of) rewhile so it cannot explain the difference.

              • »
                »
                »
                »
                »
                »
                »
                »
                2 часа назад, # ^ |
                  Проголосовать: нравится +12 Проголосовать: не нравится

                First of all, I appreciate your efforts of seeking the truth, and get quite shocked by your results. If ChatGPT-o1 can solve A-E easily, it is even slightly outperforming the claimed Elo rating of about 2000.

                I generally copied your prompt from your video to try problem E. I passed the screenshot of problem description with the following prompt:


                I want you to think thoroughly about this problem. use proper spacing for formatting. implement a solution in C++. the input and output are given in standard IO.

                i repeat that you must think thoroughly about your solution, ensuring its correctness and that it's fast enough according to constraints before answering. use more reasoning tokens if needed, and do not stop thinking, do not give up. Output your analysis with the solution.


                I used ChatGPT with web browser. I can confirm that I have selected GPT-o1 with o1 reasoning enabled, however, its solutions failed. See my last two submissions for the generated solutions.

                The most suspicious thing I found is the reasoning time. In my case, o1 only spent less then 10 seconds to generate the solution, which is significantly shorter than 7m27s in your video. I selected o1 model, uploaded screenshot, enabled o1 reasoning, and pasted the above prompt. Is there any other difference from your setting? Or does the model behave differently to different users?

        • »
          »
          »
          »
          »
          92 минуты назад, # ^ |
            Проголосовать: нравится 0 Проголосовать: не нравится

          I also tested with DeepSeek-R1. It solved A easily, then failed to solve B-F1. I even manually prompted the property in B, the model simply ignored my advice.

          I am not saying bad to R1. I really appreciate its contribution to the open-source community. And it often outputs an impressive amount of inner-thinking process, which makes me quite optimistic that LLM can eventually reach red (it seems that R1 is not specifically finetuned for competitive programming tasks, just my conjecture).

          • »
            »
            »
            »
            »
            »
            84 минуты назад, # ^ |
            Rev. 3   Проголосовать: нравится 0 Проголосовать: не нравится

            Oh, my bad, yes, r1 did require extra prompts to get B-D. It failed E and F.

            • A: AC immediately
            • B: AC with counter example
            • C: min(k, n) to WA, AC after giving it a compilation of (approach,verdict)
            • D: AC after telling its initial code TLE on test 8
            • E: TLE on test 5
            • F1: WA on sample for 20 tries

            So probably, with sufficiently strong prompts it is stronger than what o1 is expected to be.

            Note that this would NOT help anyone cheat yet; I just found some suspected cheaters using this pattern of wrong answers and reported to the coordinator.

            • »
              »
              »
              »
              »
              »
              »
              38 минут назад, # ^ |
                Проголосовать: нравится +1 Проголосовать: не нравится

              Definitely I understand that our discussions target at promoting fairness in online contests. Testing the capabilities of LLMs is part of my research interest, so I am not the one trying to cheat. I believe that beaaaan also does not mean to make you awkward, but exploring possible cases that proposed problems are solvable with the assistance of LLM (and a bit prompt-engineering).

              However, my general understanding is, LLM coders are likely to outperform many of us in the near future. At that time, designing easy to medium problems that can't be solved by LLMs will become hard or even impossible (if you expect competent human to successfully solve it).

              How will the contests be at that time? If powerful models are accessible (like in chess games), ratings will not explain anything since cheaters can easily become among top 5% or even top 1%. Similar things will happen to online interviews. But let us leave this topic to the future.

              Back to our conversation, I appreciate your attempts to avoid setting trivial problems, and your writing this blog to provoke critical thinking about LLMs. (Thank you for trying to defend human, LOL)

  • »
    »
    4 часа назад, # ^ |
      Проголосовать: нравится +8 Проголосовать: не нравится

    I don't know much about LLMs, but could it be that the o1 model improves with each new interaction/inference? Maybe it can use recent data in some way (such as code on the Internet that came after the contest).

    • »
      »
      »
      4 часа назад, # ^ |
        Проголосовать: нравится +4 Проголосовать: не нравится

      no way LLM like ChatGPT can improve in real time or... are they...?

      • »
        »
        »
        »
        3 часа назад, # ^ |
        Rev. 2   Проголосовать: нравится 0 Проголосовать: не нравится

        I don't think they improve in real time. They use our interactions to improve their model but that doesn't mean you teach it something and in the very next prompt (in a new chat) it has now rectified its mistake

    • »
      »
      »
      60 минут назад, # ^ |
      Rev. 2   Проголосовать: нравится +1 Проголосовать: не нравится

      Once the LLM is deployed , and you are using it through the web interface, its weights (parameters) are freezed and can't change their weights (which directly affects performance) with interactions with the user. The only possible way of improvement is through the prompting style ( chain of thought, in-context learning, few shot prompting). One of these techniques is a term called memory which helps personalize the user experience by summarizing interactions with the user and retrieving them later in new conversations. As for o1, there is no certainty about what is happening, but reasoning models in general are doing what is like iterative search process for the best chain of though or line of reasoning that would lead to a correct solution. It is like the LLM is guiding the brute force search to be more efficient. Regarding your point, It could be that its incorporating same conversation context in improving the search process by adding more constraints and eliminating possibilities.