nik_exists's blog

By nik_exists, 6 weeks ago, In English
Prompt

CF 800-1200

CF 1500-1700

Overall, R1 performed fairly well, especially for an open-source model that was made as a side project. Its pitfalls were similar to o1's, though it also had the issue of not following instructions at times. Additionally, for the 800-1200 rated problems, I used screenshots to send the statements, but it misread the screenshot during "Maximize MEX", and I had to switch to copying and pasting the statement into the text box.

With that being said, the model, while very impressive, doesn't seem to show anything we haven't already seen with o1, and I doubt any radical changes will have to be made to reduce cheating, similar to what happened when o1-mini was initially released. We've already started to see problemsetters attempt to reduce GPT cheating (see the Round 1000 Anti-LLM report), and assuming this becomes more of a trend in the future, the ability to cheat with these tools should hopefully be diminished to a large degree (at least until o3 comes out, but that's another story).

As a quick side note, LLMs are not deterministic, meaning your results might not be the same as mine here (though I'd suspect them to be fairly similar).


»
6 weeks ago, # |


»
6 weeks ago, # |

How are LLMs not deterministic?? From my understanding the set of weights is the same, so the output should be the same for the same input.

  • »
    »
    6 weeks ago, # ^ |

    (Please correct me if I'm wrong!) As far as I know, most LLMs you can chat with online use randomized seeds so that users don't get the same response every time, as well as a parameter called temperature, which determines how random the responses can be.
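
    As a rough sketch of why the seed matters, here is a toy next-token sampler in Python (the tokens and weights are made up for illustration):

    ```python
    import random

    def sample_token(tokens, weights, seed=None):
        # Toy next-token sampler: with a fixed seed the draw is
        # reproducible; without one, repeated calls can differ.
        rng = random.Random(seed)
        return rng.choices(tokens, weights=weights, k=1)[0]

    tokens = ["yes", "no", "maybe"]
    weights = [0.5, 0.3, 0.2]

    # Same seed -> same "random" choice every time; a chat service
    # that randomizes the seed per request loses this reproducibility.
    assert sample_token(tokens, weights, seed=42) == sample_token(tokens, weights, seed=42)
    ```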

  • »
    »
    6 weeks ago, # ^ |

    I think it is like simulated annealing (SA).

  • »
    »
    6 weeks ago, # ^ |

    Under the hood, the LLM does not directly output an "answer" given the input. Instead, it outputs a probability distribution over the set of possible next tokens. The answer is then obtained by taking a sample from that probability distribution. Of course, you can also just take the token that has the highest probability, and this can be configured for some models.
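
    A minimal sketch of that sampling step, using made-up logits over a toy vocabulary (real models do this over tens of thousands of tokens):

    ```python
    import math
    import random

    def softmax(logits, temperature=1.0):
        # Convert raw scores (logits) into a probability distribution.
        # Lower temperature sharpens it; higher temperature flattens it.
        scaled = [x / temperature for x in logits]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    tokens = ["cat", "dog", "fish"]
    logits = [2.0, 1.0, 0.1]  # hypothetical model outputs

    probs = softmax(logits)

    # Greedy decoding: always take the most probable token (deterministic).
    greedy = tokens[probs.index(max(probs))]  # -> "cat"

    # Sampling: draw from the distribution (non-deterministic unless seeded).
    sampled = random.choices(tokens, weights=probs, k=1)[0]
    ```

    With greedy decoding the same input always yields the same output; with sampling, different runs can diverge, which is why chat results vary between users.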