nik_exists's blog

By nik_exists, 21 hour(s) ago, In English
Prompt

CF 800-1200

CF 1500-1700

Overall, R1 performed fairly well, especially for an open-source model that was built as a side project. Its pitfalls were similar to o1's, though it also occasionally failed to follow instructions. Additionally, for the 800-1200 rated problems I initially sent the statements as screenshots, but the model misread the screenshot for "Maximize MEX", so I switched to copying and pasting the statements into the text box.

With that being said, the model, while very impressive, doesn't seem to show anything we haven't already seen with o1, and I doubt any radical changes will have to be made to reduce cheating, similar to those made when o1-mini was first released. We've already started to see problemsetters attempt to reduce GPT cheating (see the round 1000 Anti-LLM report), and assuming this becomes more of a trend in the future, the ability to cheat with these tools should hopefully be diminished to a large degree (at least until o3 comes out, but that's another story).

As a quick side note, LLMs are not deterministic, so your results might not match mine exactly (though I'd expect them to be fairly similar).
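For anyone curious why that is: most chat interfaces sample each token from a probability distribution rather than always picking the most likely one, so two runs with the same prompt can diverge. The toy sketch below illustrates the idea with a made-up three-token vocabulary and made-up probabilities (none of this reflects any real model's internals):

```python
import random

def sample_token(probs, rng):
    """Sample one token from a list of (token, probability) pairs.

    This mimics the sampling step of LLM decoding: a random draw
    against the cumulative distribution, so repeated calls with an
    unseeded RNG can return different tokens for the same input.
    """
    r = rng.random()
    cum = 0.0
    for token, p in probs:
        cum += p
        if r < cum:
            return token
    return probs[-1][0]  # guard against floating-point rounding

# Hypothetical next-token distribution for illustration only.
probs = [("A", 0.5), ("B", 0.3), ("C", 0.2)]

# Unseeded RNG: the chosen token can differ on every run,
# which is exactly why two identical prompts can diverge.
print(sample_token(probs, random.Random()))
```

Greedy decoding (always taking the highest-probability token) would be deterministic, but most deployed chat models sample instead, which is why re-running the same problems can give different verdicts.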

