Hi Codeforces! I am a member of the reasoning team at OpenAI. We are especially excited to see your interest in the OpenAI o1 model launch, many of us being Codeforces users ourselves (chenmark, meret, qwerty787788, among others). Given the curiosity around the IOI results, we wanted to share the submissions that scored 362.14—above the gold medal threshold—from the research blog post with you. These were the highest scoring among 10,000 submissions, so still a ways to go until top human performance, but we aspire to be there one day.
The following C++ programs (including comments!) are written entirely by the model. Special thanks to PavelKunyavskiy for maintaining the IOI mirror, which we used to check our scores. We hope you enjoy taking a look!
nile (100/100)
- Submission (100/100)
message (79.64/100)
- Submission (79.64/100; subtask 1 and partial credit on subtask 2)
tree (30/100)
- Submission 1 (17/100; subtasks 1 and 4)
- Submission 2 (13/100; subtask 2)
hieroglyphs (44/100)
- Submission 1 (34/100; subtasks 1, 2, and 4)
- Submission 2 (10/100; subtask 3)
mosaic (37/100)
- Submission 1 (22/100; subtasks 1, 2, and 4)
- Submission 2 (20/100; subtasks 1, 3, and 5)
sphinx (71.5/100)
- Submission 1 (50/100; 50% partial credit on all subtasks)
- Submission 2 (43/100; subtasks 1, 2, and 3)
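A note on how the totals add up: the task scores above are not plain sums of the listed submissions (for mosaic, 22 + 20 is not 37). Under IOI scoring, a task's score is the sum over subtasks of the best score any submission achieved on that subtask. Here is a minimal sketch of that aggregation. The per-subtask point values are assumptions chosen to be consistent with the totals quoted above, not values copied from the task statements, and `task_score` is a hypothetical helper name.

```python
def task_score(submissions):
    """Sum, over subtasks, of the best score any submission got on that subtask.

    Each submission is a dict mapping subtask number -> points earned.
    """
    subtasks = set().union(*(s.keys() for s in submissions))
    return sum(max(s.get(t, 0) for s in submissions) for t in subtasks)


# Assumed point values, consistent with the totals in the list above.
mosaic_1 = {1: 5, 2: 7, 4: 10}   # 22 points: subtasks 1, 2, and 4
mosaic_2 = {1: 5, 3: 7, 5: 8}    # 20 points: subtasks 1, 3, and 5

sphinx_1 = {1: 1.5, 2: 3.5, 3: 16.5, 4: 10.5, 5: 18}  # 50% on every subtask
sphinx_2 = {1: 3, 2: 7, 3: 33}                        # full credit on 1, 2, 3

mosaic_total = task_score([mosaic_1, mosaic_2])
sphinx_total = task_score([sphinx_1, sphinx_2])
```

Subtask 1 of mosaic is solved by both submissions but counted once, which is why the combined score is lower than 22 + 20.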
Lastly, we hope you find the new model magical and delightful—we can’t wait to hear about the amazing things you’ll build with it. (But please don’t use it to cheat on Codeforces!)
Great work!
It seems that o1 has extremely impressive scores all around; its most impressive score is probably actually hieroglyphs, where a score of 44 would place it fourth relative to onsite contestants! It seems that the model was able to decipher some of the subtasks where we could not!
And how was the performance of the model on Codeforces problems measured? Did it participate in rated rounds? Is it possible to reveal a username of the model on Codeforces?
We evaluated the Codeforces performance of the model via simulation, doing a best effort to approximate how the model would have performed had it participated live. With our Codeforces eval, the model is limited to 10 submissions per problem. We use these submissions to simulate the score; from the score we get a ranking; and from the ranking we estimate the model's rating.
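The last two steps (score to ranking, ranking to rating) could look roughly like the sketch below. This is not our evaluation code; it is a simplified illustration that assumes an Elo-style expected-rank formula of the kind Codeforces has described for its rating system, and it ignores details such as penalties and hacks. The names `rank_from_score`, `expected_rank`, and `performance_rating` are hypothetical.

```python
def rank_from_score(score, other_scores):
    """Place in the standings: 1 + number of participants with a strictly higher score."""
    return 1 + sum(1 for s in other_scores if s > score)


def expected_rank(rating, other_ratings):
    """Expected place of a participant with `rating` among `other_ratings`.

    Each opponent with rating r is assumed to finish ahead with
    probability 1 / (1 + 10 ** ((rating - r) / 400)).
    """
    return 1.0 + sum(1.0 / (1.0 + 10 ** ((rating - r) / 400.0)) for r in other_ratings)


def performance_rating(rank, other_ratings, lo=0.0, hi=5000.0, iters=60):
    """Binary-search the rating whose expected place matches the observed rank."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, other_ratings) > rank:
            lo = mid   # expected place is worse than observed, so the rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, `performance_rating(rank_from_score(my_score, scores), ratings)` would give one rating estimate for a single simulated contest; averaging over many contests would smooth out the noise.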
Could be a naive question but: Do you guys (OpenAI) plan on watermarking the code generated by future models? It could make the process of detecting AI generated code much easier.
Watermarking 50 lines of code seems impossible, unlike watermarking an image.
I can't speak to future plans for OpenAI. That said, speaking for myself (and not OpenAI), I think watermarking is a cool research direction but not a panacea. For many problems, all AC solutions fall into a few broad buckets, and within those buckets, it is difficult to identify AI vs. non-AI solutions if one is allowed to rewrite/obfuscate code.
Out of curiosity: can you share if there are any endeavors in problem setting?
We don't have any results on problem setting, and I could imagine that writing creative problems is a bit out of reach of current models. (I struggle to even get them to tell me a new joke :)) But synthetic problems have been used in the training of models, e.g. AlphaGeometry.
This is probably obvious, but I want to ask: did the AI use stress testing locally to check for correctness? I suppose it is capable of writing a brute-force solution and testing against it locally. Just curious whether the 10k submissions could be avoided (or whether this could even improve the performance).
Maybe you didn't want to do that because adding human heuristics on top of the AI just for the sake of performance is not the goal?
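For anyone unfamiliar with the term, stress testing here means running a candidate solution and a slow but obviously correct brute force on many small random inputs and comparing the answers. Below is a minimal sketch on a toy problem (maximum subarray sum); it is not what the model or OpenAI actually ran, and all function names are hypothetical.

```python
import random


def max_subarray_brute(a):
    """Reference answer: try every non-empty subarray, O(n^3)."""
    n = len(a)
    return max(sum(a[i:j + 1]) for i in range(n) for j in range(i, n))


def max_subarray_fast(a):
    """Candidate solution: Kadane's algorithm, O(n)."""
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def stress(candidate, reference, trials=1000, seed=0):
    """Return the first random input where the two disagree, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 6)                       # small inputs keep the brute force fast
        a = [rng.randint(-5, 5) for _ in range(n)]
        if candidate(a) != reference(a):
            return a
    return None
```

A candidate that passes `stress` is not proven correct, but one that fails can be discarded without spending a real submission.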
In the blog post, we discussed this a little:
It would be super cool if one day the AI could do stress testing without human heuristics on top!
Edit: Oh, I got it. The model only submitted 50 solutions, as is the competition constraint. It generated thousands of solutions, but it only submitted 50.
I think you misunderstood here...
There are actually 3 different results:
It can be seen on this webpage: https://openai.com/index/learning-to-reason-with-llms/#coding
Oh, thanks! I was just too lazy to read it. I prefer reading Codeforces comments :)
So I keep my position: I would expect that a sophisticated heuristic on top of the model with stress testing would, in most cases, be as accurate as the real verdict. That is, score should not improve by allowing more submissions.
When do you think AI will be able to solve Master level problems? Or is that even possible?
No way competitive programmers are the ones trying to ruin the sport
Creating 1 algorithm to solve all problems is the ultimate challenge.
It would be even more impressive if you could share the score distribution histogram of the 10,000 submissions for each task!
How much computing power was used?
What was the prompt after seeing that the code was failing? Did it generate some test cases somehow?
Why competitive programming?
Thanks for posting such details. You guys do such interesting things!
Very insightful.
Edit: Seems like its solution is in fact correct, so 1-0 for the AI against me
Original: In particular I find the results on "message" interesting — it seems like its basic idea of determining a known safe column is not really correct, but given 10 000 submissions I imagine it tried a lot of different ways to communicate a safe column and eventually one went past the grader. That gives one view of why more submissions can be more helpful. I haven't examined the sphinx code, but I imagine in principle a similar thing is possible there, too.
What's wrong with it?
Actually, you're right, I seem to have misunderstood its approach. Not sure why so many people agreed with me.
Are you suggesting it can be hacked? Can you hack it?
But what about cheaters who use AI?
It is not much different from cheaters who use their friends/submit from multiple accounts. New technology, but the same old problem.
Amazing things you are doing :)
Any plans to participate in the ICPC World Finals?