The GPT models have made great advancements in competitive programming recently. I saw from Qingyu's blog earlier that the new o3 model claims to achieve a 2727 performance rating.
However, I'm slightly uncertain of how this 'rating achieved' should be interpreted. There is very little detail publicly available on how the benchmarks are computed. The only available detail I could find regarding their benchmark methodology is the following quote from the 'reasoning with LLMs' blog:
Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model’s coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions.
This does not really offer any detailed benchmarking methods. Given the limited details, I'm hypothesising that the LLM doesn't need to solve problems anywhere close to the 2700-rated ones to achieve a 2727 performance rating. This is due to the following reasons:
- Codeforces contests have a time-decay factor in the scoring system; human solvers at x rating usually only solve a x-rated problem in the later half of the contest, resulting in only <60% of the total score awarded for the problem;
- LLMs, on the other hand, might solve a problem very quickly if it is able to solve it at all. Therefore, the model doesn't need to solve as many problems as a normal human contestant in order to achieve the same 'performance rating' on Codeforces contests. In fact, speed in solving easier problems is often decisive in the performance rating system of Codeforces.
To study the above, I've collated some data from the recent Division 1 contests. For each contest, we gather the score of the contestant with a performance rating closest to 2727 using this tool, and computed the number of the first n questions the LLM need to be able to solve to achieve that score. The following assumptions were made:
- The LLM only solves problems in increasing order of difficulty (i.e. not a rainboy model);
- If the LLM is able to solve a problem, then it solves it in its first attempt, therefore incurring no penalty for wrong submissions;
- If the LLM is able to solve a problem, then it solves it within the first minute, therefore incurring no time-decay penalty.
The first n problems usually don't sum up exactly to the required score for the performance rating. As a result, the requirements reported are in the format of 'solving up to problem X with a probability p', where p is the probability that makes the score of the first X-1 problems plus p * score of problem X equal to the required score.
The results for the most recent 8 Div.1 contests are illustrated in the following table:
Round Name | Score Required | Boundary Problem | Ratings of Boundary Problems |
---|---|---|---|
Codeforces Global Round 28 (Div 1+2) | 5984 | F, 61.7% chance | NA |
Codeforces Round 990 (Div 1) | 2492 | C, 82.8% chance | 1600/2100 |
Rayan Programming Contest 2024 (Codeforces Round 989, Div. 1 + Div. 2) | 7252 | F1, 50.1% chance | 2200/2500 |
CodeTon Round 9 (Div. 1 + Div. 2) | 7973 | E, 71.7% chance | 1700/2200 |
Refact.ai Match 1 (Codeforces Round 985, Div1+2) | 9801 | F, 43.3% chance | 2100/2500 |
Codeforces Global Round 27 (Div 1+2) | 5917 | F, 29.6% chance | 2300/2500 |
Codeforces Round 980 (Div 1) | 2532 | C, 68.6% chance | 1700/2400 |
Codeforces Round 975 (Div 1) | 3120 | D, 74.7% chance | 1700/2200 |
The last column displays the problem ratings of the two problems on the boundary (i.e. the boundary problem and the one prior).
From the table, we can see that the LLM has a large advantage in combined Div1+2 rounds, as these rounds have quite some more problems than the isolated Div 1 rounds (therefore benefiting from its speed, since it takes longer for human solvers to reach the later problems). For the isolated Div 1 rounds, the LLM needs to have a good chance (60-80%) at solving the 2100-2400 rated problems (while solving 1600-1700 rated problems with certainty) in order to achieve the proclaimed '2727 performance rating'. Extremely impressive nevertheless, although it is quite different from having a chance at solving 2700-rated problems. That being said, I'm sure advancements in LLM will make solving 2700-rated problems a reality in the near future.
Appendix:
The above assumptions are quite optimal even for LLMs. The following table displays another hypothetical scenario, where we assume that the LLM makes 1 incorrect submission per problem, therefore incurring a -50 penalty for each problem. The corresponding results are displayed in the following table:
Round Name | Score Required | Boundary Problem | Ratings of Boundary Problems |
---|---|---|---|
Codeforces Global Round 28 (Div 1+2) | 5984 | F, 76.7% chance | NA |
Codeforces Round 990 (Div 1) | 2492 | C, 92.8% chance | 1600/2100 |
Rayan Programming Contest 2024 (Codeforces Round 989, Div. 1 + Div. 2) | 7252 | F1, 50.1% chance | 2200/2500 |
CodeTon Round 9 (Div. 1 + Div. 2) | 7973 | E, 82.7% chance | 1700/2200 |
Refact.ai Match 1 (Codeforces Round 985, Div1+2) | 9801 | F, 53.3% chance | 2100/2500 |
Codeforces Global Round 27 (Div 1+2) | 5917 | F, 43% chance | 2300/2500 |
Codeforces Round 980 (Div 1) | 2532 | C, 78.8% chance | 1700/2400 |
Codeforces Round 975 (Div 1) | 3120 | D, 88% chance | 1700/2200 |
Note that the boundary problems do not change at all, although the LLM does need to solve them at a higher success rate. So it really is the speed of the LLMs that yield the advantage under this set of assumptions.