Performance Ratings vs Ratings of Solvable Problems for the Recent GPT Models

#	User	Rating
1	jiangly	3898
2	tourist	3840
3	orzdevinwang	3706
4	ksun48	3691
5	jqdai0815	3682
6	ecnerwala	3525
7	gamegame	3477
8	Benq	3468
9	Ormlis	3381
10	maroonrk	3379

#	User	Contrib.
1	cry	168
2	-is-this-fft-	165
3	Dominater069	161
4	Um_nik	160
5	atcoder_official	159
6	djm03178	157
7	adamant	153
8	luogu_official	151
9	awoo	149
10	TheScrasse	146

The GPT models have made great advancements in competitive programming recently. I saw from Qingyu's blog earlier that the new o3 model claims to achieve a 2727 performance rating.

However, I'm slightly uncertain of how this 'rating achieved' should be interpreted. There is very little detail publicly available on how the performance ratings are computed. The only available detail I could find regarding their benchmark methodology is the following quote from the 'reasoning with LLMs' blog:

Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model’s coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions.

This does not really offer any detailed computation methods. Given the limited details, I'm hypothesising that the LLM doesn't need to solve problems anywhere close to the 2700-rated ones to achieve a 2727 performance rating. This is due to the following reasons:

Codeforces contests have a time-decay factor in the scoring system; human solvers at x rating usually only solve a x-rated problem in the later half of the contest, resulting in only <60% of the total score awarded for the problem;
LLMs, on the other hand, might solve a problem very quickly if it is able to solve it at all. Therefore, the model doesn't need to solve as many problems as a normal human contestant in order to achieve the same 'performance rating' on Codeforces contests. In fact, the speed in solving easier problems is often decisive in the performance rating system of Codeforces.

To study the above, I've collated some data from the recent Division 1 contests. For each contest, we gather the score of the contestant with a performance rating closest to 2727 using this tool, and computed the number of the first n questions the LLM need to be able to solve to achieve that score. The following assumptions were made:

The LLM only solves problems in increasing order of difficulty (i.e. not a rainboy model);
If the LLM is able to solve a problem, then it solves it in its first attempt, therefore incurring no penalty for wrong submissions;
If the LLM is able to solve a problem, then it solves it within the first minute, therefore incurring no time-decay penalty.

The first n problems usually don't sum up exactly to the required score for the performance rating. As a result, the requirements reported are in the format of 'solving up to problem X with a probability p', where p is the probability that makes the score of the first X-1 problems plus p * score of problem X equal to the required score.

The results for the most recent 8 Div.1 contests are illustrated in the following table:

Round Name	Score Required	Boundary Problem	Ratings of Boundary Problems
Codeforces Global Round 28 (Div 1+2)	5984	F, 61.7% chance	NA
Codeforces Round 990 (Div 1)	2492	C, 82.8% chance	1600/2100
Rayan Programming Contest 2024 (Codeforces Round 989, Div. 1 + Div. 2)	7252	F1, 50.1% chance	2200/2500
CodeTon Round 9 (Div. 1 + Div. 2)	7973	E, 71.7% chance	1700/2200
Refact.ai Match 1 (Codeforces Round 985, Div1+2)	9801	F, 43.3% chance	2100/2500
Codeforces Global Round 27 (Div 1+2)	5917	F, 29.6% chance	2300/2500
Codeforces Round 980 (Div 1)	2532	C, 68.6% chance	1700/2400
Codeforces Round 975 (Div 1)	3120	D, 74.7% chance	1700/2200

The last column displays the problem ratings of the two problems on the boundary (i.e. the boundary problem and the one prior).

From the table, we can see that the LLM needs to solve slightly higher-rated problems in combined Div1+2 rounds to achieve the same performance rating. However, there are many factors that may contribute to this observation. Firstly, note that a question being rated higher doesn’t necessarily mean it’s harder: a problem F in a combined round may rate higher than the same question labelled C in a Div1 simply because there are more problems in front of it, so some contestants that might be able to solve it otherwise may not be able to have enough time to code it after solving the earlier questions. I included a link to each boundary problem for the reader to judge their relative difficulties by themselves. It is also worth note that some Div1 rounds are somewhat ‘speedforces’, where most contestants can only solve 2/3 problems; as a result, the speed at solving this heavily influences the performance. I leave this comparison to be interpreted by the readers.

In my personal opinion, I believe that the problem ratings for the Div1-only rounds more accurately reflect the reasoning difficulty, and that LLMs have an advantage in the combined rounds because it can get some free points by solving the easy questions fast, therefore gaining an upper hand over human solvers. However, LLM may also benefit in some speedforces contests, so it is difficult to reach any conclusive judgement with the limited details available.

For the isolated Div 1 rounds, the LLM needs to have a good chance (60-80%) at solving the 2100-2400 rated problems (while solving 1600-1700 rated problems with certainty) in order to achieve the proclaimed '2727 performance rating'. So the actual 'reasoning skill' reflected is closer to a master-level contestant rated around 2200-2300. Extremely impressive nevertheless, although it is quite different from having a chance at solving 2700-rated problems, or possessing the same reasoning skill of a 2727-rated contestant. That being said, I'm sure advancement in LLMs will make solving 2700-rated problems a reality in the near future.

Note that the probabilities can also be interpreted in another way. Let's assume another plausible situation, where the LLM is able to solve all the questions earlier than the boundary problems immediately (within the first minute), but requires some actual execution time to produce a solution for the boundary problem. Then the above probabilities translate to the percentage of score it needs to secure to achieve the claimed 2727 performance, therefore giving an estimate of the time allowed for the LLM. If I understand the time decay formula correctly, then e.g. the time allowed to solve for 80%/60% score is at around 1:15/2:30 for a 3h contest. Solving the problem at the last minute of a 3h contest grants around 52% score. Looking at the table above, this means that the model needs to solve the boundary problem somewhere in the later half of the contest.

Conclusion/TLDR

Appendix:

The above assumptions are quite optimistic even for LLMs. The following table displays another hypothetical scenario, where we assume that the LLM makes 1 incorrect submission per problem, therefore incurring a -50 penalty for each problem. The corresponding results are displayed in the following table:

Round Name	Score Required	Boundary Problem	Ratings of Boundary Problems
Codeforces Global Round 28 (Div 1+2)	5984	F, 76.7% chance	NA
Codeforces Round 990 (Div 1)	2492	C, 92.8% chance	1600/2100
Rayan Programming Contest 2024 (Codeforces Round 989, Div. 1 + Div. 2)	7252	F1, 65.1% chance	2200/2500
CodeTon Round 9 (Div. 1 + Div. 2)	7973	E, 82.7% chance	1700/2200
Refact.ai Match 1 (Codeforces Round 985, Div1+2)	9801	F, 53.3% chance	2100/2500
Codeforces Global Round 27 (Div 1+2)	5917	F, 43% chance	2300/2500
Codeforces Round 980 (Div 1)	2532	C, 78.8% chance	1700/2400
Codeforces Round 975 (Div 1)	3120	D, 88% chance	1700/2200

Note that the boundary problems do not change at all, although the LLM does need to solve them at a higher success rate. So it really is the speed of the LLMs that yield the advantage under this set of assumptions.

Rev.	By	When	Δ	Comment
en14	raoxj	2024-12-21 15:19:31	139
en13	raoxj	2024-12-21 15:05:25	1043
en12	raoxj	2024-12-21 14:48:18	218
en11	raoxj	2024-12-21 14:44:59	172
en10	raoxj	2024-12-21 14:40:21	32
en9	raoxj	2024-12-21 14:34:28	39	Tiny change: 'ter range (very imp' -> 'ter range under an optimistic set of assumptions (very imp'
en8	raoxj	2024-12-21 14:33:48	140
en7	raoxj	2024-12-21 14:10:33	505	Tiny change: 'ntest.\n\nTLDR: <spoiler s' -> 'ntest.\n\n<spoiler s'
en6	raoxj	2024-12-21 05:11:21	888
en5	raoxj	2024-12-21 04:33:03	2	Tiny change: ' \| F1, 50.1% chance' -> ' \| F1, 65.1% chance'
en4	raoxj	2024-12-21 04:07:54	871
en3	raoxj	2024-12-21 02:53:00	402
en2	raoxj	2024-12-21 02:29:10	10307	Tiny change: 'test 2024 Selection (Codefor' -> 'test 2024 (Codefor' (published)
en1	raoxj	2024-12-21 01:46:43	3521	Initial revision (saved to drafts)

History