Performance Ratings vs Ratings of Solvable Problems for the Recent GPT Models

№	Пользователь	Рейтинг
1	tourist	3985
2	jiangly	3814
3	jqdai0815	3682
4	Benq	3529
5	orzdevinwang	3526
6	ksun48	3517
7	Radewoosh	3410
8	hos.lyric	3399
9	ecnerwala	3392
9	Um_nik	3392

№	Пользователь	Вклад
1	cry	169
2	maomao90	162
2	Um_nik	162
4	atcoder_official	161
5	djm03178	158
6	-is-this-fft-	157
7	adamant	155
8	awoo	154
8	Dominater069	154
10	luogu_official	151

The GPT models have made great advancements in competitive programming recently. I saw from Qingyu's blog earlier that the new o3 model claims to achieve a 2727 performance rating. However, this makes me increasingly uncertain of how this 'rating achieved' should be interpreted.

There is very little detail publicly available on how the benchmarks are computed. The only available detail I could find regarding their benchmark methodology is the following quote from the 'reasoning with LLMs' blog:

Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model’s coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions.

This does not really offer any detailed benchmarking methods. Given the limited details, I'm sceptical of if a performance rating of 2727 can be interpreted as 'having a fair chance at solving problems close to 2700 rating'. This is due to the following reasons:

Codeforces contests have a time-decay factor in the scoring system; human solvers at x rating usually only solve a x-rated problem in the later half of the contest, resulting in only <60% of the total score awarded for the problem;
LLMs, on the other hand, might solve a problem very quickly if it is able to solve it at all. Therefore, the model doesn't need to solve as many problems as a normal human contestant in order to achieve the same 'performance rating' on Codeforces contests.

To quantify the above, I've collected some data from the recent Division 1 contests. For each contest, we gather the score of the contestant with a performance rating closest to 2727, and computed the number of the first n questions the LLM model need to be able to solve to achieve that score. The following assumption is made:

The LLM only solves problems in increasing order of difficulty (i.e. not a rainboy model);
If the LLM is able to solve a problem, then it solves it in its first try, therefore incurring no penalty for wrong submissions;
If the LLM is able to solve a problem, then it solves it within the first minute, therefore incurs no time-decay penalty. Note that extending this to e.g. 5 minute changes the outcome very little, as it only translates to a ~1% score decay.

The sum of the first n problems usually don't sum up exactly to the required score for the performance rating. As a result, the requirements reported are in the format of 'solving up to problem X with a probability p', where p is the probability that makes the score of the first X-1 problems plus p * score of problem X equal to the required score.

The results are illustrated in the following table:

Round Name | Score Required | Problems Needed to be Solved | Rating of the Threshold Problem | Link to the Threshold Problem

Codeforces Global Round 28 (Div 1+2) | 5984 | Solving F with 61.7% chance | — | 2048F Codeforces Round 990 (Div 1) | 2492 | Solving C with 82.8% chance | 2100 | 2046C Rayan Programming Contest 2024 — Selection (Codeforces Round 989, Div. 1 + Div. 2) | 7252 | Solving F1 with 50.1% chance | 2500 | 2034F1

Rev.	Кто	Когда	Δ	Комментарий
en14	raoxj	2024-12-21 15:19:31	139
en13	raoxj	2024-12-21 15:05:25	1043
en12	raoxj	2024-12-21 14:48:18	218
en11	raoxj	2024-12-21 14:44:59	172
en10	raoxj	2024-12-21 14:40:21	32
en9	raoxj	2024-12-21 14:34:28	39	Tiny change: 'ter range (very imp' -> 'ter range under an optimistic set of assumptions (very imp'
en8	raoxj	2024-12-21 14:33:48	140
en7	raoxj	2024-12-21 14:10:33	505	Tiny change: 'ntest.\n\nTLDR: <spoiler s' -> 'ntest.\n\n<spoiler s'
en6	raoxj	2024-12-21 05:11:21	888
en5	raoxj	2024-12-21 04:33:03	2	Tiny change: ' \| F1, 50.1% chance' -> ' \| F1, 65.1% chance'
en4	raoxj	2024-12-21 04:07:54	871
en3	raoxj	2024-12-21 02:53:00	402
en2	raoxj	2024-12-21 02:29:10	10307	Tiny change: 'test 2024 Selection (Codefor' -> 'test 2024 (Codefor' (published)
en1	raoxj	2024-12-21 01:46:43	3521	Initial revision (saved to drafts)

Round Name | Score Required | Problems Needed to be Solved | Rating of the Threshold Problem | Link to the Threshold Problem

История