Google Code Jam Difficulty Estimation — 2021 Qualification Round

Have you ever wondered how difficult Code Jam problems are on the Codeforces scale? Me too.

I made a simple estimate for the 2021 qualification round (link), and I plan to do the same for the upcoming rounds. Here I share the process and the results, and I welcome any feedback.

Data Cleaning

I use two data sources:

- CJ contest result data, downloaded using vstrimaitis's code (see details in his great blog post). From this I get the list of contest participants and which problems they solved.
- CF user data, downloaded using the CF API. From this I get the current rating of every CF coder.

I assume that many coders use the same username across different platforms. If, for a given CJ contestant, I find a CF user with the same name (case-insensitive), I assume they are the same person. I assign to each matched CJ participant the rating of the corresponding CF coder, and I discard all unmatched participants.
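The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used for the post: the function name `match_ratings` and the input shapes (a list of CJ names and a dict from CF handle to rating) are assumptions made for the example.

```python
def match_ratings(cj_names, cf_ratings):
    """Map each Code Jam name to the rating of a same-named CF user.

    cj_names:   iterable of Code Jam usernames (hypothetical input shape).
    cf_ratings: dict of CF handle -> current rating (hypothetical input shape).
    """
    # Index CF handles by their lowercase form for case-insensitive lookup.
    by_lower = {handle.lower(): rating for handle, rating in cf_ratings.items()}

    matched = {}
    for name in cj_names:
        rating = by_lower.get(name.lower())
        if rating is not None:
            matched[name] = rating  # keep only contestants with a CF match
    return matched
```

Contestants without a match, such as someone who competes under a different name on each platform, simply do not appear in the returned dict.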

Difficulty Estimation

The formula used by CF to determine the difficulty of a problem is not public. However, the main idea is that you have a 50% probability of solving any problem whose difficulty equals your rating. Some details here. So I divide contestants into buckets 100 rating points wide (coders rated 1450 and 1549 both fall in the 1500 bucket), and I look for the bucket that had a 50% success rate. That is my estimate of the difficulty of the problem. I group together all ratings above 3000, and likewise all ratings below 500, because otherwise the sample size would be far too small.
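A rough sketch of the bucketing idea, under stated assumptions: the names `bucket_of` and `estimate_difficulty` are hypothetical, the extreme groups are approximated by clamping bucket centres to 500 and 3000, and "the bucket with a 50% success rate" is read as the lowest bucket whose success rate reaches 50%. The post does not spell out these details, so treat them as one possible reading.

```python
from collections import defaultdict


def bucket_of(rating, lo=500, hi=3000, width=100):
    """Round a rating to the nearest multiple of `width`, clamped to [lo, hi]."""
    # 1450 and 1549 both map to 1500.
    centre = ((rating + width // 2) // width) * width
    return max(lo, min(hi, centre))


def estimate_difficulty(results):
    """Estimate the difficulty of one problem.

    results: iterable of (rating, solved) pairs, one per matched contestant.
    Returns the lowest bucket whose success rate is at least 50%,
    or None if no bucket reaches it (assumed tie-breaking rule).
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for rating, ok in results:
        b = bucket_of(rating)
        total[b] += 1
        if ok:
            solved[b] += 1

    for b in sorted(total):
        if solved[b] / total[b] >= 0.5:
            return b
    return None
```

With this reading, a problem that even the lowest bucket solves at least half the time gets an estimate of 500, which corresponds to the "<= 500" entries in the results.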

Results

Out of the 37398 contestants who submitted something during the qualification round, 11109 have a CF user with the same name. Here is their success rate on the different problems:

Estimated difficulty:

- A: <= 500
- B1: <= 500
- B2: <= 500
- B3: 2000
- C1: 600
- C2: 1400
- D1: 2400
- D2: 2700
- E1: 2600
- E2: 3000

Estimation issues

- Matching profiles across platforms by username is a bold assumption. I am discarding many coders, for example tourist, who competes as Gennady.Korotkevich in GCJ. And, even worse for the estimate, I am probably matching some profiles that belong to different people.
- This was a qualification round where you only needed to score a minimum number of points to pass, with little incentive to do more. Many strong contestants didn't seem to care about solving all the problems. See LHiC for example, who solved only problems E1 and E2. This lowers the observed success rate and inflates the estimated difficulty.

Any thoughts?

Rev.	By	When	Δ	Comment
en7	areo	2021-04-11 15:05:47	549
en6	areo	2021-04-02 21:02:01	0	(published)
en5	areo	2021-04-02 21:01:17	1108	Tiny change: 'td,dr: A 900, \' -> 'td,dr: A 900, \'
en4	areo	2021-04-02 20:37:22	2249	Tiny change: '\frac{1}{1}$\n\n\nWh' -> '\frac{1}{1+10^\frac{R-D}{400}}$\n\n\nWh'
en3	areo	2021-04-02 15:23:43	1828
en2	areo	2021-04-02 14:40:39	2
en1	areo	2021-04-02 14:40:01	1046	Initial revision (saved to drafts)
