Introduction
In tourist's recent blog post about a new strategy for some types of contests, there was an interesting comment from ftiasch. Apparently, some participants try to artificially boost their rating with the following strategy: solve problem C; if it goes fast and nicely, submit and continue with the contest, otherwise don't submit anything. The assumption is that the other problems in the contest are created by the same author, so if one performs well on problem C, they are more likely to perform well on the other problems.
In the comment thread, some people expressed doubts about this strategy. Reading their comments, I wondered: is there any point in discussing this with groundless arguments, when there is a large volume of historical contest results that we can simply analyse to check whether the strategy really works? Luckily, Codeforces even provides an API, so there is no need to implement custom scrapers.
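For reference, querying the API could look roughly like this (a minimal sketch: the `contest.standings` method is a real API endpoint, but `standings_url` and `parse_rows` are hypothetical helpers of mine, and the response shape shown in the docstring is simplified):

```python
from urllib.parse import urlencode

API_BASE = "https://codeforces.com/api"

def standings_url(contest_id, show_unofficial=False):
    """Build a contest.standings request URL (no auth needed for public data)."""
    params = {"contestId": contest_id, "showUnofficial": str(show_unofficial).lower()}
    return f"{API_BASE}/contest.standings?{urlencode(params)}"

def parse_rows(payload):
    """Extract (handle, points) pairs from a decoded contest.standings response."""
    if payload["status"] != "OK":
        raise RuntimeError(payload.get("comment", "API error"))
    rows = payload["result"]["rows"]
    return [(r["party"]["members"][0]["handle"], r["points"]) for r in rows]
```

Fetching is then just an HTTP GET of `standings_url(...)` plus `parse_rows` on the decoded JSON.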
Collecting and cleaning up data
For the dataset, I've used all the contests with CF rules (it doesn't make sense to evaluate the strategy on other rulesets). Even among CF-rule contests there is a lot of variety; in particular, sometimes there are more than 5 problems. For the sake of the analysis, I've truncated all problem sets to the standard "A-B-C-D-E" format (discarding F, G and H when present). I've also removed all users with 0 points from the standings, to reduce noise.
The next question is which users to compute statistics over. It is very likely that things work differently for Div2 users than for Div1 ones. Also, orange users are likely to have noisier and less consistent results than red and nutella ones. That's why I restricted the stats computation to users with rating >= 2400 at the time of scraping (note, though, that all users are considered when computing contest-level stats).
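A cleanup pass along these lines could look as follows (a sketch: the row shape `(handle, total_points, per_problem_scores)` is an assumption of mine, not necessarily how the real pipeline stores the data):

```python
def clean_standings(rows, problem_indices):
    """Truncate each row to problems A-E and drop users with 0 total points.

    rows: list of (handle, total_points, per_problem_scores) tuples.
    problem_indices: string of problem letters as they appear in the contest.
    """
    keep = [i for i, idx in enumerate(problem_indices) if idx in "ABCDE"]
    cleaned = []
    for handle, points, per_problem in rows:
        if points > 0:  # users with 0 points are noise for this analysis
            cleaned.append((handle, points, [per_problem[i] for i in keep]))
    return cleaned
```

The rating filter (>= 2400) would then be applied per handle when computing user-level stats.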
Verifying if the strategy works
First look at the data
Let's look at the scraped dataset. We have 695 contests (many of those are Div2 though, so they don't affect our stats). We have 107975 total users, but only 383 of them are red.
Let's see how red coders perform on different problems. First, what are the success rates for each problem? Here is the data (95% confidence intervals are shown):
A | B | C | D | E |
---|---|---|---|---|
0.91 ± 0.00 | 0.80 ± 0.01 | 0.59 ± 0.01 | 0.35 ± 0.01 | 0.15 ± 0.00 |
As we can see, the rates are monotonic (so the popular claim "E is much easier than D!11" isn't true on average). What other stats are there besides success rates? An interesting one is how often a user's solution places among the top solutions in the contest (e.g. by score). The data looks as follows (with "top" defined as top 10%):
A | B | C | D | E |
---|---|---|---|---|
0.38 ± 0.01 | 0.38 ± 0.01 | 0.37 ± 0.01 | 0.29 ± 0.01 | 0.14 ± 0.00 |
An interesting observation is that for problem E, the top-10% rate is almost the same as the accept rate, while for problem A, these rates are very different.
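The top-10% flags for a single problem in a single contest can be computed roughly like this (a sketch; how ties among equal scores are broken is an arbitrary assumption here):

```python
def top_decile_flags(scores):
    """Mark which entries fall in the top 10% of scores for one problem."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # best first
    cutoff = max(1, len(scores) // 10)  # at least one participant counts as "top"
    flags = [False] * len(scores)
    for i in order[:cutoff]:
        flags[i] = True
    return flags
```

The per-user rate is then just the mean of these flags over all contests the user participated in.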
A very naive attempt
The first idea for evaluating the strategy is to estimate the following: given that someone solves problem C, how much more likely are they to also solve some other problem? We could think for a moment and discard this idea right away, but for the sake of the analysis, let's compute the stats first. Here are the delta rates (i.e. P[one solves problem X | problem C is solved] — P[one solves problem X]):
A | B | C | D | E |
---|---|---|---|---|
+0.01 ± 0.00 | +0.05 ± 0.01 | +0.41 ± 0.00 | +0.09 ± 0.01 | +0.06 ± 0.01 |
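A delta of this kind boils down to a one-liner over paired per-round outcomes (a sketch using 0/1 outcome lists; confidence intervals are omitted for brevity):

```python
def lift(solved_x, solved_c):
    """P[solved X | solved C] minus P[solved X], over paired per-round outcomes."""
    p_x = sum(solved_x) / len(solved_x)
    conditioned = [x for x, c in zip(solved_x, solved_c) if c]  # rounds where C was solved
    return sum(conditioned) / len(conditioned) - p_x
```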
As we can see, all the lifts are above the confidence intervals (the lift for C is huge, which is expected — its rate is lifted to 1.0). But does that mean we've determined that the strategy indeed works? One could say the answer is obviously NO: there is a strong bias towards contest difficulty. For easier problemsets, we're more likely to solve C as well as the other problems. That is, this data just shows that "solving problem C" might be a proxy for the "problemset is easy" property, which indeed means we solve more problems, but doesn't mean we perform better.
Slightly less naive attempt
As we have seen in the previous section, estimating our performance on problems without considering other participants isn't what we want. But what about comparing ourselves with everyone else for each problem separately? We've already computed the data for "being in the top 10% of scores for a given problem"; let's compute the deltas conditioned on "we're in the top 10% of scores for problem C":
A | B | C | D | E |
---|---|---|---|---|
+0.15 ± 0.01 | +0.19 ± 0.01 | +0.63 ± 0.00 | +0.18 ± 0.01 | +0.10 ± 0.01 |
So, the strategy works after all? Not necessarily: one can argue that we're hitting a different bias. If a contestant solves problems sequentially, then quickly solving A and B (and therefore having good scores on them) automatically increases the score on C (as well as on D and E). So these huge lifts are expected, and we again estimated not what we wanted.
Proper^W Even less naive attempt
Finally, let's estimate what we want. From the very beginning, we needed to understand what happens to "performance" on problem X given good "performance" on problem C. Here, "performance" should be independent of the other problems (as we have seen in the previous section, score doesn't work for this). A natural measure is the time spent on this particular problem, but we don't know it directly. An easy way to estimate it is to sort all of a user's submission times and take the differences. We assume that participants solve problems one by one in some order. Of course this isn't always true (I've seen successful submissions 5 seconds apart in the dataset), but it should mostly hold.
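The time-estimation step can be sketched as follows (assumptions of mine: one accepted submission per problem, keyed by problem letter, with times measured in minutes from the contest start):

```python
def per_problem_times(accepted, contest_start=0):
    """Estimate time spent per problem: sort accepted-submission times and take
    consecutive differences, assuming problems are solved one at a time."""
    times = {}
    prev = contest_start
    for problem, t in sorted(accepted.items(), key=lambda kv: kv[1]):
        times[problem] = t - prev
        prev = t
    return times
```

E.g. accepted submissions at minutes 10, 25 and 60 yield estimated times of 10, 15 and 35 minutes for the first, second and third problem solved.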
OK, now we define "performance on a problem" as "time to solve this particular problem", estimate it from the submission log, and define "good performance" as "being within the best 10% of times for this problem". Without conditioning on problem C, the "good performance" rates look as follows:
A | B | C | D | E |
---|---|---|---|---|
0.34 ± 0.01 | 0.37 ± 0.01 | 0.36 ± 0.01 | 0.28 ± 0.01 | 0.14 ± 0.00 |
And deltas look as follows:
A | B | C | D | E |
---|---|---|---|---|
+0.14 ± 0.01 | +0.19 ± 0.01 | +0.64 ± 0.00 | +0.19 ± 0.01 | +0.11 ± 0.01 |
So, it seems that the strategy indeed works. The effect looks rather significant, especially for harder problems (almost doubling the probability of performing well on E). Are we done with the analysis?
Debiasing individual strength
Apparently not. At this point, I thought the remaining biases weren't significant. But as Um_nik pointed out in the comments, this is still biased by the individual strength of participants: if someone is strong, they're more likely to solve C as well as the other problems.
What to do? OK, let's compute similar stats (top 10% by time for a given problem) at the user level. This naively assumes that user strength is constant over time, but let's ignore this effect for now. Confidence intervals will be huge, but we can aggregate the user-level numbers into one final number, weighting by the number of observations. The final lifts look as follows (don't quote me on the confidence intervals):
A | B | C | D | E |
---|---|---|---|---|
+0.11 ± 0.00 | +0.16 ± 0.00 | +0.64 ± 0.01 | +0.16 ± 0.00 | +0.09 ± 0.00 |
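The aggregation is a plain observation-weighted mean of the per-user lifts (a sketch; `per_user_lifts` as a list of `(lift, n_observations)` pairs is a hypothetical representation of mine):

```python
def weighted_average(per_user_lifts):
    """Aggregate per-user lifts into one number, weighting by observation count."""
    total = sum(n for _, n in per_user_lifts)
    return sum(lift * n for lift, n in per_user_lifts) / total
```

Weighting by observations means prolific participants dominate less noisy, rarely-participating ones, which is the point of the per-user aggregation.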
The deltas are smaller than before, but still positive. We can also explore stats on a per-user basis. E.g., this is how the raw stats look for some interesting users:
User | A | B | C | D | E |
---|---|---|---|---|---|
Petr | 0.76 ± 0.08 | 0.79 ± 0.07 | 0.75 ± 0.08 | 0.75 ± 0.08 | 0.52 ± 0.09 |
tourist | 0.80 ± 0.07 | 0.82 ± 0.07 | 0.81 ± 0.07 | 0.78 ± 0.08 | 0.54 ± 0.09 |
Um_nik | 0.51 ± 0.09 | 0.50 ± 0.09 | 0.54 ± 0.09 | 0.41 ± 0.09 | 0.22 ± 0.08 |
And here are the stats conditioned on doing well on problem C:
User | A | B | C | D | E |
---|---|---|---|---|---|
Petr | 0.82 ± 0.08 | 0.83 ± 0.08 | 1.00 ± 0.00 | 0.76 ± 0.09 | 0.58 ± 0.11 |
tourist | 0.81 ± 0.08 | 0.88 ± 0.07 | 1.00 ± 0.00 | 0.81 ± 0.08 | 0.56 ± 0.10 |
Um_nik | 0.61 ± 0.12 | 0.64 ± 0.12 | 1.00 ± 0.00 | 0.62 ± 0.12 | 0.36 ± 0.12 |
So, even for them, lifts exist, but mostly within the confidence intervals.
Debiasing contest composition and individual strength growth
OK, are we done yet? Apparently not :( As Petr mentioned in the comments, there is yet another bias: it is easier to perform well in a contest where not too many strong people participate. At this point I ran out of ideas (maybe readers have some), and decided to use rating as a proxy for strength. We were using the 10th percentile as the definition of good performance — let's change it: sort all participants by rating, and say you're performing well if your time for this problem is better than sorted_times[your rank by rating].
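For a single problem, the new definition can be sketched like this (a hypothetical helper of mine: `times` and `ratings` map handles to solve times and ratings, with rank 0 being the highest-rated participant):

```python
def beats_rating_rank(times, ratings, user):
    """'Good performance': the user's solve time beats sorted_times[rank by rating]."""
    sorted_times = sorted(times.values())  # fastest first
    rank = sorted(ratings, key=lambda h: -ratings[h]).index(user)  # 0 = top-rated
    return times[user] < sorted_times[rank]
```

So the top-rated participant only "performs well" by strictly beating the fastest time, while lower-rated participants have progressively easier targets.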
The raw data for top participants looks rather sad (it is hard to perform better than your rank by rating if you are first by rating; I'm still surprised, though, that e.g. tourist has a 0.12 ratio for problem A — i.e., he solves A faster than everyone 12% of the time; I hope it is not due to a bug :) ):
User | A | B | C | D | E |
---|---|---|---|---|---|
Petr | 0.05 ± 0.04 | 0.14 ± 0.06 | 0.11 ± 0.06 | 0.26 ± 0.08 | 0.56 ± 0.09 |
tourist | 0.12 ± 0.06 | 0.05 ± 0.04 | 0.12 ± 0.06 | 0.16 ± 0.07 | 0.38 ± 0.09 |
Um_nik | 0.07 ± 0.05 | 0.04 ± 0.03 | 0.07 ± 0.05 | 0.12 ± 0.06 | 0.41 ± 0.09 |
Average lifts:
A | B | C | D | E |
---|---|---|---|---|
+0.05 ± 0.00 | +0.08 ± 0.00 | +0.69 ± 0.01 | +0.13 ± 0.00 | +0.11 ± 0.00 |
Also, here is the data for top participants conditioned on problem C (the confidence intervals are huge, so it is mostly meaningless):
User | A | B | C | D | E |
---|---|---|---|---|---|
Petr | 0.17 ± 0.21 | 0.25 ± 0.24 | 1.00 ± 0.00 | 0.50 ± 0.28 | 0.67 ± 0.27 |
tourist | 0.15 ± 0.20 | 0.15 ± 0.20 | 1.00 ± 0.00 | 0.23 ± 0.23 | 0.46 ± 0.27 |
Um_nik | 0.25 ± 0.30 | 0.00 ± 0.00 | 1.00 ± 0.00 | 0.25 ± 0.30 | 0.62 ± 0.34 |
A technical note: ratings were taken at the time of scraping. Using ratings at the time of each contest would remove the bias of strength changing over time.
So, in the end, it seems we have enough evidence to say that the strategy works. How ethical it is to use is a separate question (one just gets a constant boost to their rating, at the cost of not having fun solving more contests, and also likely feeling guilty), but that is outside the scope of this post.
Conclusion
Using some basic data analysis, we determined that the described strategy indeed works quite well. There are many open questions though:
- Is the explanation "it is easier to solve problems from certain authors" correct? Another potential explanation is that someone is just having a good day (slept well, etc.), and "quickly solving C" is a proxy for it;
- Does this generalize to div-2 participants?
- Is it true that top participants perform worse on "purple" rounds?
- Is there such thing as "affinity" of participant and problem author? Is it correlated with rating difference?
- ... and many other questions.
It would be very interesting to see more data analysis of contest results — it might help to understand problem-solving process better and develop more interesting strategies.
Thanks a lot for putting this analysis together!
I am not so sure that your final experiment shows whether the strategy works. Here's a bias that would also show up in such an experiment: suppose more people rated higher than you participate in the contest. Then you're less likely to be in the top 10% for problem C, and less likely to be in the top 10% for any other problem, so there would be a measurable correlation.
Or maybe you're saying that top 10% is large enough number that the fluctuations on the overall composition of contestants are negligible?
It would be great if you could share your analysis in a reproducible way (was it a Jupyter notebook?); then the barrier to entry for doing such analyses would be greatly lowered for others :)
Actually, it seems that this bias should definitely be present and significant. Higher rated contestants are more likely to be in top 10% on any problem, so we would see a non-zero correlation just because of that?
True, we should also remove this individual strength bias somehow... Let me think about it.
OK, updated the post. Addressed your concern by changing the definition from "top 10%" to "better rank on this problem than your rank by rating".
I think that all that you prove is that strong participants are strong (if you are good at problem C then you will be good at other problems too).
Let's change 'problem C' for 'problem D' or even 'problem B' and see if the effect disappears.
Yes, there is such a bias. I didn't think it was significant, but apparently it is. So this needs more work.
FWIW, similar deltas conditioned on other problems:
A:
B:
D:
E:
OK, updated the post. Now I aggregate data on user level, and then average it. This should remove individual strength bias.
Wow, nice analysis. That's a lot of work. Would you mind sharing the cleaned-up data? It would be nice to play with it. Do you think it would be hard to check the correlation with the number of writers for the rounds?
I could point to one more reason why the said strategy might work for some people even if the claim about correlation is not true.
There is a skill/rating range where you solve (Div1) A and B quite consistently. The gain/loss of rating is usually decided by whether you solve C and how fast you do it (including the time lost on A+B). In such a case, solving C quickly gives you a high chance of having A+B+C at the end of the contest, which is enough for a rating gain. Obviously, you also increase your chances of getting D or E (because you have more time for them), but I don't think that's the biggest gain here.
One more thing: the optimal order of solving problems (given infinite contest length) is to sort them by [time to solve]/[max score]. In case of a good performance on C, it might even turn out that starting with it was the optimal choice.
Since two people (you and Petr) have asked for the raw data already, I guess I'll share it (data + Jupyter notebook), but first I want to clean it up :) It's nicer to share a notebook with best practices than messy code with hacks — otherwise people might learn bad things from it.
Interesting analysis. I'd like to add my point of view.
I am orange, and looking at my record you can see that I haven't been quite successful in crossing that red barrier. I tend to do quite well on A and B, but struggle on C. So I considered the same strategy, but not because I view C as an indicator of how I would do on the other problems. It is because C gives me too much variance: either I solve it and end up in the top 100, gaining a solid amount of rating points, or I don't and lose.
I have decided not to go for this strategy because of its obvious drawbacks:
Nice, very logical answer.
I used to solve problems one by one starting from problem A, and I have been using a strategy similar to the one ftiasch mentioned in the last three contests. I'd like to share my point of view.
My strategy works like this. First, open problem A. If I come up with an easy-to-code solution very quickly, I solve it first. Otherwise, I read C, D and E, find the problem I'm good at, and try to solve it. After this, I go back to solve problems A and B if I can.
1. Compared to just trying to solve the harder problems first, my strategy avoids losing too much score on the easy problems. It also helps me avoid the situation where I fail to solve the hard problems and get very low points on the easy ones.
2. As CF contests often take place at midnight in China, I always feel sleepy during the second hour. Problems C, D and E usually require a lot of thinking and coding, so it's much harder for me to solve them during the second half of the contest. I used to end up with just problems A and B and get a low rank. So solving the harder problems first might get more problems solved.
3. Solving harder problems often costs me a lot of time. By then, checking the standings, I can get a grasp of the difficulty of the unsolved problems. Then I can plan how to score as high as I can during the rest of the contest.
Very good one!
Yes. It is an excellent, mathematical, and exact analysis. A very appropriate analysis. About 2 months ago, I made an analysis and wrote blogs about the mystery series of distribution graphs (Link1, Link2). But I did not analyse the expected values, ratios, ± values, and others. In addition, this blog covers many points, for example: naive attempts, debiasing individual strength, debiasing contest composition and individual strength growth. I think your analysis is 10^9 + 7 times better than my analysis.
I think everyone expected that I would write a comment on this blog. Needless to say, it was very helpful for me. I learned a lot from this blog. If I make an analysis in the future, I will use this style very much.
Thank you, ilyakor, and everyone who gave me advice on the mystery series!