nik_exists's blog

By nik_exists, 6 weeks ago, In English
Prompt

CF 800-1200

CF 1500-1700

Overall, R1 performed fairly well, especially for an open-source model that was made as a side project. Its pitfalls were similar to o1's, though it also had the issue of not following instructions at times. Additionally, for the 800-1200 rated problems, I used screenshots to send the statements, but it misread the screenshot during "Maximize MEX", and I had to switch to copying and pasting the statement into the text box.

With that being said, the model, while very impressive, doesn't seem to show anything we haven't already seen with o1, and I doubt any radical changes will have to be made to reduce cheating, similar to what happened when o1-mini was initially released. We've already started to see problemsetters attempt to reduce GPT cheating (see the Round 1000 Anti-LLM report), and assuming this becomes more of a trend in the future, the ability to cheat with these tools should hopefully be diminished to a large degree (at least until o3 comes out, but that's another story).

As a quick side note, LLMs are not deterministic, meaning your results might not be the same as mine here (though I'd suspect them to be fairly similar).


»
6 weeks ago, # |


»
6 weeks ago, # |

How are LLMs not deterministic?? From my understanding the set of weights is the same, so the output should be the same for the same input.

  • »
    »
    6 weeks ago, # ^ |

    (Please correct me if I'm wrong!) As far as I know, most LLMs you can chat with online use randomized seeds so that users don't get the same response every time, as well as a parameter called temperature, which determines how random the responses can be.
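
    As a rough sketch of why the seed matters, here is a toy next-token sampler in Python (the tokens and weights are made up for illustration):

    ```python
    import random

    def sample_token(tokens, weights, seed=None):
        # Toy next-token sampler: with a fixed seed the draw is
        # reproducible; without one, repeated calls can differ.
        rng = random.Random(seed)
        return rng.choices(tokens, weights=weights, k=1)[0]

    tokens = ["yes", "no", "maybe"]
    weights = [0.5, 0.3, 0.2]

    # Same seed -> same "random" choice every time; a chat service
    # that randomizes the seed per request loses this reproducibility.
    assert sample_token(tokens, weights, seed=42) == sample_token(tokens, weights, seed=42)
    ```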

  • »
    »
    6 weeks ago, # ^ |

    I think it is like simulated annealing (SA).

  • »
    »
    6 weeks ago, # ^ |

    Under the hood, the LLM does not directly output an "answer" given the input. Instead, it outputs a probability distribution over the set of possible next tokens. The answer is then obtained by taking a sample from that probability distribution. Of course, you can also just take the token that has the highest probability, and this can be configured for some models.
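
    A minimal sketch of that sampling step, using made-up logits over a toy vocabulary (real models do this over tens of thousands of tokens):

    ```python
    import math
    import random

    def softmax(logits, temperature=1.0):
        # Convert raw scores (logits) into a probability distribution.
        # Lower temperature sharpens it; higher temperature flattens it.
        scaled = [x / temperature for x in logits]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    tokens = ["cat", "dog", "fish"]
    logits = [2.0, 1.0, 0.1]  # hypothetical model outputs

    probs = softmax(logits)

    # Greedy decoding: always take the most probable token (deterministic).
    greedy = tokens[probs.index(max(probs))]  # -> "cat"

    # Sampling: draw from the distribution (non-deterministic unless seeded).
    sampled = random.choices(tokens, weights=probs, k=1)[0]
    ```

    With greedy decoding the same input always yields the same output; with sampling, different runs can diverge, which is why chat results vary between users.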