New Google competitive programming model seems to have a potential rating between Expert and CM

15 months ago, # |

← Rev. 11 →

+129

Sounds interesting, though taking 10 attempts to solve 43% of all problems does not seem to be around the 85th percentile if you think about WA/TLE penalties (they did seem to take into account time penalties though).

Did they correct for test data leakage, since they mentioned that they fine-tuned on the Codeforces dataset? They mentioned that Gemini was the model that made all the difference, so I am curious as to whether that model was trained on code or editorials for the remaining problems (since it is easy to overfit on the tiniest of data when you have huge models).

Also curious about which divisions the contests on which AlphaCode 2 performed better than 99.5% of the participants were for. I would be really impressed if they could do well on the latter half of a Div1 contest without any test data leakage, since Div2/3 contests seem to have problems with more limited ideas than Div1 (especially Edu Rounds, which are supposed to be standard in some sense, and thus are more "learnable" by machine learning models).

To be honest, sampling a million concrete code samples and sifting through them to make clusters (which is what they do) sounds like a really bad approach for a human or any algorithm that claims to have "learnt" something — at some point it just devolves into guessing and tuning your ranking functions. This is in the same vein as "generative" models not generating things based on reasoning, but based on some transfer distribution that comes from probabilities, which is a far cry from being verifiable. TL;DR — in its current form, it seems like proof by AC and that the models are not really learning anything in a rigorous sense. I don't expect any PAC-style estimates relating these models to mathematical rigor any time soon, but would be pleasantly surprised if there exist such estimates. Of course, the ideal scenario is to solve the IMO Grand Challenge.
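To make the "sample, sift, cluster" idea above concrete, here is a minimal sketch, not the actual AlphaCode 2 pipeline. It assumes each generated program can be modeled as a function from one integer to one integer; the names `Candidate`, `filter_by_samples` and `cluster_by_behavior` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a generated program: one integer in, one integer out.
using Candidate = std::function<long long(long long)>;
using Cluster = std::vector<std::size_t>;

// Keep the indices of candidates that reproduce every (input, expected output) sample.
std::vector<std::size_t> filter_by_samples(
    const std::vector<Candidate>& candidates,
    const std::vector<std::pair<long long, long long>>& samples) {
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        bool ok = true;
        for (const auto& sample : samples) {
            if (candidates[i](sample.first) != sample.second) { ok = false; break; }
        }
        if (ok) kept.push_back(i);
    }
    return kept;
}

// Group the surviving candidates by their outputs on extra probe inputs.
// Candidates with identical output vectors share a cluster; largest cluster first.
std::vector<Cluster> cluster_by_behavior(
    const std::vector<Candidate>& candidates,
    const std::vector<std::size_t>& kept,
    const std::vector<long long>& probes) {
    assert(!probes.empty());
    std::map<std::vector<long long>, Cluster> groups;
    for (std::size_t idx : kept) {
        std::vector<long long> signature;
        for (long long x : probes) signature.push_back(candidates[idx](x));
        groups[signature].push_back(idx);
    }
    std::vector<Cluster> clusters;
    for (auto& entry : groups) clusters.push_back(std::move(entry.second));
    std::stable_sort(clusters.begin(), clusters.end(),
                     [](const Cluster& a, const Cluster& b) { return a.size() > b.size(); });
    return clusters;
}
```

One would then submit a representative from each of the largest clusters. Nothing in this procedure checks that a candidate is correct beyond the samples, which is the objection raised above.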

Edit: there is also the AIMO prize with the prize fund of $10^7.

Edit: Since this is the top comment, it is worth noting here that there is a strong chance of test data leakage, because of the following:

The AI solution advertised: https://codeforces.net/contest/1810/submission/234586634

Someone's solution: https://codeforces.net/contest/1810/submission/204262433

Someone else's solution: https://codeforces.net/contest/1810/submission/200078083

I'm told that the original code for the last of the above submissions is by a Chinese competitive programmer and the code was posted on a Chinese blog website (seems very plausible given that this happens for a lot of hard problems, though I do not have an exact link to validate the claim), so test data leakage from other websites is also possible.

I'm even sure there is a solution that is much closer to this solution than the above one, but still the structure being that similar (even though there are other solutions with a distinct structure) provides compelling evidence to start investigating whether these claims suffer from test data leakage or not.

I also looked into the other AI submissions (supposedly from their other alts), and they seem to be other minor changes on top of the other submissions.

Obviously, the best way to see whether their model can successfully solve competitive programming problems is to test their performance in real contests.

→ Reply

15 months ago, # ^ |

In the report they say that they tested on recent Div. 2 and Div. 1 + 2 contests. If the model solves 43%, then it means that, roughly speaking, it solves 3-4 problems in Div. 2 rounds and at most 4-5 problems in Div. 1 + 2 rounds. If so, then this model can solve at most 2 problems in Div. 1 rounds.

→ Reply

15 months ago, # ^ |

← Rev. 2 →

That is assuming that the same kind of semantic limitations that apply to human beings also apply to AI. Comparing it to chess, it seems that there would be competitive programming problems that would inherently be easier for an AI to solve (owing to their solution methodology), but not for a human, and vice versa.

→ Reply

15 months ago, # ^ |

Agree, it might happen; my estimates are not precise. Personally, I hope that the model that surpasses humans in CP will not sample millions of "solutions" and select the best one. Although, as I understand it, AlphaZero (as well as Stockfish) just computes many chess lines and uses AI to evaluate the positions at the end of every line, which is also a kind of brute force.

→ Reply

15 months ago, # ^ |

I agree, though I guess most people will agree that the best way to measure the performance of any model in competitive programming is to actually try it out during contests and let it get to a stable rating.

Still not the best measure for reasoning, since you can guess solutions for a lot of easier problems reasonably quickly.

→ Reply

mEmory_liMI1

15 months ago, # ^ |

There is also the possibility that it solves a harder problem but fails problem A in Div. 3 simply because it does not understand the problem statement. According to their presentation, it could solve problem G from a Div. 1 + 2 CodeTON Round, which only 21 people could solve during the round.

→ Reply

15 months ago, # ^ |

Probably the fact that they used the last rounds for inference indicates that they tried to avoid leakage?

→ Reply

15 months ago, # ^ |

As TwentyOneHundredOrBust points out below, it is possible that they used tags. But if they just use it as an augmentation technique, it seems it avoids leakage through those means.

Also, it is possible that the last few rounds are not really that recent (it is completely possible that the data used for training included those samples too).

→ Reply

silxi

15 months ago, # ^ |

This is a response from one of the authors to concerns about test data leakage: https://twitter.com/RemiLeblond/status/1732677521290789235

→ Reply

15 months ago, # ^ |

Thanks for the link, I had already seen this tweet before commenting on the official response on this blog below. I can only wait till they substantiate their claims with live contest performance.

I am just curious whether their claims in their paper hold up or not because I am not very convinced by their arguments (it is very hard to avoid contamination when it comes to the internet), so the best way is to test on something that is impossible to see yet, which is the future unless someone creates a time machine.

→ Reply

silxi

15 months ago, # ^ |

I think we all agree with you about live contest validation. Personally I am also a bit skeptical of their claims (though it wouldn't be the first time I underestimated LLMs recently).

→ Reply

15 months ago, # |

← Rev. 2 →

+40

We also randomize targeted metadata included in the prompt, such as the problem difficulty rating and its categorical tags

Isn't that answer leakage?

Where is the appendix that explains how the evaluation is done and how training data leakage was avoided? Did this gigantic foundational Gemini model already see the editorials in its training data? Does solving 43% of problems really correspond to 1800 Elo? (Solving ABC fast in Div. 2 doesn't seem like it would.)

This would hold a lot more credibility if they just let the model run in a couple of live contests.

This report is very disappointing. The AlphaCode 1 paper was of significantly higher quality than this. Hopefully they release a much more thorough version.

→ Reply

15 months ago, # ^ |

1) Probably the fact that they used the last rounds for inference indicates that they tried to avoid leakage? 2) If they sample millions of solutions, then tag information does not matter much, since they can sample, say, 100 million solutions to cover all tags.

→ Reply

15 months ago, # ^ |

The Gemini foundational model is trained on a ton of data. Trying to avoid leakage isn't the same as avoiding it. How recent were these rounds? Weeks, months? We don't know, because it wasn't described. They mentioned combined Div. "1+2" rounds; the second-to-last one of those was in September. Do we know that Gemini is at least 3 months out of date?

It would be much more reliable to see it in live contests, where it's guaranteed that there's no leakage.

→ Reply

ecnerwala

15 months ago, # ^ |

+41

The "randomized tags" means that when they sampled, they randomly tried giving it different subsets of tags (e.g. "try solving this as a dp problem", "try solving this as a greedy problem") as a way of introducing diversity into the samples. I'm pretty sure they didn't "cheat" by knowing the correct tags; they just tried them all. AlphaCode 1 did the same thing.
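As a rough illustration of what randomizing tags per sample could look like, here is a sketch under stated assumptions. The 1/2 inclusion probability, the header format and the names `random_tag_subset` and `make_prompt_header` are all hypothetical; the report does not describe them.

```cpp
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Pick a random subset of the tag pool: each tag is included independently
// with probability 1/2 (an assumption), regardless of the problem's true tags.
std::vector<std::string> random_tag_subset(const std::vector<std::string>& tag_pool,
                                           std::mt19937& rng) {
    std::vector<std::string> chosen;
    for (const std::string& tag : tag_pool) {
        if (rng() & 1u) chosen.push_back(tag);
    }
    return chosen;
}

// Build a hypothetical prompt header with a random rating and a random tag subset.
std::string make_prompt_header(const std::vector<std::string>& tag_pool,
                               int min_rating, int max_rating, std::mt19937& rng) {
    assert(min_rating <= max_rating);
    std::uniform_int_distribution<int> rating(min_rating, max_rating);
    std::string header = "RATING: " + std::to_string(rating(rng)) + "\nTAGS:";
    for (const std::string& tag : random_tag_subset(tag_pool, rng)) header += " " + tag;
    return header;
}
```

Because the subset is drawn without looking at the problem's real tags, the metadata carries no information about the answer; it only pushes different samples toward different techniques.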

→ Reply

indian_rounds_LULW

15 months ago, # |

+22

Looks like this could be the model in question (thanks to this comment)

Curious how it can flop on problems A-C and make 5+ WA submissions in a row (an average participant would have already quit at that point), yet solve a 3000+ problem on the first try.

Nonetheless, that is some impressive stuff, but until these LLMs learn how to apply binary search on the answer, I think competitive programming as we know it is relatively safe.

Waiting for a more detailed report from DeepMind, or just an opportunity to test the Gemini model in action (which I believe is just a foundation for AlphaCode 2)

→ Reply

TimonKnigge

15 months ago, # ^ |

← Rev. 2 →

+16

If you go to the contest's status page and filter for Problem G, it's pretty easy to find a bunch of users that made dozens of quite distinct submissions within the same minute. Not sure if they're the model or some other automated system though.

For example -- purringgrattini, lopsidedtuffoli, faultyfiori, silentmaltagliati, huskyjackrabbit

→ Reply

15 months ago, # ^ |

← Rev. 2 →

+12

Here is a longer list, based mostly on test case 11 TLE submissions to 1832E - Combinatorics Problem (for some reason these bots love to spam TLE on that problem). This list is ordered roughly from oldest to newest account.

NastyChicken FriendlyIceberg UncertainBreath StylingSequence crowdeddingo OverratedBlackbird BeautifulDisturbance DazzlingFlyingfish ProperOstrich DearestPolecats crowdeddingo largevole squigglyponie wellmadehinds DiscolorAbsence slimyhamster invinciblecolobus varioussmelt beautifulgiraffe combativeoryx helplesscoati allegedlapwing welltodoraccoon squalidplover pushyhornet nonstopmeerkat unstablebuzzard grippingkookaburra discretebuffalo phobicgrouse frizzywhiting unfriendlyspoonbill dependentturtledove balanceddinosaur uptighttuna unusualpeacock huskyjackrabbit absoluteferret elitepintail mortifiedmare handylemur rhetoricalchough disloyalibis uglythrushe warlikevulture disloyalibis jovialbaboon frivolousgrattoni gorgeousant valuablemezzelune spiffytufoli carefulsorprese sphericaltortelli purringgrattini silentmaltagliati swelteringtrottole fearlesspassatelli speedyquadrettini lastspaghettini unkemptptitim faultyfiori lopsidedtuffoli apathetichassium rattylithium buoyantrhenium faturanium peskysodium similarberkelium regularthulium accurateniobium abruptfluorine acidiclutetium snobbyosmium fanaticalstrontium chubbybromine deafeningseaborgium lyingscandium heartfeltneptunium brainyplatinum aromaticargon hiddenarsenic toothsomesulfur discerningcaesium hissingpolonium uttermostbohrium tidyiodine AdamantChicken2

There are probably tons more accounts that I didn't find.

→ Reply

15 months ago, # ^ |

The reason is probably that there is a literal formula in the statement and the AI can't resist the stupid temptation to brute force its way out of the problem. Or, a bit less likely, that it is just a genuinely creative problem that happened to end up in their out-of-sample data.

→ Reply

15 months ago, # ^ |

← Rev. 2 →

Also, I wonder if MikeMirzayanov approved of this bot spam. Given that it could have been adversely impacting user experience by overloading the judge system, and the usage seemed somewhat unauthorized in that sense, it could be a major issue but I am not really sure. There are also potential legal issues lurking around, regarding usage of code in commercial contexts if the Codeforces TOS has anything to say about it.

→ Reply

15 months ago, # ^ |

Using https://cfviz.netlify.app/ to check out the maybe-model AdamantChicken2:

Problems tried: 33
Problems solved: 33
Average attempts: 3.12
Max attempts: 10 (1810-C)
Solved with one submission: 18 (54.55%)
Total submissions: 103

To me this does not look like an AI doing a contest. Every problem tried was solved.
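For anyone who wants to recompute such a summary from a submission history, here is a small sketch. It assumes submissions are given in chronological order and that "average attempts" means total submissions divided by problems tried; how cfviz actually defines it is not stated here. `Submission`, `Stats` and `summarize` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Submission {
    std::string problem;  // e.g. "1810-C"
    bool accepted;        // true if the verdict was AC
};

struct Stats {
    int tried = 0;
    int solved = 0;
    int solved_first_try = 0;
    int max_attempts = 0;
    int total_submissions = 0;
    double average_attempts = 0.0;
};

// Summarize a chronological list of submissions.
Stats summarize(const std::vector<Submission>& submissions) {
    assert(!submissions.empty());
    std::map<std::string, int> attempts;    // submissions per problem
    std::map<std::string, bool> is_solved;  // has the problem been accepted yet?
    Stats s;
    for (const Submission& sub : submissions) {
        int& count = attempts[sub.problem];
        ++count;
        bool& done = is_solved[sub.problem];
        if (sub.accepted && !done) {
            done = true;
            ++s.solved;
            if (count == 1) ++s.solved_first_try;  // AC on the very first submission
        }
    }
    s.tried = static_cast<int>(attempts.size());
    s.total_submissions = static_cast<int>(submissions.size());
    for (const auto& entry : attempts) s.max_attempts = std::max(s.max_attempts, entry.second);
    s.average_attempts = static_cast<double>(s.total_submissions) / s.tried;
    return s;
}
```

Under that definition the quoted numbers are internally consistent: 103 / 33 ≈ 3.12 attempts per problem, and 18 / 33 ≈ 54.55% solved with one submission.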

→ Reply

indian_rounds_LULW

15 months ago, # ^ |

← Rev. 3 →

-45

It's just that the solution shown in the presentation looks very similar to 234586634, with only a few differences in variable names and types. Even endl was used both times (a person who has been doing competitive programming for some time wouldn't do this by accident, unless they are new to CP or just copied another solution).

Also AdamantChicken2 could really stand for AlphaCode 2, but for now we can only speculate.

Also bear in mind that in some contests only problems A and B were submitted and later on solved, meaning that, perhaps, the model couldn't solve other problems at all (failing on samples etc).

UPD Apparently some drooling noobs with red nicknames still use endl, so my claim is not that relevant.

→ Reply

Um_nik

15 months ago, # ^ |

+27

endl is easier to type, and if the problem is not output-heavy (which is the case for the problem you linked) it is totally reasonable to use endl.
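For readers unfamiliar with the endl debate, a minimal sketch of the difference: `std::endl` writes a newline and then flushes the stream, while `'\n'` only writes the newline. The helper names below are made up for illustration.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Writes 1..n, one per line, using std::endl: newline plus a flush every time.
void print_with_endl(std::ostream& out, int n) {
    assert(n >= 0);
    for (int i = 1; i <= n; ++i) out << i << std::endl;
}

// Writes the same lines using '\n': no forced flush.
void print_with_newline(std::ostream& out, int n) {
    assert(n >= 0);
    for (int i = 1; i <= n; ++i) out << i << '\n';
}

// Captures what a printer writes, to show that both variants emit identical bytes.
std::string render(void (*printer)(std::ostream&, int), int n) {
    std::ostringstream buffer;
    printer(buffer, n);
    return buffer.str();
}
```

The output is byte-for-byte the same; only the number of flushes differs, which is why the choice matters for output-heavy problems and hardly at all when a solution prints a handful of lines.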

→ Reply

indian_rounds_LULW

15 months ago, # ^ |

-8

Yeah endl here was totally reasonable, it was just a Kripparrian reference

→ Reply

Gemini: Excelling at competitive programming

15 months ago, # ^ |

The account AdamantChicken2 seems to have been specifically tested on problems that previous bot accounts were able to solve.

→ Reply

Entropy.

15 months ago, # |

I wonder what it will be able to do a decade later. So much improvement has happened in less than 2 years.

→ Reply

bigSchrodinger

15 months ago, # ^ |

+32

I do not see any improvements tbh. Just something that was fed more training data.

→ Reply

IBaloff

15 months ago, # ^ |

I also don't see any progress here, literally the same thing but just higher rating. My bet is that in a decade these models will perform much better, but there will be no real improvement.

→ Reply

Psychotic_D

15 months ago, # |

← Rev. 2 →

+11

I think we can believe these data only when they make one account, take part in contests, and continuously achieve good ranks.

Yeah, and one more thing: the contest testing process must now include AIs as well, at least for Div. 2C and Div. 2D. I know we can't do more with Div. 2A and B-level problems, because it now seems more feasible to solve them using AI.

Remembering Alon Musk here, "Mark my words, AI is far more dangerous."

→ Reply

bigSchrodinger

15 months ago, # ^ |

+56

Alon Musk is wise man

→ Reply

vrintle

15 months ago, # ^ |

+53

Alon Musk is wise man

Yeah, the CEO of Tasla Motors.

→ Reply

Thaumic_Executor

15 months ago, # ^ |

-33

You both guys have typo lmao

→ Reply

Pirate_King

15 months ago, # ^ |

Whoosh

→ Reply

egor.okhterov

15 months ago, # |

→ Reply

md_nihal

15 months ago, # |

← Rev. 3 →

I think the problem they tested on was from the past CodeTON Round 4, which is basically

This

So there is a possibility that it already knows the solution, since the submissions and editorial are out there and could have been read as part of its training data.

So let's see if it can solve problems in a live contest.

→ Reply

15 months ago, # ^ |

← Rev. 3 →

+11

It also seems like they cherrypicked this example, since the virtual participation of the bot is only able to solve A, C and G out of the 8 problems in the set, taking 8 submissions to AC on C.

My hypothesis is that it was able to solve A on its own, needed a lot of effort on C (which is still commendable) despite the possibility of test data leakage, and was able to regurgitate its learnt solution for G.

Also found it really funny that AI tries to hack itself/hardcodes solutions to samples: https://codeforces.net/contest/1810/submission/234586193

Clearly shows the biases it gets from training on CF code, and might point towards test data leakage if the code is anywhere close to being relevant to that problem.

→ Reply

15 months ago, # ^ |

It trying to "hack itself" 234586193 could just be because that is an easy way for it to get AC on the sample. I don't think that has anything to do with data leakage.

→ Reply

15 months ago, # ^ |

Yeah, just that the existence of a submission that hacks itself but otherwise has a mostly correct solution would strongly suggest data leakage.

It doesn't seem to be the case for this solution since it's an idiosyncrasy of the solution filtering process.

→ Reply

fonmagnus

15 months ago, # |

At this rate, someday the term "Competitive Programming" will evolve from "competing to solve programming problems" to "competing to create AI to solve programming problems" lol

→ Reply

ShaoNianTongXue5307

15 months ago, # |

+10

cAn aI s0Ive nPc Prob15m?

→ Reply

15 months ago, # |

+192

Here is the submission 229535002 for the 3200 rated problem 1810G - The Maximum Prefix shown in Google's video

for (long long i = 0; i <= n; i++) cin >> h[i];
for (long long i = 0; i <= n; i++) dp[0][i] = h[i];
for (long long i = 1; i <= n; i++) {
  for (long long j = 0; j <= n; j++) {
    dp[i][j] = ((p[i] * dp[i - 1][j + 1]) % mod + ((1 - p[i] + mod) * dp[i - 1][max(0LL, j - 1)]) % mod) % mod;
  }
}

Here is the 8 months old editorial for the same problem

for(int i = 0;i <= n;i++) scanf("%d",&h[i]);
for(int i = 0;i <= n;i++) f[0][i] = h[i];
for(int i = 1;i <= n;i++) {
  for(int j = 0;j <= n;j++) {
    f[i][j] = (1LL*p[i]*f[i - 1][j + 1] + 1LL*q[i]*f[i - 1][max(0 , j - 1)] ) % mod;
  }
}

Is it time to bring back this meme?
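For reference, both snippets above appear to compute the same recurrence. Written out, and assuming that `q[i]` in the editorial code equals `1 - p[i]` modulo `mod` (which matches the `(1 - p[i] + mod)` term in the submission, but is not shown in the excerpt):

```latex
f_{0,j} = h_j, \qquad
f_{i,j} \equiv p_i \, f_{i-1,\,j+1} + (1 - p_i)\, f_{i-1,\,\max(0,\,j-1)} \pmod{M},
\qquad 1 \le i \le n,\ 0 \le j \le n .
```

Here $M$ stands for `mod`. Note that $j = n$ reads column $n + 1$ of the previous row, so both versions need tables with at least $n + 2$ columns.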

→ Reply

15 months ago, # ^ |

← Rev. 3 →

+52

I think it's practically confirmed that there has been test data leakage.

The 6 accepted submits from phobicgrouse:

The inner DP loops look virtually identical and have the same structure as the editorial.

6 random AC submits from the in-contest leaderboard:

Completely different loop structures.

→ Reply

15 months ago, # ^ |

Test data leakage is not confirmed. The report states that AC2 uses "A sampling mechanism that encourages generating a wide diversity of code samples to search over the space of possible programs". It also mentions that the database uses "30 million human code samples."

This means that if a contest uses a problem that can be solved by the code of another problem, the AI may guess it correctly. The AI would be very good at finding duplicate problems, for example.

→ Reply

15 months ago, # ^ |

← Rev. 2 →

+40

It also doesn't mention anything about preventing test data leakage. It also mentions that a giant foundational model which has probably learned from everything on the internet (this means editorials) was plugged into the alphacode framework. It also writes code (or should I say "generates a wide diversity of code samples") that is nearly identical to the editorial for this problem, while human competitors write highly distinct code. I think the conclusion is obvious.

→ Reply

15 months ago, # ^ |

The conclusion is not obvious. If I had to guess, I would say I am 80% certain that there is data leakage. But I base that assumption mainly on my opinions: the AI's submission looks fishy, and the report seems to overstate the AI's capabilities and is poorly written.

But it is still an opinion that leads us to the conclusion that a data leak probably happened. I agree with that, but I do not agree with your choice of words, "practically confirmed". I believe proof has to meet higher standards.

For example, this is how I would estimate the situation (opinion based):

Data leak: 80%
Problem G is almost plagiarized and a very similar solution exists: 8%
Report was falsified: 7%
Problem G is plagiarized: 4%
AI is good enough to solve that specific problem: 1%

→ Reply

15 months ago, # ^ |

+19

That doesn't confirm anything. As far as I can tell from reading the report and watching the video, Google has not mentioned anything about avoiding leakage of data. That in itself is alarming.

→ Reply

15 months ago, # |

+61

I just realized AdamantChicken2 shortens to AC2 -> AlphaCode 2.

→ Reply

15 months ago, # |

+27

Whenever I read that AIs solve human tests (CP, university exams, school exams, etc.), I cannot help but find that the discussions misinterpret the results.

Computer programs are fast. If AlphaCode 2 can solve 43% of the problems within contest time, then it may not be able to solve 44% within 1 year. It already had near-infinite time. Now imagine how well AlphaCode 2 would place in a contest that runs for a year. How well would the AI place if very unusual problems with long solutions appeared? Not well. Under these conditions, the AI may not even reach pupil status.

Also note that, according to the report, the AI is not Expert-CM. The AI placed in the top 0.5% in 2 contests, meaning it has contests where it places in red, but therefore also has contests where it does in fact place like a pupil. This should not shock anyone; the report clearly states that the AI is basically good at guessing which of the millions of solutions to random problems work for the current problem.

→ Reply

ShaoNianTongXue5307

15 months ago, # |

+27

Why not simply let the AI take part in some rated contests, where the problems are guaranteed to be new, instead of letting it VP old contests and sparking unnecessary arguments?

→ Reply

15 months ago, # ^ |

← Rev. 2 →

+15

Desperate attempts at publicity and staying relevant. It is suspicious how they don't address test set leakage either, at all.

Edit: a bit less cynical response: Google releases have traditionally been quite controlled, so it makes sense to not expect them to release the bot in the wild for open tests as of yet. The small parts of the machinery behind what is AlphaCode 2 might be really good, but the AI itself doesn't seem to be quite there yet — not implying that this can't change in the future.

→ Reply

AdamantChicken2

15 months ago, # |

+132

Thanks for the interest!

To address some questions in the comments:

Regarding possible data leakage: this was indeed a huge concern for us and we went through extensive lengths to avoid contamination. Please see this post on X for more details on how we did it. TLDR: we're confident that AlphaCode 2 did not see this specific problem (or its editorial/solutions); however, given that prefix sums appear regularly in competitive coding, the model has likely seen somewhat related (but distinct) problems and managed to adapt this knowledge to come up with its own solution.

An additional data point: AlphaCode 2 has a solve rate of nearly 100% on problems it was actually trained on, which is very far from its performance on any held-out contest we evaluated on.

On the potential leak of tags: we are using randomly-chosen tags for each sample we generate. Interestingly, for that particular sample the tags did not contain "dp" :)

PS: Extra kudos to TwentyOneHundredOrBust for figuring out the easter egg in the account name :)

→ Reply

15 months ago, # ^ |

← Rev. 2 →

+162

Thanks for addressing some of the concerns, including mine.

It would be really cool if AlphaCode 2 participates in a statistically significant number (enough to be able to derive narrow enough 95% confidence intervals) of the upcoming rated Codeforces contests, since I am still not completely convinced given that the code is so similar to other solutions.

And it would be the best form of testing too, since Codeforces problems are scrutinized extremely carefully during their preparation, often by veterans in the community who have seen a multitude of problems and can usually detect with a high amount of accuracy whether any problems are duplicates of past problems or not.

If done, this would truly be a parallel to Deep Blue's match with Kasparov, which consisted of multiple games of chess, with the number of them being barely enough to be able to draw conclusions.
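To give a feel for how many fresh problems "narrow enough" might require, here is a rough sketch of a 95% interval for a solve rate. It is a simplification: it treats each problem as an independent trial and uses the normal (Wald) approximation, whereas the comment above is about contest results and rating, not a raw solve rate. The function name `solve_rate_ci95` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>

// 95% normal-approximation (Wald) interval for a solve rate solved / total.
// Returns (lower, upper), clamped to [0, 1].
std::pair<double, double> solve_rate_ci95(int solved, int total) {
    assert(total > 0 && solved >= 0 && solved <= total);
    const double p = static_cast<double>(solved) / total;
    const double half_width = 1.96 * std::sqrt(p * (1.0 - p) / total);
    return std::make_pair(std::max(0.0, p - half_width), std::min(1.0, p + half_width));
}
```

With a hypothetical 43 solved out of 100 fresh problems, this gives roughly 0.33 to 0.53, which is still a wide range. The half-width shrinks only like one over the square root of the number of problems, so a convincing live evaluation needs many rounds.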

→ Reply

Dominater069

15 months ago, # ^ |

+56

+1 support to the proposal

→ Reply

stefdasca

15 months ago, # ^ |

+37

+1, I want to see AlphaCode 2 compete in some rounds to see how good it really is when it faces new problems.

→ Reply

Enteromorpha

15 months ago, # ^ |

+18

The progress made by AlphaCode 2 is impressive. However, with its potential release to the public, there is a growing concern about how it could affect the competitive programming scene. What measures might be implemented to preserve the integrity of online competitions, particularly on platforms like Codeforces, against the sophisticated capabilities of models like AlphaCode 2?

→ Reply

adamant

15 months ago, # ^ |

+347

→ Reply

15 months ago, # ^ |

+83

So is your explanation to this that it is just a notorious coincidence? Your submissions use both the same variable names and formatting as the editorial, see for example 229535001 and 228987836. They are so similar that had two people submitted this in a competition, then I would have thought that they were cheating.

Btw it really would be great if you test on live contests / fresh problems. That way there wouldn't be any doubt about contamination.

→ Reply

15 months ago, # ^ |

It's a problem with such a short solution that some correct solutions are bound to look almost exactly the same

→ Reply

Pirate_King

15 months ago, # ^ |

+39

You were able to come up with an AI model able to solve CF problems but couldn't think of running it on live contest and had to go to "extensive lengths to avoid contamination" ??

→ Reply

15 months ago, # ^ |

← Rev. 2 →

you are mixing up 2 different things.

Announcing an AI's strength should come with live contest results. I don't know why they did not do that either, and it makes their research look unprofessional, especially considering it's Google.

But training an AI has to be done in an offline-setting. It would not be feasible to only train the AI every 3 days for 2 hours, only when a live contest happens. In order to train an AI, you have to be able to judge its performance and for that it is necessary to go to "extensive lengths to avoid contamination"

→ Reply

Pirate_King

15 months ago, # ^ |

← Rev. 2 →

+27

I think you are misunderstanding what contamination means.

They are saying that they first trained their model on lots of problems and then selected some new problems to test the model on. Here they went to great lengths to ensure that the question they are testing on is not contaminated, that is the questions are completely new and the model has never seen that question in "training" phase.

So what I wanted to say was: why go to such lengths when you can test your model on a live contest where the problems are likely new and never seen before? For training you can use as much historical data as you want.

→ Reply

15 months ago, # ^ |

-20

I already answered why the researchers have to ensure test data is not contaminated. Let me try again.

If the test data is contaminated, your resulting AI will be weaker than possible. This is obvious, right?

You cannot train an AI on live contests, depending on the AI, that may take centuries. This is obvious, too, right?

→ Reply

15 months ago, # ^ |

+11

No one said training should be done on live contests. Evaluation should though. The model here doesn't seem to be trained via reinforcement learning (at least not a major part of it), which is the misconception you seem to have.

A garden variety simplified model training process looks like this (if you ignore cross validation and similar stuff):

You have a training set that is used to train your model.
You leave out a validation set for things like early stopping and so on — this is different from training but you still implicitly optimize on this set.
You have a test set on which you compute results of your model.

All of these sets should be completely disjoint (at least independent), but should also come from the same data distribution, in order for the results to make sense.

The disjoint-ness of training+validation and testing data is what the researchers at Google claim, and this claim is being questioned in a lot of the above comments.

The solution a lot of the above comments are suggesting is to ensure that the test set comes from a set which is impossible to get contaminations from, unless someone has a time machine — which is the future.
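The temporal version of that split can be sketched as follows. This is only an illustration: the `Problem` record, its `published` timestamp field and the two cutoff parameters are hypothetical, and nothing here reflects how Google actually built its sets.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Problem {
    std::string id;       // e.g. "1810G"
    long long published;  // publication time, e.g. seconds since the Unix epoch
};

struct Split {
    std::vector<Problem> train;
    std::vector<Problem> validation;
    std::vector<Problem> test;
};

// Split by publication time: everything before validation_start is training data,
// everything from test_start onward is test data, and the rest is validation data.
Split temporal_split(const std::vector<Problem>& problems,
                     long long validation_start, long long test_start) {
    assert(validation_start <= test_start);
    Split s;
    for (const Problem& p : problems) {
        if (p.published < validation_start) {
            s.train.push_back(p);
        } else if (p.published < test_start) {
            s.validation.push_back(p);
        } else {
            s.test.push_back(p);
        }
    }
    return s;
}
```

A split like this keeps the three sets disjoint by date, but it is only as trustworthy as the timestamps of everything the model was trained on. Evaluating on live contests amounts to setting `test_start` later than the end of training, which removes that doubt.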

→ Reply

mystic777

15 months ago, # ^ |

← Rev. 2 →

+22

If you're so confident in your model, why don't you make it compete with actual human competitors in a real live contest? Then we'll see if the claims are really true; otherwise there isn't much to back up the claim.

→ Reply

VLamarca

15 months ago, # |

+79

So the summary is: The bot solved A, C and G from a Div. 1 + 2 round, nothing else. Its participation was virtual. G's solution is extremely similar to the author's solution. They promise there was no leakage. To me this is pathetic.

→ Reply

15 months ago, # ^ |

+13

Difficulty measure for such a model may be quite different than what we perceive as humans. G's solution is very short and its statement is written in clear mathematical terms rather than some Alice getting arrays as birthday presents. To me this is believable

→ Reply

15 months ago, # ^ |

+32

Extraordinary claims require extraordinary evidence

→ Reply

15 months ago, # ^ |

+11

I don't think this is extraordinary. Length of the solution and formality of the statement are probably the two factors that I would expect to be affecting the performance the most

→ Reply

VLamarca

15 months ago, # ^ |

+13

You do realise that the bot participated in the round 6 months after it happened, right? Why the heck don't they make it participate in a live round? I do agree that the solution is very short and more likely to be guessed by a brute-force approach. But the truth is that Google showed a bunch of unprofessional signs in the way these "research" results were presented.

→ Reply

15 months ago, # ^ |

I do realise that. I am far from convinced that it will revolutionize the playing field or show enough consistency, and participating in actual rated contests would be the only way to convince me. But that was not the topic of your original comment, which was strictly about the believability of what they have already claimed regarding that particular problem G. That claim is far from being obviously false to me, as it seems to be to you.

→ Reply

not_wiz

15 months ago, # |

High time to ask, "Is it the end of Competitive Programming? Should I quit grinding and hustling?"

→ Reply

15 months ago, # ^ |

→ Reply

not_wiz

15 months ago, # ^ |

Considering AlphaCode 2, don't you think it will kill this domain the moment it is made available to the public?

Even if it can only solve Div. 2 problems and not hard Div. 1 problems, it would destroy CP for folks in the 0-1900 rating range.

→ Reply

15 months ago, # ^ |

+22

I'm not convinced that any such tool currently is good enough to beat a 1400 rated programmer with a win rate of > 50%. This number is what official claims seem to indicate, but it could very well be 1000 (reason below).

There is way too much contamination, and the general avoidance of temporally separated testing mechanisms (like how tests on CF work) inflates their performance metrics. There have been papers that talk about this phenomenon of true-out-of-sample decay, especially about Project Euler and Codeforces for ChatGPT.

→ Reply

not_wiz

15 months ago, # ^ |

Hope currently lasts as long as possible

→ Reply

triple__a

15 months ago, # |

+70

Why not evaluate on their own Code Jam/Kick Start problem sets :)

→ Reply

Kyou_mo_kawaii

15 months ago, # ^ |

+17

Code jam is dead... unless you mean this is a good excuse to bring it back. The PR stunt could be: beat AlphaCode2 in a contest and win a tshirt?

→ Reply

RajAyush.

15 months ago, # |

In this video, it is claimed that Gemini can solve a 3200 Elo problem. Is this an existential crisis for newbie competitive programmers like me?

→ Reply

15 months ago, # ^ |

+30

Depends on whether you can implement a DP solution given the editorial code.

→ Reply

15 months ago, # ^ |

No. This problem seems to be a total outlier, and it is beyond me why they feature it in their video. The second-highest-rated problem the AI (AdamantChicken2) solved is rated 2100.

And besides that, the AI has NOT been trained under live contest constraints. For example, this submission is for Problem G. You can see the AI (purringgrattini) abusing the submission to extract extra information about the output. That is not something you could ever do in a live contest.

→ Reply

HHY_zZhu

15 months ago, # |

+25

Interestingly, in AdamantChicken2's code the "quick power" function is written as "qmi" (quick mi, where "mi" is the pinyin for "power"), so it seems that AdamantChicken2 also knows Chinese phonetics.
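
For readers who have not seen the idiom: "quick power" is binary (fast) exponentiation, and `qmi` is a name often used for it in Chinese-language templates. The following is only a minimal sketch of what such a helper usually looks like, not the code from AdamantChicken2's submission; the signature and the 32-bit bound on the modulus are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Binary exponentiation: returns (base^exp) mod m using O(log exp) multiplications.
// Assumption for this sketch: 1 <= m < 2^32, so every product fits in 64 bits.
std::uint64_t qmi(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    assert(m >= 1 && m <= 0xFFFFFFFFULL);
    std::uint64_t result = 1 % m;  // 1 % m is 0 when m == 1
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;  // this bit of exp is set: multiply it in
        base = base * base % m;                   // square for the next bit
        exp >>= 1;
    }
    return result;
}
```

A call such as `qmi(2, 10, 1000000007)` would compute 2^10 modulo 10^9+7.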

→ Reply

15 months ago, # ^ |

← Rev. 2 →

+71

I found some Chinese in one of their submissions 232778386.

I also found that they are using a function called Gamal() 232777058 232778378 for cin.tie stuff, which seems to be taken straight from the template of Gamal74 205625212.

Note that all of these submissions I linked are for the same exact problem.

→ Reply

Xellos

15 months ago, # ^ |

+16

No test data leakage confirmed!

→ Reply

15 months ago, # ^ |

test data != training data

→ Reply

TimonKnigge

15 months ago, # ^ |

If it's a template then it could appear in earlier training data as well.

→ Reply

15 months ago, # ^ |

← Rev. 7 →

+42

Agreed. It seems that Gamal74 started using the Gamal() template in July 2022. This is the oldest submission using it that I could find 162836061. So the training data is at least as recent as July 2022.

However it is a bit suspicious that the bot is using Gamal() on that specific problem. For example, it is the only problem that Gamal74 has solved in that entire contest. Also, I've never seen the bot use Gamal() when solving other problems. It seems to only do it for this problem.

EDIT: I just did a more thorough check. Some of the bot submissions 232777056 232777056 use a newer version of the template. The first time I can see this version of the template being used by Gamal74 is in late September 2022 173629356.

→ Reply

chuka231

15 months ago, # |

+14

Would like to see its performance on future problems that it has not been trained on.

→ Reply

roycf123

15 months ago, # |

Not implying that this is impossible, but it actually seemed a bit unnatural to me that an AI could accurately do tasks that require a high degree of precision for correctness, like solving a very specific math problem or a programming problem statement, with a very high confidence rate.

For drawing, writing and all it makes sense, where just the overall pattern/structure matters...

→ Reply

peltorator

15 months ago, # |

← Rev. 3 →

+39

I guess the comments above are about plausibility, but I honestly don't care that much whether it is true or not. I am pretty sure it will happen at some point anyway, so whatever. And I am not surprised by this in any way! We know that the most important thing in competitive programming is practice, and that is exactly what machines are much better at than we are.

I genuinely believe that if you gave a chicken a million years to practice, it could become a candidate master or a master. I am not trying to insult candidate masters with this comment; it is still hard to become one! I mean it in the same way that if a chicken studied chess for a million years, it would become better than Magnus Carlsen. It's just about human resources. If some 3500-rated legendary grandmaster would call a problem "an exercise" from the top of their mountain of experience, why can't a computer learn to do such exercises?

And the reason I am still not impressed is that these are indeed the problems you just need to learn to get through; the real stuff begins later, when a 3500 has to start thinking. I don't see AI being able to solve any of those problems any time soon, so whatever...

→ Reply

Meguhine

15 months ago, # ^ |

But what we humans need is a thinking machine rather than an omnipotent database. If a new kind of CP problem (one that requires some new algorithms) comes out in the future, would a database find the corresponding answer? No, it will just try every combination of the algorithms in its database, and eventually tell us it doesn't know.

→ Reply

peltorator

15 months ago, # ^ |

Idk. If so, it is a good shake-up that pushes authors to come up with novel ideas.

→ Reply

HHY_zZhu

15 months ago, # |

+42

If I understand correctly, this large language model is like a clever monkey typing in the infinite monkey theorem? Because it seems like it's just randomly putting together some code and then trying them one by one, but using some clustering method to speed up this process.

→ Reply

Vasanth73

15 months ago, # ^ |

-8

It is nowhere near as brute-force as you are describing. Machine learning is much more like human learning. The model has a vast number of parameters, which enable it to make a lot of accurate guesses. From those guesses, this particular model submits the 10 codes most likely to be accurate, and it is considered to have solved the problem if any of the ten gets AC.
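
To make the "top 10" step concrete, here is a rough sketch of one way such a selection could work: group sampled programs by what they print on a shared set of probe inputs, then take one representative from each of the largest groups. Everything here is an assumption for illustration (the `Candidate` struct, the `pick_submissions` name, clustering by exact output equality); it is not AlphaCode 2's actual pipeline, and it leaves out any filtering or learned scoring.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: one sampled program, reduced to its outputs on shared probe inputs.
struct Candidate {
    int id;                            // hypothetical identifier of the sample
    std::vector<std::string> outputs;  // outputs[i] = what it printed on probe input i
};

// Cluster candidates with identical behaviour, rank clusters by size, and
// return one representative id from each of the `limit` largest clusters.
std::vector<int> pick_submissions(const std::vector<Candidate>& candidates,
                                  std::size_t limit = 10) {
    assert(limit > 0);
    std::map<std::vector<std::string>, std::vector<int>> clusters;
    for (const Candidate& c : candidates) clusters[c.outputs].push_back(c.id);

    std::vector<std::vector<int>> groups;
    for (const auto& entry : clusters) groups.push_back(entry.second);

    // Largest clusters first; stable_sort keeps ties in map order, so the result is deterministic.
    std::stable_sort(groups.begin(), groups.end(),
                     [](const std::vector<int>& a, const std::vector<int>& b) {
                         return a.size() > b.size();
                     });

    std::vector<int> picked;
    for (const auto& g : groups) {
        if (picked.size() == limit) break;
        picked.push_back(g.front());  // first member stands in for the whole cluster
    }
    return picked;
}
```

The problem would then count as solved if any of the returned candidates gets AC, which is why the number of attempts matters when comparing against human participants.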

→ Reply