This year's IOI syllabus and contest rules were published a few days ago. They can be found here: http://ioi2017.org/contest
What does this mean? Does this consider submitting randomized solutions as cheating?
Yes or no, depending on the circumstances. In practice, no.
If you keep submitting the same random solution and switching the seed until you get more points, you can get disqualified. This approach doesn't demonstrate any programming skill and it's very unlikely to work anyway.
Randomised solutions that give correct results and are supposed to give correct results should be ok. Even if you submit something and it works by accident, that should also be ok. (There's no reverse engineering in that.)
I think the result of a randomized solution should only have slight dependencies on the seed used to generate the data?
The behavior that this is going after is more like guessing the data generator used to make the data, as well as the random seeds, and then sending in a solution that detects and solves only those cases. Of course, that is nearly impossible to do. But keep in mind that earlier submissions can also compute on these same data and return quite a bit of info through the feedback mechanisms.
Such a rule was also proposed last year, but it was not used because team leaders didn't want it. Let's see what happens this year.
The rule was rejected mostly because of very bad wording, which made it possible to treat very different kinds of behavior as cheating (examples are provided below, in other comments). This wording is better, but still not perfect.
As I understand it, the idea is that you can reverse-engineer data to fix bugs or find hard cases, but your solution should solve the problem, not the exact test cases. So it should not fail (or at least it should not be obvious that it will fail) if the random seeds in the generator are changed, the vertices of the graph are shuffled, or other changes are made which do not really change the test. Still, this is not formal enough, because your solution can simply contain a rare bug.
The second part of the rule obviously permits assertions and the like. But I am still not sure whether something like "I checked that there is no test with too many vertices of degree 21, so I let this case in my solution work slower than formally needed" is formally OK. I think it should be, because it does not really differ from a person who just sends the solution.
Let's try calling rpeng, the main inspirer of this rule as I understand it. Maybe he can explain it better. Also, if the community has ideas on how to make the formal wording closer to the spirit of the rule the ISC wants, that would be great.
P.S. I am not an ISC member; this is only my understanding, based on conversations at IOI 2016.
At first I thought this was just a stupid idea. I'm pretty sure a similar idea was presented last year and it was shot down for being vague and unenforceable.
But thinking a bit more, and reading closely, I think perhaps this rule is okay?
For example, I thought at first that the following case would result in unfair judgement: Say there is a problem with a tree. A contestant incorrectly believes that node 0 will always end up with even degree, so she adds:
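// (hypothetical reconstruction of the missing snippet: an assertion encoding her wrong belief;
// degree[] is an assumed array holding each node's degree)
assert(degree[0] % 2 == 0);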
and ends up with a crash judging result. Suddenly (I thought) the contestant is disqualified for discovering details about the test data! But actually this is okay by the rule, because if she fixes her solution to handle all cases where node 0 has odd degree, then she has not "adapted to specific test cases" but rather solved all cases with node 0 having odd degree.
I hope that it will be clarified that doing this intentionally is also okay(?) That is, if you get WA, you are allowed to check what type of case it is getting WA on by submitting with asserts.
Maybe this is already obvious to everyone else and I'm just thinking too hard ><
(oh but also, there is still the problem of how to enforce the rule...)
edit: actually I guess they could just actually go through with testing solutions on another data set with a different seed, and then manually check any solutions that get a different score/different correct subtasks.
Maybe not allowing the use of the assert() function.
Well, that's another issue entirely, and I think it's entirely unenforceable. For example, say you accidentally (or "accidentally") have some code that accesses an array out of bounds and crashes, but you only call it in a specific case. Then it acts as an assert.
Well, that's a really bad idea. You can just force a TL in this sort of case (or at least that's what I'd do) and get the same kind of information, just in a more painful way and by soliciting the grader more than needed. I guess that enforcing the rule is the really hard part, but it's good that asserting stuff to find out in which sort of case your solution is failing (in order to fix the bugs) is not considered unfair. It really isn't, because finding the bug is not what OIs are about, but rather finding the right idea.
I don't think it's possible to ban infinite loops.
The Soviet government would not hesitate to ban infinite loops if it was informed that they cause programs to crash and serve no useful purpose otherwise. The engineers responsible would be charged with sabotage and proclaimed enemies of the people.
Darn, I wish I could proclaim cows who use certain unholy constructs (e.g. Fenwick trees ^_^) as saboteurs/enemies of the herd, then banish them to the iron mines in upper Michigan (aka. Siberia for Bovinia?)
All you have to do is get them to publicly confess. There are a couple of reliable tricks to this end.
Keep in mind that most of the (large) test data are generated randomly in the first place: your score in such cases shouldn't fluctuate too badly if the data generator seeds change.
One should be rewarded with 10000 points for having such a solution. Under the new rules, contestants may only submit at most 50 solutions per task. People who can solve a problem in a highly test-data-dependent manner under such a rule don't even exist.
You'd be amazed how easy it is to turn in such solutions, especially under the 2012 — 2014 format of seeing the outputs of all test cases. Keep in mind that contestants often have 2~3 hours to work on a single problem at the IOI.
A pattern in these full-feedback contests seems to be that as soon as it's information-theoretically possible to extract key bits to break the test data, someone will figure out such a solution. This puts a huge burden on the problem setters in terms of closing all the leaks; for example, the memory usage returned in IOI 2015 could be used for this.
On the other hand, there are statistics showing that the 50 submissions were indeed a very useful resource for the majority of the contestants. So the goal of this kind of guideline is to free up problem-setting resources that would otherwise be devoted to these security checks.
That's a strange rule, and I'd like to see a rationale behind it. In absence of such a rationale, I'd say the rule is far from perfect:
Most importantly, it puts an additional burden on the contestants: they now have an additional poorly defined measure of solution quality to keep in mind. This better be justified by a real concern.
Next, it does not play well with certain types of problems, even ones which appeared before. For instance, IOI 2013's Art Class is essentially test-data-dependent. If such a problem appeared in IOI 2017, with the current wording of the rule, many contestants would be anxious about what they can and what they can't do in their solutions. Generally, the rule is problematic any time the problem has only few outcomes (say, a yes/no problem with very few test cases). Some contestants will be overthinking this rule in such problems, and it can't be fully helped.
Furthermore, the definitions are not clear. What would be an equivalent set of test cases if there was no randomness involved in the generation of certain tests? How much is "significantly fewer"? If a contestant submits the same solution with different random seeds 2, 3, ..., 50 times, how many repetitions count as cheating? Or how lucky do they have to get on their second-to-last try to be considered cheaters?
Finally, what's the use case of the new rule which gives anyone a benefit? For example, if problem setters overlooked a certain flavor of test cases, the rule does not cover them, since changing the random seed usually does not help to cover another class of tests.
(Note: I am not defending the rule. I believe there are better solutions than the proposed rule.)
Let me mention a situation in which this rule may be acceptable. (A similar situation has occurred at IOI, though possibly in a cleverer form.)
Imagine there is a problem which outputs yes/no, with hundreds of test cases. Then a contestant can use the verdict and failing-test-number feedback to deduce, one test at a time, what each answer must be, and finally submit a solution that hardcodes the recovered answers (a concrete walkthrough of exactly this trick appears further down the thread).
In this way the contestant can get full points with almost no knowledge of the problem.
This is true, I have seen friends of mine do this on training contests. But isn't that the authors' fault? I think in such cases the problem should ask for some reconstruction of what it is looking for, in case the answer is Yes. If this is not possible, then maybe it is not a good problem for a competition with full feedback. For example, another bad type of problem is one in which the number of different inputs is quite low (say 15-20). In this case, will they disqualify you for precomputing all possible answers locally in, say, 1 hour? As you said, you are not defending the rule, and in the same way I am not arguing with you, just expressing my opinion under your comment since it kind of matches what I want to say.
Yes, under an ideal situation where infinite resources are available, it's fairly easy to defend against these things by simply having 10 times as much test data. However, in practice such issues do show up frequently at IOIs, and attempts to address them by over-engineering the data have even (indirectly) led to judge crashes (IOI 2013 day 1).
The current constraints of 50 submissions, 50-100 test cases (which usually translates to only 20~30 of max size), and 2~3 seconds per case actually come quite close to the limit of judge load, especially during the last hour.
IOI 2013 day 1 wasn't because of too much test data, but because of the 20s time limit.
I also heard that it was because of the big TL, as Xellos said. But whatever the case, it is still their fault. Don't allow 100 submissions if you cannot handle it.
To clarify, the system could handle it, but there was an unrelated condition in the testing system that was supposed to avoid freezing submissions (in a brilliant stroke of irony): if some submission is tested for more than X seconds, assume the machine it was tested on must have failed, shut it down, and send that submission to be tested elsewhere. That limit X was less than 20 * (number of tests), but more than 20 * (number of large tests), so it only got triggered once someone submitted code that TLEd even on small tests, a case that had been missed during testing.
Yes, the issue was with the 20s TL. However, my impression is that the TL was raised due to feedback from the beta test, when testers realized that the original data (half as big, 10~15s TL) was too easy to break via repeated submits.
I'm not sure what the precise issue was. But my impression is that it's either by randomly prioritizing the access patterns, or getting two solutions that combine to pass a lot more of the test cases since certain brute forces do better on certain input distributions.
Correct me if I am wrong, contestants are only given the first error (like in codeforces) per subtask since IOI 2016.
The keyword is `significantly fewer', aka. going from 100/100 test cases to say, 2/100 test cases. We felt that this rules out most of the adaptive submission behaviors such as using asserts to see what's wrong with the data and etc.
The most common form of this exploit is to use the feedback mechanism to extract hash codes of test cases, and get much higher points by doing 'if data hash == !@#!@#, do this'. This lets contestants, for example, distribute the running time of a brute-force algorithm over multiple submissions and aggregate the results through the feedback, or simply hardcode behavior for the tests they have identified.
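(As a rough, purely hypothetical illustration of the pattern being described, not taken from any actual submission, the idea boils down to fingerprinting the input and special-casing a test identified through earlier feedback; the function and constant names below are made up.)
#include <cstdint>
#include <vector>

// cheap FNV-1a style fingerprint of the whole input
uint64_t fingerprint(const std::vector<int>& data) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (int x : data) {
        h ^= (uint64_t)(uint32_t)x;
        h *= 0x100000001b3ULL;
    }
    return h;
}

// ...later, inside the solution:
// if (fingerprint(input) == HASH_LEARNED_FROM_FEEDBACK) { /* hardcoded answer for that test */ }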
It's neither fair nor a good idea at all.
But these guys are really cool :D
From GA minutes:
Could you share more info on this? Especially how can this be done when contestants can only receive the verdict of the first error per subtask?
Verdict and test case number. In a way, that means that they also know the verdict (correct) of all the previous tests, unlike ACM-ICPC world finals.
Imagine that there are exactly 3 test cases (toy example).
I send a submission with
if (N < 10) { cout << "YES" << endl; return 0; }
while (true);
I receive:
WA on test 2
So now I know that test 1 has a "YES" answer and N < 10 (otherwise I would have failed on test 1, not test 2), while test 2 has N < 10 and a "NO" answer (otherwise I would get TLE). N < 10 is not enough to separate test 1 and test 2, so I then try:
if (N < 5) cout << "YES" << endl;
if (N < 10) cout << "NO" << endl;
while (true);
But get "WA on test 1"
So test 1 must be 5 <= 5 < 10, becaused it was a correct yes on first submission but it is getting WA. So I try with a different ordering:
if (N < 5) cout << "NO" << endl;
if (N < 10) cout << "YES" << endl;
while (true);
And get
"TLE on test 3"
So now only test 3 remains. I try:
if (N < 5) cout << "NO" << endl;
if (N < 10) cout << "YES" << endl;
cout << "YES" << endl;
And get WA on test 3. So I send my final program:
if (N < 5) cout << "NO" << endl;
if (N < 10) cout << "YES" << endl;
cout << "NO" << endl;
Which passes all three test cases.
If this is done much more cleverly, very few submissions could be enough to get meaningful information.
Just thinking. How many people (approximately, per year) have tried to exploit the IOI system in such a way?
My impression is between 2 and 5 for the past 2~3 IOIs: it's usually high-performing contestants who already have quite a bit of familiarity with such feedback systems.
At IOI, the test data frequently has problems that affect contestants. So contestants might not trust the correctness of the test data, add some assertions, do some so-called "reverse engineering", and then get disqualified. Faulty data affects students through wrong verdicts: for example, reporting WA when the solution is actually correct, so they spend more time debugging, or reporting TLE when the real issue is WA, so they look for an infinite loop or a complexity problem, and many other cases.
Moreover, IOI has problems with strange scoring criteria, such as IOI 2015 Scales, where the same solution can be scored differently depending on the test data. In this case the test data is almost predictable and the order matters. I also tried random solutions for 1-2 points, changing the random seed. Is that also considered cheating?
Problem setters should prepare various solutions and test cases, considering lots of corner cases and approaches. A solution that intentionally uses file I/O or exploits a grader bug should be considered cheating, but otherwise, if some random solution gets accepted, that is the problem setters' mistake. I hope contestants do NOT get a disadvantage due to the problem setters.
Many contestants depend on IOI's full feedback, and this system has some disadvantages. This rule also stems from those disadvantages. To me, this rule sounds like saying "we are going to change the full-feedback system".
For fun, here is how to hack IOI 2015 Towns:
The problem is set up so that there exists a tree with N leaves. Initially, you don't know anything about the tree, but you can ask queries of the form getDistance(i,j), where i and j are leaves. You want to compute something about the tree while using at most c*N queries, for a fixed constant c. To prevent hacking, the problem was set up so that there were multiple test cases per file. But it was possible to isolate a failing file and only process the output for that particular file.
In a suboptimal solution, there are two phases: The first is to find the radius of the tree, and the second is to use the radius of the tree to compute the answer for the tree. A standard method to find the radius is to first find the vertex that is at the end of a diameter of the tree. Computing any individual phase could always be done, but computing both phases might exceed the allotted number of queries. So the idea is to compute only one phase at a time.
As Richard mentioned above, in 2015 it was possible to extract bits from the judge through the memory usage (which was 1GB if I recall correctly). Thus to communicate a number X back, you could declare an array of size X KB and look at the memory usage. Doing so, for each test where we needed too many queries, we could simply output the test index through the memory. Then, when we got to that test, we could compute the first phase, outputting a vertex that was at the end of a diameter. Hardcoding this vertex into the code, we could then run the second phase within the query limit. Doing this for each failed test case, it was possible to get full score on the problem in less than the submission limit.
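For concreteness, here is a hedged sketch of the memory side channel described above (the function and buffer names are made up): to report a small integer X back through the grader's memory-usage statistic, allocate roughly X kilobytes and touch them so they count.
#include <cstdlib>
#include <cstring>

// kept global so the allocation stays alive until memory usage is measured
static char* signal_buffer = nullptr;

void report_via_memory(int x_kilobytes) {
    size_t bytes = (size_t)x_kilobytes * 1024;
    signal_buffer = (char*)std::malloc(bytes);
    std::memset(signal_buffer, 1, bytes); // touch the pages so they show up in the usage stats
}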
If it is similar to last year, the judging system will not show you the result of all the cases, only the result of the first non-AC case, the same as CodeForces (WA on test 17, for example).
Isn't it already extremely difficult, with 50 submissions and this style of error reporting, to hard-code or reverse-engineer the data (as long as there is a significant number of test cases)?
To make it even harder, they can even stop telling you the test case you got wrong if it is past the first 20. So you can't even know if your cheating worked, or if your greedy code is good or not.
Also, in the case of Holiday 2014 and Railroads 2016, it is totally the fault of the problem-setters to have poor data which allow certain greedy algorithms to pass...
It is unfair for contestants to worry about being disqualified for actions with no intent to cheat. Often asserts and such are very helpful to debug REAL bugs (as opposed to figuring out the case). This was why I was very uneasy when the rule was proposed last year, as I thought it was ridiculous to ban any sort of "asserts" which help me to maintain my code correctness and assure myself of the logical assumptions I'm making. Technically, it was still gaining information about the test data (e.g. if it RTE, I know why) but it would NOT be figuring out the exact answer to pass that specific case, it would be to fix a MAJOR bug.
The point is, there are many better ways from the contest organiser's position to prevent cheating, especially because this current rule is pretty poorly worded — "significantly" is extremely subjective, "specific" could mean anything from a certain type of case to the exact numbers in the case.
For Holiday 14 and Railroads 16, chances are the same greedy would have passed (or at least gotten most of the points) if it's run on a regenerated set of data (this was in fact checked for Railroads, so at least there it's more of an issue with the data generators themselves). These are not the cases that I'm referring to.
The issues in IOI2014 and IOI2016 that I was referring to are solutions that distributed the running time of a brute force algorithm over multiple submits, and used hash codes + feedback to aggregate information about these. Alex Wei (Yummy) posted about the issue with the IOI2015 task above.
I don't think the criteria of 'getting very different set of scores if ran with data generated from different seed' can cause significantly more asserts to trip. My personal belief is that the real indicator for such behavior is extracting out hashes of testcases, and using them in if statements in programs. However, as you pointed out, assert(?) does return a hash code, and that should be used for debugging. That's why that wording was removed.
At the end of the day, I wouldn't even call such reverse engineering of data as cheating. After all, there are problems that are meant to be solved in these ways: https://ipsc.ksp.sk/2014/real/problems/e.html. On the other hand, they're fairly easy to spot once one looks at a code. So the current rule is an attempt to give a subjective criteria, while giving a clear criteria of what's NOT type of behavior (aka. code will do similar things for a different set of random data).
Thank you, I think that clears up the rule a lot.
It doesn't, then, seem to be something meant to impede any contestants without intention to hack or abuse loopholes. That was what I was worried about. I now see it more as a way for the Committee to have a 'disclaimer' so that they have the power to disqualify any misconduct if required.
I wouldn't intentionally extract hashes or try to guess the test case, but I would often infer through reasoning what the bug might be, due to certain behaviour in the case. Sometimes incorrect fixes can pass because data is weak, and I'm hoping that this is acceptable.
Could you please confirm if you could be disqualified for this (an incorrect algorithm which is not engineered to specific cases, but works on a very large majority of cases due to the way it works)?
One example my friend posed: you have an N^4 solution which is correct, and an N^3 solution which is usually correct. Can you run both together to pass the problem? Is this considered cheating now? It was a fairly understandable technique, but now I'm not so sure.
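(For reference, one common reading of "run both together", as a hedged sketch with made-up function names rather than anything from the rule or the thread: use the provably correct but slow algorithm when N is small enough to fit the time limit, and fall back to the fast heuristic otherwise.)
#include <vector>
using std::vector;

// placeholder for the provably correct O(N^4) algorithm
long long solve_n4(const vector<int>& a) { /* ... */ return 0; }
// placeholder for the usually-correct O(N^3) heuristic
long long solve_n3(const vector<int>& a) { /* ... */ return 0; }

long long solve(const vector<int>& a) {
    // made-up cutoff: run the slow correct algorithm only when N^4 fits the time limit
    const size_t SMALL_N_CUTOFF = 150;
    if (a.size() <= SMALL_N_CUTOFF) return solve_n4(a);
    return solve_n3(a);
}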
This is the problem I have with this: there are so many 'what if' possibilities.
Once one becomes acquainted with the problem itself, it is pretty obvious that it is a problem (we do not want contestants trained in this technique to be able to easily solve many test cases without trying to solve the IOI task itself at all).
Also, I think it is pretty obvious that perfectly defining "what is this kind of cheating" in a fixed written rule is impossible in practice. So, I think that there are only two (three if we add "using a huge number of test cases so that this is not possible", but that is probably not practical at all) practical solutions (of course, "do nothing" is a no-solution, but possible):
1) Reduce the feedback given to contestants, so that this is not a problem at all in practice. For example, an ACM-ICPC World-Finals-like full feedback, where you get a "Time limit exceeded", "Wrong answer", etc verdict for the task (subtask) as a whole (as opposed to each single individual test case) is probably enough feedback to give a user a very meaningful judging result, but most likely not enough to perform any of these attacks.
2) Clearly (on an informal level) define the idea, spirit and real problem to solve behind the rule, and trust the ISC on deciding whether something is "clearly hacking the test cases on purpose with no intention at all to solve the problem algorithmically, but just exploiting the full feedback to solve those particular test cases" (or whatever careful wording of the main spirit of the rule is chosen).
Personally, I am fine with either of these (for example, I trust that the ISC will be very reasonable and follow the spirit of the contest, especially considering that they will have to explain any such disqualification to the whole assembly). But I would strongly recommend just going for approach 1), because it is fairly obvious that many people will not trust the ISC's subjectivity (and that's perfectly fine: I see no reason to force the community to trust the ISC on such specific matters if the community is not comfortable doing so and there is no absolute need). So, in practice, 2) will likely face strong opposition from the community again (just see the reactions to this post), while I don't think most people will be strongly against 1) (I might be very wrong on this last assumption) if such feedback is reduced specifically with these problems in mind, and "full feedback" (as in, meaningful useful feedback for every submission, like the ACM-ICPC World Finals example) is retained.
Orthogonal idea: Why give info about every single kilobyte and millisecond used on every test instead of giving ACMish "AC/WA/TLE" for every subtask? I am a very big fan of instant feedback, but see no reasons to provide unnecessarily detailed info. I guess I am not bringing innovative ideas here and that has been very likely widely discussed, but I would like to hear reasons behind that very full feedback.
Only the first error per subtask, without time and memory usage, is given since IOI 2016.
So, what is the whole discussion about if we can extract only a very limited number of bits from the judge (<= 2 per test, I guess)?
50 submissions = 100 bits, IDing 10 test cases only requires about 10 log_2 10 = 30 bits. There are many problems where a `good' set of test cases can only TLE pruned brute force solutions on <= 10 out of 50 large cases.
Is 10 / 50 tests "significantly fewer" or not?
This is indeed an issue: if this establishes a clear threshold, then it would be ok to do this kind of hashing trick if the program got within 2~3 cases of correct. At present such cases will probably be excluded, although something like 10/20 will be an issue. Are you suggesting having a clear threshold? E.g. 1/4 of the number of test cases in a batch, or an exact number of test cases?
On the other hand, one wording that was considered, but dropped was, 'we reserve the option of rejudging based on a set of data generated under a different seed prepared before the contest'. Of course, the issue then becomes when can that option be invoked.
"rejudging based on a set of data"
This gives me an idea for a particular variation of the 1-"Less feedback" approach (I assume that somebody must have thought about it before, but I have not heard of it being discussed):
Prepare two analogous and interchangeable datasets (for example, having just a different seed). Completely full feedback (everything you want: every single test case, memory, time, everything) is given for the first dataset. Zero feedback is given for the second, which might even be run just after contest. Just like codeforces pretests, but with both datasets having similar characteristics, so that completely passing the first one (without cheating) gives very strong confidence in passing the second one.
On most cases "it is just like full feedback without hacking problems". But of course, that is not 100% true because there is always the possibility of completely passing a subtask in dataset1 (having had just a little luck maybe that a small bug is not caught be such dataset) and not passing it on dataset2.
But the test case number of such an error is given. That seems hard to exploit to me, but I have been informed that in many instances it is still too much information and can be "mechanically" abused, especially for Y/N outputs. (Entirely banning Y/N outputs seems like too much, since sometimes just deciding Y/N is algorithmically different from producing a solution, and the ISC might want to distinguish those two approaches.) For example, the ISC reported some contestants being very close to succeeding with such hacks during IOI 2016.
There are task types at IOI that naturally return relative scoring info, so while this may fix it for batch tasks, it wouldn't address the issue entirely.
The memory/time is indeed excessive, but the same issue is present as soon as each submit returns 4~5 bits of info in the form of the # of cases that got A/W/T (as opposed to the 1.5 bits that ICPC does).
Also, more detailed feedback (in terms of # of test cases) is much more helpful for new contestants who haven't had much experience with the feedback system before. My impression is that a lot of ICPC contestants take a while to get used to the single case feedback mechanism.
For me this rule is at least kind of awkward, if not "against the spirit of programming competitions" (in Petr's words).
Feedback containing only [WA / TL / ...] + the test case ID is fairly safe: even if the output is Yes/No, you need about O(log(# of test cases)) submissions to pin down one test case, so in total you need O(log(# of test cases) * (# of test cases you can't solve)) submissions, which can be much more than 50 if there are hundreds of test cases.
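As a rough illustration (the numbers here are illustrative, not from the comment above): with 200 test cases and 8 unsolved ones, pinning them all down would take on the order of 8 * log2(200), i.e. roughly 60 targeted submissions, already above the 50-submission limit.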
And still, people can cheat if they fail on only one or a few test cases (since you say they must solve 'significantly fewer' test cases after the test data is changed in order to count as cheaters).
There are many other solutions:
Don't give feedback on the test case ID (if you think it makes the system not safe enough). Or you can give the percentage of TL / WA / RE. AtCoder uses something similar (you can even know how many test cases there are). It sounds vulnerable, but over the years I never had a chance to 'cheat' using that feedback.
Re-generate the test cases for each submission. GCJ had this rule for small inputs, and it works fine even though you can see the test data.
Just use stronger test cases, and don't use tasks that are vulnerable.
After all, this issue is better solved on the organizers' side than by putting responsibility on the contestants. Also, this rule needs to be translated into many languages, and people may end up with different interpretations of it.
2 (regen data) is not doable due to post-contest appeals, especially in the off-chance that similar programs get different scores.
3 (use strong data) has proven to not work over the last 4~5 years: having 3 hours and unlimited # of submits to play with a single task is a rather unique phenomenon.
1 (give more limited feedback): the # of T/A/W has proven to be sufficient for this type of attack.
Also, IOI judging resource is actually rather limited because unlimited submissions + programs ran on all test cases + 300 contestants. So 100 cases seem to be a practical limit. In that setting, it's fairly possible to `fix' 8~10 Ts using this kind of approach.
"1 is not doable due to post-contest appeals"
This is purely out of curiosity: Why were post-contest appeals not possible, or somehow worse, in previous IOIs (say, ten years ago) when zero feedback was given other than the sample test case? I do not understand what stops appeals once the full test data becomes public.
Oops, I got the numbering in the responses wrong, will edit. I meant to say that 2, regenerating test data for each submission, is not doable due to appeals.
Ah yes, that makes a lot of sense :)
What do you think about shuffling the test cases before judging every submission? All submissions are judged on the same test data, but with the test cases in different orders. Whenever the judge discovers a test which the submission fails, it aborts judging this subtask and gives the verdict of that test to the contestant.
IMO this rule will be removed again after a very ambiguous case appears where it's not clear if it should be applied or not.