Hello, community! I wonder how plagiarism detectors are coded? I think CodeForces only checks for exactly the same code. Maybe it's time for the community to help our admins implement a good and reliable plagiarism detector. Actually, I have some ideas!
First of all, the following idea does not work for problems with very little code, like Div2A or Div2B problems (and even some harder ones).
The actual point is recording the order of the main operations written in the code. By "main" I mean we need to ignore lines such as variable declarations or library includes. For example, let's take a code from this post.
For this code, our script generates this list:
input
loop {
    input
}
loop {
    input
    vector_pushback_operation
    if {
        equal_operation
    }
}
loop {
    if {
        inc_operation
    } else {
        inc_operation
        loop {
            inc_operation
        }
    }
}
output
Of course, this is just a rough idea and it must be improved to make a good detector. But guess what! These simple and ridiculous lines are EXACTLY THE SAME for all the codes in the aforementioned link. I have some other ideas to improve this, but first, of course, your opinions are important to us.
So, what do you think? What are your suggestions? Will the admins use our help? =))
Where is the script?
Oh, sorry if my writing implies I wrote a script :))
I didn't write it, but it doesn't seem very hard to write (though it can be tedious, like other long projects) :)) Of course, the list will not be exactly like that (I mean, not in English words); the script could simply compress every operation and give it a number.
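Something like this rough sketch, purely hypothetical (the regex patterns and the numeric codes below are just illustrative assumptions, not a real checker):

import re

# Map each "main" line of a C++ submission to a numeric operation code,
# ignoring declarations, includes, whitespace and variable names.
OPCODES = [
    (re.compile(r"\bcin\s*>>|\bscanf\s*\("), 1),    # input
    (re.compile(r"\bcout\s*<<|\bprintf\s*\("), 2),  # output
    (re.compile(r"\bfor\s*\(|\bwhile\s*\("), 3),    # loop
    (re.compile(r"\bif\s*\("), 4),                  # branch
    (re.compile(r"\belse\b"), 5),                   # else branch
    (re.compile(r"\.push_back\s*\("), 6),           # vector push_back
    (re.compile(r"\+\+|--"), 7),                    # increment / decrement
    (re.compile(r"=="), 8),                         # equality test
]

def fingerprint(source):
    """Return the sequence of operation codes in the order they appear."""
    codes = []
    for line in source.splitlines():
        for pattern, code in OPCODES:
            if pattern.search(line):
                codes.append(code)
                break  # one code per line keeps the sketch simple
    return codes

# Two submissions with identical fingerprints would then be flagged for manual review:
# fingerprint(open("a.cpp").read()) == fingerprint(open("b.cpp").read())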
Though these are just my humble thoughts :)
In my humble opinion, matching the structure of the code can raise many false alarms. I mentioned this in a comment on another blog, but I am going to repeat it here. Those who spent their time solving this problem should not be penalized for independently coming up with the same idea and writing similar code. The output of an automated plagiarism detection program should not be taken for granted. Complaints about false alarms should be dealt with professionally and fairly, not ignored. The benefit of the doubt should always be given based on the credibility of the community members. Codeforces should not naively celebrate the unfairness of false alarms.
A final note: I wonder why you did not use your own code to share your suggestion for improving plagiarism checking, and chose instead to use someone else's code without his consent.
Of course, I mentioned many times that the checker must be improved a great deal before we can rely on it. Surely, our admins know best and they will not make a wrong decision on this.
I think your final note is pointless. Firstly, source codes are fully available on CodeForces, so I don't think I need to notify the author. Moreover, this blog talks about improving the plagiarism detector, so it is better to give an example with similar codes. I do NOT accuse anybody here, and as you can see I didn't mention anybody in the blog; I didn't even name the author of the code.
Well, if you study the foundations of detection theory, you should be able to appreciate that the best claim an automatic detector can make is an upper bound on the probability of false alarm events. A perfect, false-alarm-free detector is a theoretical limit that practical detectors can only approach.
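To make this concrete with a textbook formulation (nothing Codeforces-specific): a Neyman-Pearson-style detector $\delta$ is chosen to maximize the detection probability subject to a cap on the false-alarm probability,

$$\max_{\delta} \; P_D(\delta) \quad \text{subject to} \quad P_{FA}(\delta) \le \alpha,$$

and $\alpha$ can be made small, but forcing $\alpha = 0$ usually means the detector refuses to flag anything at all.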
RE: The final note
In my humble opinion, the availability of source codes on Codeforces does not justify copy-and-pasting someone else's code into your blog without his consent.
I don't claim anywhere that I will write the checker. We have very professional and experienced people in this community, and the purpose of this blog is to kindly ask for their help. I shared my own idea because I think it can be improved enough to become a good checker.
Note again that I do not think the checker should be applied to all problems; of course, we will have some exceptions. And I believe our admins are experienced enough to decide what is best.
Well, why did you choose to respond to his comment without his consent? If you demand consent for everything, you're going to end up with these ridiculous questions.
"So, what do you think? What are your suggestions?"
This is the last sentence in the blog! Its owner asked for comments.
Why did you choose to respond to my comment without my consent?
I did not copy-and-paste someone's code into my blog to make such a comment. Try to comprehend, fairly and objectively, the difference between the two questions.
That has nothing to do with my question, stop dodging it!
As your comments imply that you are defending the author of the code, take it this way: the code above is not his code, I just wrote it myself, and magic! For this problem most people's ideas are the same, yes? And the amount of code is small, yes? So, just a coincidence.
Shit, it went off topic. I'm asking for comments about improving/creating a detector; we are not talking here about taking someone's code without consent. Why are you guys acting like children?
Lol sorry for trolling on your blog, but the idea of demanding consent for code is just asking for it.
On topic, I agree that plagiarism checking is still failing, but when the code is simple enough, there are going to be a lot of false positives when considering only code structure. In the end, you can't distinguish between two people independently writing code that does the same thing and someone copy-pasting main into a different template and changing spacing or variable names.
_If you demand consent for everything,_
Who said that? You reached the same conclusion about false alarms. Yet, you made a false generalization. Stop making generalizations and do not celebrate unfairness to someone.
The main point of my original comment was to remind that it would be unfair to ignore complaints about false alarms. Yet, you chose to focus on the footnote about the code in the blog. I have no further comments and do not wish to pursue any discussion, whether with your consent or without it.
There was a VK Cup wildcard round whose task was to write a plagiarism checker. But looking at what's happening, at what is detected now and what is not, it seems none of the checkers submitted in that round are used in production now.
Well, I wasn't aware of that. I think some years ago we had a better plagiarism detector (because some non-obvious solutions were getting skipped). I don't know what the reason for not using a better checker could be.
Hi. Yes, 4 years ago we hosted a special contest to develop an antiplagiarism tool. The contest was really successful, and the winner's code worked great. I don't think it is easy to write a much better tool without deep research. Each round I use a slightly improved version of the winner's code. Every round I find tens of cheaters. For extremely easy problems I require almost complete equality of codes, because I am wary of false-positive verdicts. See, finding cheaters automatically is always a trade-off between the probability of false-positive verdicts and the accuracy of finding cheaters. I am sure that we are making significant efforts to find cheaters and to stay ahead of other platforms in this respect.
Hi, Mike. I'm really glad to hear that! I know we need to be really careful with false verdicts, and I understand your responsibility. But every round there are some obvious cheaters whom the system couldn't detect. Can it be improved further? Anyway, I trust your experience, and your decision is probably the best one.
I think such cheaters mostly cheat on the easiest problems, where I can't use advanced detection techniques, or they are really on the edge between detectability and massive false-positive verdicts.
Yes, you are right. Thank you for your attention and replies.
I think it's time to close this topic before having more conflicts with some users :))
What actions are taken against cheaters? A lot of times I see users just resubmit their code to get skipped if they do badly in the round, and I've seen some of them do it more than once with no actual action taken against them. I mean, what's the point of detecting a cheater if no punishment is applied?!
In my opinion, maybe we could use MOSS instead of building a checker from scratch, since that would be reinventing the wheel. For a given problem, we could probably use a certain threshold of similarity percentage to decide that someone is cheating.
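A minimal sketch of the threshold idea, assuming pairwise similarity scores are available (here Python's difflib is just a stand-in for MOSS-style percentages, and the 90% threshold is an arbitrary example):

import difflib

def similarity_percent(code_a, code_b):
    """Rough textual similarity of two submissions, in percent."""
    return 100.0 * difflib.SequenceMatcher(None, code_a, code_b).ratio()

def is_suspicious(code_a, code_b, threshold=90.0):
    """Flag a pair for manual review if similarity exceeds the problem-specific threshold."""
    return similarity_percent(code_a, code_b) >= threshold

The threshold would clearly have to be tuned per problem: short Div2A solutions are naturally similar, so one fixed global value would flood the judges with false positives.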
MOSS is not a bad option. But then its usage must be smart; otherwise system testing would take a very long time (maybe a day) to complete.
The described script might be replaced by comparing the generated assembly. With optimizations on, the compiler may even be able to normalize away some simple tricks that make the source codes look different.
And I think this job shouldn't cause too many other negative effects, such as a significantly longer system test time. We don't need to waste too many resources on those cheaters. A rough sketch of what I mean is below.
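This is purely hypothetical (the g++ flags, the label-stripping regex and the line-based similarity are my own assumptions, not an existing Codeforces tool):

import difflib
import os
import re
import subprocess
import tempfile

def assembly_fingerprint(source_path):
    """Compile a C++ file to assembly with -O2 and keep only normalized instruction lines."""
    with tempfile.NamedTemporaryFile(suffix=".s", delete=False) as tmp:
        asm_path = tmp.name
    try:
        subprocess.run(["g++", "-std=c++17", "-O2", "-S", source_path, "-o", asm_path],
                       check=True, capture_output=True)
        lines = []
        with open(asm_path) as f:
            for raw in f:
                line = raw.strip()
                # Drop labels, assembler directives and comments; keep real instructions only.
                if not line or line.startswith(".") or line.startswith("#") or line.endswith(":"):
                    continue
                # Blank out local labels and mangled names so renaming does not matter.
                lines.append(re.sub(r"\.L\w+|_Z\w+", "LBL", line))
        return lines
    finally:
        os.remove(asm_path)

def assembly_similarity(path_a, path_b):
    """Share of matching assembly lines between two submissions (very naive)."""
    return difflib.SequenceMatcher(None, assembly_fingerprint(path_a),
                                   assembly_fingerprint(path_b)).ratio()

# assembly_similarity("sub1.cpp", "sub2.cpp") close to 1.0 would be suspicious.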
I think the best and easiest way is just to ask the Ideone administration to disable "Recent codes" during contests or to enable "Private" by default. It would significantly reduce the number of cheaters.
I have a very good idea how to detect a lot of cheaters, but I don't want to share it in public because (obviously) it can be worked around
I think the plagiarism checker does not directly match everyone's code against everyone else's. It checks only if two users are friends or from the same institute.
No, it matches.
https://codeforces.net/contest/1107/submission/49011870
https://codeforces.net/contest/1107/submission/49011833 (he is the code's author)
I've been experimenting with JPlag (google it, it's a handy tool) for exactly this purpose, plagiarism detection in programming competitions, for a few days now. It does detect some obvious cases like copy-paste-rename-variables plagiarism or splitting/merging code into functions, but sometimes it is really hard to say whether code is plagiarized or not, even for a human judge. For instance, the plagiarism checker goes totally crazy on some popular dynamic programming problems, whose solutions are sometimes very close to each other. Also, you should consider the volume of submissions per contest and its complexity. The JPlag version I tried matches source files each-to-each (parsing source code, building syntax trees, etc.), and it took some time to check just about 200 submissions from a local contest, a minute or two, I think; each-to-each matching of n submissions means n(n-1)/2 pairs, so 200 submissions is already about 20,000 comparisons. So a really good antiplagiarism tool needs some extra computing power, you know. Nevertheless, I'm going to proceed with my research, as I think it's an interesting topic. Maybe I'll describe the experiment results later, if you're interested.
Thanks for sharing your experience in developing plagiarism checkers. Perhaps Codeforces should consider including a brief ethical reminder as a footnote in every contest announcement [in addition to the information presently stated in the contest registration form]: "Fairness is as important as good coding skills. Plagiarism is strictly prohibited."
stop you will bug the whole website by doing this because you don't know how to code properly....