I recently scraped almost all of the submissions from Codeforces. Here I share all the source code and metadata (problem ID, submitter, language, verdict, etc.): https://mega.nz/folder/Sypi0BrS#iNbQXf3EwcjZbpwXRKHOnQ. The dataset contains at least 99.8% of the public submissions with ID <= 128M. In total, there are ~98M submissions.
In addition, I created a source code reverse search engine based on this dataset, which you can access at https://cfsearch.top/.
Disclaimer: The scraping process violates Codeforces' robots.txt. Use of this dataset may even violate Codeforces' terms. Use it at your own risk.
Btw, MikeMirzayanov, is it possible to share the official dataset?
Wow. Amazing.
How much time did it take you to scrape this?
~ 2 weeks
Thanks for this dataset!
The code search could be extremely useful if polished up. For example, if I want to practice, say, link cut tree problems, I can search for every submission containing the words "link cut tree" or "LCT" to find relevant problems with reference submissions/implementations. These are otherwise really hard to find because those problems often have alternative solutions that don't use advanced data structures (but require more insight to find), so you can't just sort by execution time.
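A minimal sketch of that kind of keyword search over the downloaded dataset, in Python. The record layout here is an assumption: I'm pretending each submission is a dict with hypothetical fields `problem_id`, `verdict`, and `source`, and that accepted submissions carry the verdict `"OK"`. The actual files may be organized differently, so treat the field names as placeholders.

```python
import re
from collections import Counter

# Assumed (hypothetical) record layout: {"problem_id": ..., "verdict": ..., "source": ...}.
# The real dataset's field names and verdict strings may differ.

# Matches "link cut tree", "link-cut tree", "link_cut_tree", "LinkCutTree", or a standalone "LCT".
KEYWORDS = re.compile(r"\blink[\s_-]*cut[\s_-]*tree\b|\bLCT\b", re.IGNORECASE)

def problems_mentioning(submissions, pattern=KEYWORDS, verdict="OK"):
    """Count, per problem, the accepted submissions whose source matches the pattern.

    Returns (problem_id, count) pairs, most frequently matched problems first.
    """
    hits = Counter()
    for sub in submissions:
        if sub["verdict"] == verdict and pattern.search(sub["source"]):
            hits[sub["problem_id"]] += 1
    return hits.most_common()
```

Problems where many accepted submissions mention the keyword are likely to be genuine link cut tree problems; a single match may just be someone's unused template.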
Good idea. But the current search engine cannot handle such requirements :(
Great tool. It can be used to find alt accounts of users based on the templates they use.
Hey, it looks like the website is down... Would you like to host the website again? Or if not, would you like to share the source code of the reverse search so that we can host it? Thanks a lot!
I want this data. Is there any other way to get it?
https://cfsearch.top/ doesn't work anymore