When submitting a solution to a problem, I think most people get the general impression that once you pass 30-40 test cases you are basically home free, i.e. you are highly unlikely to fail a later test case and can consider the problem solved. Despite this, many problems have upwards of 200 test cases, which can mean several minutes of grading time in the worst case, not even counting time spent in the queue.
On a related note, the judging software seems to have been heavily overburdened recently, possibly linked to the coronavirus.
I think a good feature would be to allow lots of test cases for problems that were added relatively recently, but once a problem has been solved by ~1000+ people, remove most of the test cases so submissions go through quicker. To choose which test cases to remove, the system should keep track of how often each test case is failed; test cases that are almost never failed should be removed. I strongly suspect that over 50% of test cases on most problems would fall into this category, if not 80-90% for some problems.
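The pruning rule described above can be sketched roughly as follows. This is only an illustration, not how any judge actually works; the function name, the 1% threshold, and the minimum suite size are all assumptions chosen for the example.

```python
# Hypothetical sketch: prune test cases that almost never catch a wrong
# submission, based on per-test failure counts gathered from past runs.
def prune_tests(fail_counts, total_submissions, keep_threshold=0.01, min_keep=20):
    """Keep tests failed by at least keep_threshold of submissions,
    but never shrink the suite below min_keep tests."""
    # Rank tests by how many submissions they have caught.
    ranked = sorted(fail_counts.items(), key=lambda kv: kv[1], reverse=True)
    kept = [test for test, fails in ranked
            if fails / total_submissions >= keep_threshold]
    # Safety net: keep at least the min_keep most discriminating tests.
    if len(kept) < min_keep:
        kept = [test for test, _ in ranked[:min_keep]]
    return sorted(kept)
```

With counts like `{1: 500, 2: 3, 3: 120, 4: 0}` over 1000 submissions, only tests 1 and 3 survive a 1% threshold, which matches the intuition that most tests on a mature problem never catch anyone.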
Would this bring about too many false positives for submissions? (For older problems, do we even care?)
I don't know whether such an approach brings about false positives for older problems, but if it does, it is a real problem: something like that defeats the whole point of testing. Lots of people use older problems for practice, and weakening the test cases does not seem like an appropriate way to reduce the load on the judging software.
Why not work on the second problem while the first is on the queue?
The queue time is mostly the same for all participants, so there is no unfair advantage. Of course a small, fast queue is nicer, but on the other hand a few minutes of waiting is no big deal either.
A better solution than removing test cases permanently would be something similar to what is done during a contest: judge each submission first on a smaller subset of test cases with high priority and show a verdict similar to "pretests passed". The remaining test cases can then be run later, with lower priority, when the load on the judge is low.
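The two-phase scheme suggested above amounts to a priority queue with two job classes. A minimal sketch, assuming a single shared heap and made-up names (`submit`, `next_job`); a real judge would of course distribute this across workers:

```python
import heapq
import itertools

# Hypothetical sketch of two-phase judging: a small pretest subset runs at
# high priority for a quick verdict; the rest is deferred to low priority.
PRETEST, FULL = 0, 1          # lower number = popped first
_counter = itertools.count()  # tie-breaker so equal-priority jobs stay FIFO
queue = []

def submit(sub_id, pretests, remaining):
    # Phase 1: quick "pretests passed"-style verdict on a small subset.
    heapq.heappush(queue, (PRETEST, next(_counter), sub_id, pretests))
    # Phase 2: full testing, only run once the high-priority work drains.
    heapq.heappush(queue, (FULL, next(_counter), sub_id, remaining))

def next_job():
    priority, _, sub_id, tests = heapq.heappop(queue)
    return priority, sub_id, tests
```

The effect is that every submission gets a fast preliminary verdict, while the long tail of test cases only consumes judge time when nothing urgent is queued.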
One thing I noticed is that sometimes the first 100 tests are judged quite quickly, but on later tests (like 150-200) judging can often take 20+ seconds per test for some reason.
It may be useful to turn off tests that never fail, since the total judging time is significant.
For older problems, do we even care?
Yes, I think we should, because, as mentioned in another comment, it defeats the whole purpose of testing. And you can't be sure that a subset of tests makes no difference (unless the failure rate on the removed tests, compared to the others, is something like < 5%).
I have another idea: why not shuffle the tests' order and put the ones that most people fail first (maybe keeping the samples in place)?
After a contest (which adds 5-8 problems to the archive), we reorder its test cases 10-15 days later and then never touch them again. Recent problems are the ones being solved the most, so within a few days of the contest we will have good data to base the ordering on.
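The reordering step is simple to state concretely. A minimal sketch, assuming per-test failure counts are available and that the first couple of tests are the samples (the function name and parameters are made up for the example):

```python
# Hypothetical sketch: reorder tests so the most frequently failed ones run
# first, while the sample tests keep their original positions.
def reorder_tests(tests, fail_counts, num_samples=2):
    samples = tests[:num_samples]
    rest = tests[num_samples:]  # slice copies, so the input list is untouched
    # Hardest tests first: a wrong submission gets its verdict early.
    rest.sort(key=lambda t: fail_counts.get(t, 0), reverse=True)
    return samples + rest
```

Since a wrong submission usually stops at its first failed test, front-loading the discriminating tests cuts average judging time without removing any tests at all.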
I honestly don't think this will be a big improvement, but I can't think of anything else right now and I really want us to come up with something. (I know that if anyone can help solve this, it won't be surprising for them to be from this community ^^)
Test reordering could be useful on TopCoder, where you simply pass or fail, but with ACM-like verdicts you want a reliable test index for comparison with other participants.