On Input Formatting and How Lists are Represented in Input Files

№	Пользователь	Рейтинг
1	tourist	3985
2	jiangly	3814
3	jqdai0815	3682
4	Benq	3529
5	orzdevinwang	3526
6	ksun48	3517
7	Radewoosh	3410
8	hos.lyric	3399
9	ecnerwala	3392
9	Um_nik	3392

№	Пользователь	Вклад
1	cry	169
2	maomao90	162
2	Um_nik	162
4	atcoder_official	161
5	djm03178	158
6	-is-this-fft-	157
7	adamant	155
8	awoo	154
8	Dominater069	154
10	luogu_official	150

When preparing the problems for the February 25th NA ICPC regionals, we had some discussion around how lists should be formatted when given as input to contestants.

The primary argument motivating some design decisions was consistency — namely that all lists should be formatted the same way, where a list should be expressed by starting with the length of the list, $$$n$$$, and then writing $$$n$$$ lines, one item per line.

This is pretty standard and for most problems, this is the convention that most problems use for most lists. The main counterexample to this case seems to be when the list is a list of integers. A lot of problems, when confronted with a list of $$$n$$$ integers, will write out the list as one line of $$$n$$$ space-separated integers. This seems to be the meta for most contests, though I have not looked very carefully at this so this assumption may be wrong. NA ICPC is one of the contests that generally follows line-delimited lists of integers.

Here's my question to the competitive programming community at large — as a contestant, how do you feel about input formatting of this form?

Consistency is the most important thing, so all lists should be written as $$$n$$$ followed by $$$n$$$ lines, one per object.
Consistency is not that important. For complicated lists where the object is not just a single string or an integer, I prefer lists to be written as $$$n$$$ followed by $$$n$$$ lines, one per object. However, for lists of integers or strings, I prefer $$$n$$$ on one line followed by another line with $$$n$$$ space-separated tokens.
Consistency is not that important, and I don't care how lists of $$$n$$$ integers are written, since I'll just read them the same way no matter what.
I don't believe that lists by default should be written as $$$n$$$ followed by $$$n$$$ lines, one per object.

I also want to ask a similarly related question — as a contestant, have you thought about consistency of input formats in programming contests?

Yes, and it bothers me when contests are internally inconsistent with regards to reading until EOF or a sentinel, or having a prespecified $$$n$$$.
Yes, and it bothers me when different contests follow different formats.
Yes, but I don't really care.
No.

Комментарии (10)

Написать комментарий?

mphillotry

22 месяца назад, # |

← Rev. 4 →

+164

i move to abolish 'end of list denoted by 0/-1/eof'

on that note, i personally think one object per line for, say, a list of integers, is very hard to read as a human, even more so with multicase. consider the following 3 lists:

→ Ответить

adamant

← Rev. 2 →

+27

I think for integers it's nice to have all $$$n$$$ in the same line, as it requires less vertical space and the statement, and also it's nice to read it in Python like list(map(int, input().split()))? It is also visually more distinguishable if the input has more than one list.

Unless it is $$$t$$$ subtests with one integer each, in which case each subcase should generally start from a new line.

ko_osaga

+120

I have coordinated many contests and my position is strictly anti-consistency. My policy is:

If you are in an organizing team, never talk about consistency and grammar unless someone asked.
If you are truly bothered and can't sleep because of it, First — do it only if you've finished your duties. Second — change it without wasting others' time. Third — inform them.
Express gratefulness for your teammate who generously allowed you to change the grammar and consistency rules. Some people think the opposite way. It isn't.

(The point is not about grammar, the point is about nitpicking. For example, I am not a native English speaker, so my English actually have serious grammar issues. In this case it's not a nitpick, it's an issue. In this case I requested xiaowuc1 to review my grammars.)

Contests exist to demonstrate problems — statements are good if and only if people understand them clearly. There exists no other goal than this. The problem with consistency is:

It's just a waste of time. People waste so much time on consistency, grammar and all nitpicking bullshit. But after the contests we have weak tests and unprepared editorials, and people still don't understand that grammatically-perfect statements. Why does that happen? Statement testing is not a grammar exam. It should be about thinking from the solver's perspective and clarifying what people might not understand. If you did all your duties and have free time, do check grammar and consistency, but honestly, in almost all contests we are out of time. It's simply a very sub-optimal way to increase user experience. Understand and do what participants really need.
It doesn't create better problemset. A consistent way is not always the best way to solve issues. The case in the article is already a good example. Another example — I was in a team which prepares IOI TST problems (all function implementation). Someone wanted to put function invocation in texttt ($$$\texttt{int f(int x)}$$$) and all function variables in braces ($$$x$$$). This is a reasonable idea. But there was a problem where the function received an argument with an underscore ($$$the\_number$$$). It made the statement look ugly. I decide to write it as $$$\texttt{the_number}$$$. What is the consistent way? I don't give a fuck, I just like this way of underscore.
Flame wars. Different people work as a team and prepare different problems. They all have their own standards and ideas on how to format inputs and statements. Consistency means some people's standards should be ignored. That could happen, but sometimes they fight to defend their genius ideas on spaces versus newlines. Please, shut up! Let others do what they want, and do your thing.

Of course, I'm not against all nitpicking. For example, some grammar mistakes cause confusion. It's a bad expression and should be fixed. But if the expression is grammatically wrong but unambiguous, it is not an issue. It's okay to have such issues even in the final version.

There could be some contestants who are bothered by the internal inconsistency in a single problemset. But as long as it does not make them do mistakes (e.g mod 998244353 and 1000000007 in a single problem set can prompt a mistake, it should be better avoided if possible), those tickling feelings are exactly the contestant's problem, not the contest's problem. Contests should do the right thing even though people may not like it.

Xellos

+45

Can your contestants read the samples without pain? Consider the following:

Number of test cases, number of elements in the list, one per line... except there's nothing to guide the eyes to immediately distinguish that. The best course of action is to write it down on paper in a better way and never look at the original statement again.

Consider also that consistency among all possible lists together is actually inconsistent because lists of different objects naturally behave differently. A list of lists should be one per line, a list of integers should be in one line, for strings either is fine, for a general 3d list there are no good options since we're constrained to a 2d medium. In a lot of specific cases like a list of lists of strings or binary values, you also intuitively know what to do.

You should aim to be consistent in things like keeping the same type of input formatting for the same type of input, put the length before every list (or don't, but doing it is more convenient), use the same input validation tools everywhere etc.

Geothermal

+42

Can anyone who prefers newline separation justify why this format is better than space separation? I prefer space separation for pretty much the same reason others have described--it's vastly easier to read when working out sample cases on paper. Perhaps this makes things marginally easier for users of other languages, but off the top of my head I can't think of nontrivial benefits to Python/Java users.

Also, big +1 to ko_osaga's comment. This decision does not seem sufficiently high-impact to justify spending too much energy arguing over it.

anpoli99

22 месяца назад, # ^ |

+34

When I used Java, I always used BufferedReader + StringTokenizer, and actually I remember thinking it was nice when there was newline separation (so I didn't have to make a pointless StringTokenizer). Probably this counts as "trivial", but I'd argue it's always trivial, since of the time spent solving a problem, maybe 2% is spent on reading the input, and probably not even that much.

I think if there's multiple test cases, the far more helpful thing would be to explain the cases below, or add the new(ish) feature in Codeforces where when you hover over it, it shows you which test case is which (on paper I imagine you could have a label next to the input box, but I've never seen anyone try this). That would keep it readable no matter how messy-at-a-glance the input format needs to be.

Gassa

+26

This is an interesting question.

And people in the comments argue that there are matters more important for them than consistency. One is readability. Another is cost vs. benefit of making stuff consistent.

Which brings the question: why is consistency important? The original post just states that consistency was the primary argument motivating some design decisions. But it does not say why.

So, why? I'd much like to see that spelt out as well.

xiaowuc1

I have provided an anonymized sample of the conversation that motivated me to write this blog post. I have de-anonymized myself (I'd like to know whether I'm wrong on some things I claim), everyone else remains anonymized. The person arguing for consistency I have named as "formatter" to differentiate from other judges.

judge A: "what is wrong with numbers on a line?"

formatter:

"it's harder to create/see data"

"it generates monstrous lines"

"it also lends itself to the simple error of inconsistent spaces or spaces at the end of a line, which can cause some problems in some languages"

judge B: If it is readability, you can always say "N whitespace-separated integers" and format the input any way you like.

Spaces at the end can be an issue with one or several integers on the line, not sure why that matters, we do have input validators

formatter: it's much less likely with a single integer. if you're writing a program to create data it's easy to put that at the end

formatter: that's not my only argument

formatter: and i trust my validator to test that, but i don't know the other [validators] well enough to trust them

judge B: I trust xiaowuc1's validator

formatter: my main argument is consistency with other problems

formatter: we should do the same things in the same ways

xiaowuc1: this is an appeal to convention but I can't think of any major ICPC contest outside of NA that enforces this

formatter: for me, that's not a compelling argument

formatter: i value consistency within the problem set over concurrence with regions outside of NA

formatter: but that's just me

xiaowuc1: consistency is not an argument for this position though

formatter: uh, yes it is

xiaowuc1: you could just as well say by convention we do space-separated?

formatter: then we should do that on every problem with a list of integers in the input

xiaowuc1: like a priori why is your position preferred

xiaowuc1: I don't understand your consistency because I feel like you're just arguing for one token per line

formatter: i'm arguing to input lists of integers consistently throughout the problem set

formatter: in fact, i'm arguing to input lists of anything consistently throughout the set

xiaowuc1: I understand that consistency is good and we should try to be consistent where possible so that contestants expect the same thing over and over again, and that's why we do things like express bounds in the same way and so on, however, I think we should not be consistent if it compromises the clarity of the input file for the reader, and I think "loss of clarity" is a much worse experience than "loss of consistency" usually a line denotes a single concept or item — in this problem each line happens to denote one of the lists (be it increments, decrements, or query times), and that's something that does tend to be consistent (a line representing one coherent "unit") if you have many long lists and the input is just

length of list 1
item 1
item 2
...
length of list 2
item 1
item 2
...

it's also not particularly human readable

formatter: we're only talking about the sample io here

formatter: the contestants can't see the secret

formatter: and the samples can be made small enough

xiaowuc1: sure, though making the samples small mitigates it purely from the perspective of the samples

xiaowuc1: if a contestant wants to generate their own test file they will get a file that is longer by number of lines

xiaowuc1: there's definitely some small-middle ground where line-by-line tokens does much worse than token-by-token lines, though there is also the argument that most teams write code in a way where this doesn't matter anyway

formatter: most good teams

xiaowuc1: I would claim that weaker teams are more likely to be hurt by line-based than token-based because it's easier to see that a line is missing one of 5 integers than one of five lines is missing, but that's an aside

formatter: that's a stretch

formatter: i don't think i buy that

← Rev. 3 →

Interesting. Thanks for the perspective. To add to these, maybe:

I'd think an enforced "one token per line" rule would allow to make input more "librarized", that is, write a generic code to read (edit: and validate) "vector of anything", then use it everywhere. Still, if people go to such lengths, they can "librarize" reading any-whitespace-separated tokens just as well.

Also, there's a notion that Python beginners have a harder time writing a = list(map(int, input().split())) than just reading integers one per line. Somewhat important when the audience is elementary school students — which, here, often happen to be Python beginners.

Anyway, personally, I value readability more than mechanical consistency. Besides, the validator argument doesn't really hold water since, with testlib.h, it's just fine as:

int n = readInt (minN, maxN, "n");
vector <int> a = readInts (n, minA, maxA, "a");

...and if validating without testlib.h, the authors might have to talk about this very fact instead.

greencis

+18

I don’t agree that it’s harder. Either this is exclusive perception of that person or they simply cannot explain what they mean.

On the contrary, it generates monstrous number of minuscule lines which are very inconvenient to browse in Notepad/Lister/any other viewer program. Also, if test cases have different number of lines, it is difficult to find a particular place in the file.

Some trust issues we have here. Can you really trust everything that is not prepared by you? It’s very time-consuming to ensure validity of every single file yourself, that’s why there are many judges preparing the contest.

Why? Because the “formatter” wants it this way? Well, for me, that’s not a compelling argument.

formatter: that's a stretch
formatter: i don't think i buy that

Judging by the poll results, the majority of Codeforces users don’t seem to buy the “formatter’s” opinion either.

In my opinion, the main problem is not whether you should print a list of integers in a single line or in multiple lines. Contestants will sort it out anyway. The main problem is, such disputes can block the process of preparing the contest and even make developers quarrel with each other. Ability to find a consensus is more important than ability to invent strict rules and follow them with zeal.

Блог пользователя xiaowuc1