This one weird trick will speed up your I/O 5x

Revision en2, by purplesyringa, 2024-02-26 08:18:51

Just do this!!

#include <iostream>

int main() {
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);
    // If you use std::cout.tie(nullptr);, you are wrong.
    // This won't ever be supported.
    // You don't need this line. Remove it.
    // You have almost certainly added 12 useless bytes to every solution. They take space on disk.
    // Disks are produced by young children on factories. Do you want children to work more because
    // of you? Think of the children.
    // And about yourself too. One of the pictures on this page is loaded from my domain. I know your IP.

    // Your code here
    int n;
    std::cin >> n;
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        int x;
        std::cin >> x;
        sum += x;
    }
    std::cout << sum << std::endl;
}

This will speed up your code so much you won't have to care about spending time in I/O anymore!! Give likez plz :3


Yeah, no. That's not how it works. We're still using <iostream>, and that means we have to rely on virtual inheritance and all the deoptimizations that follow from that. Even if you use a random "here's-my-fast-io-library" snippet from Codeforces, you still depend on block-buffered output. (Also, none of those libraries mesh well with floats.) In many cases, more CPU time is spent encoding/decoding numbers than interacting with the file system.

Can we do better? ...Yeah, stupid question. I wouldn't write this post if we couldn't. Here's how far I managed to go in half a month:

Aggregated benchmark results

These results are from Linux, that's why they look reasonable.

Under Windows, my library is at least as fast as this plot shows, but it also manages to get much faster than <iostream> on integers and much faster than <cstdio> on floating-point numbers. This supports the common claim "use cstdio unless you need floating-point" which I never managed to reproduce myself because apparently the relative performance of cstdio and iostream is the exact inverse of that on Linux. Duh. (To Apple fans: your <iostream> sucks because you use libc++. Install GCC together with Linux (brew sucks), switch to cstdio or (better yet (yes I am trying to force people to use this, why do you ask?)) my I/O.)

Here's the relevant "who the hell thought releasing this garbage was a good idea" plots for Windows MinGW:

Integer input Integer output

The performance is compared to <iostream> and <cstdio>, and also to burunduk1's library as a common example of fast I/O among competitive programmers.

"Random length" means that the test numbers are generated by first choosing length uniformly and then generating this many random digits. "Random value" means the number is chosen uniformly from the valid values of the type. For instance, most of the numbers are going to be of maximal length in this case. This predictive behavior makes some stuff faster to process.

Floating-point input Floating-point output

rng is assumed to be uniformly distributed, The difference between rng and e^rng is just like between "random value" and "random length". The units are one football field over the height of 100 Big Macs, by the way. Do not compare the metrics from different tests.

String data input String data output

Nothing of interest here. Maybe LOL at if (flag) std::cout << "YES"; else std::cout << "NO"; being faster than std::cout << (flag ? "YES" : "NO"); or something.

It supports std::bitset too. Not like anyone really uses that but uhhh...

Also, notice that for whatever reason output is relatively slow compared to input. That's not just an inefficiency of any particular I/O algorithm; that limitation is due to the performance limits of writing to a file. Generally speaking, output performance increases and input performance decreases in interactive tasks (because the input file can no longer be mmapped but the output pipe can be written to without going through vfs handlers). Unless you have to flush constantly, which you do. Idk, this is probably of more interest to test system developers than anyone else.

I'm impatient. How do I use this in my solutions?

You do this:

#include <iostream>

/* Copy-paste blazingio library here */

int main() {
    // If you're used to writing the two lines below, you can still use them, they just become no-ops.
    // std::ios_base::sync_with_stdio(false);
    // std::cin.tie(nullptr);

    // Your code here
    int n;
    std::cin >> n;
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        int x;
        std::cin >> x;
        sum += x;
    }
    std::cout << sum << std::endl;
}

Here's a big blue link to the library because I know you're going to miss it otherwise:

blazingio.min.hpp

The library, called blazingio because I use Rust professionally and I signed an agreement that says I have to use the word "blazing" at least once per project, is packaged as a drop-in <iostream> replacement, which means you can use std::cin and std::cout (or using namespace std; + cin/cout) just like before.

If you're used to cin/cout, just paste the library into your template. If it behaves weirdly or you want to disable it for whatever reason, just remove it. No need to change your own code. Isn't that cool? I think that's cool. It took a while to make it work reliably. Anyway.

Are there any downsides?

That's quite pessimistic of you. You shouldn't have asked that, you should have said "what a cool project" or asked "what are the upsides?".

The main downside is that it's large. It's around 9 KB, and Codeforces submission size limit is just 65535 bytes. The repository contains technical instructions on how to build a slim version if you want to reduce the size. But unless you use another big template, this shouldn't be an issue.

Another downside is that blazingio keeps various large in-memory buffers, which means that your memory usage might increase by approximately the size of I/O. This is typically around 20 MiB max. This shouldn't be a problem in most cases, but you might want to be aware of this. Additionally, ejudge scripts have some legacy code that forces blazingio to allocate exactly 24 MiB for output buffers on some ejudge instances. This is problematic both because it uses more space than necessary and because that might turn out to be not enough. Hopefully these issues can be solved soon.

The last significant problem is that it might be, uh, useless. There used to be tasks that couldn't be solved without fast I/O in the past, but this is considered to be in bad taste by problem setters. However, even if fast I/O is not required, it might still give you an edge, enabling you to use less efficient algorithms in your solution elsewhere.

How it works

The one problem with competitive programmers is they don't see how to apply the tricks they constantly use to seemingly low-level tasks. (Another problem is that most of them share the same mental conditions, and don't talk to anyone IRL, so they don't question their mental health and think their meltdowns are just a quirk.) They are also less aware of OS and hardware internals. So maybe this could be your chance to learn something new.

Unfortunately, I do not have time for a write up at the moment. The short list is:

  • mmap for input, if it's a regular file (similarly on Windows)
  • MAP_NORESERVE-based large output buffer on Linux
  • Branchless autogrowing PAGE_GUARD-based output buffer on Windows
  • Optimized fast paths for a) input from a regular file, b) input from a pipe if the local buffer is filled
  • Workarounds for bad GCC codegen wrt. syscalls with inline assembly (had to read XNU code for that for M1 Macs; screw Apple, by the way)
  • Optimized float reading by loading digits to u64 if precision allows
  • "sqrt decomposition" (no, not really) for float exponents
  • SIMD-optimized string input and bitset input/output, including SWAR on SIMDless targets
  • FILE_FLAG_NO_BUFFERING, FILE_FLAG_WRITE_THROUGH for file output on Windows
  • Divisionless integer output via Terje Mathisen's trick, which applies to float mantissa too

If that's an interesting topic for you to hear about, please tell me and I might write another post on the internals!

Tags fast input, input-output

History

 
 
 
 
Revisions
 
 
  Rev. Lang. By When Δ Comment
en6 English purplesyringa 2024-02-26 21:30:30 20
ru6 Russian purplesyringa 2024-02-26 21:30:18 20
ru5 Russian purplesyringa 2024-02-26 21:27:11 12 Мелкая правка: ' трюки из бесполезных олимпиадн' -> ' трюки из олимпиадн'
en5 English purplesyringa 2024-02-26 08:32:14 21 Tiny change: 'plz :3\n\n---\n\' -> 'plz :3\n\n[cut]\n\n<!-- -->\n\n---\n\'
ru4 Russian purplesyringa 2024-02-26 08:32:01 21 Мелкая правка: 'канал!\n\n---\n\' -> 'канал!\n\n[cut]\n\n---\n\'
ru3 Russian purplesyringa 2024-02-26 08:22:18 21 Мелкая правка: 'n\n // Your code here\n int ' -> 'n\n // Ваш код тут\n int '
en4 English purplesyringa 2024-02-26 08:21:26 0 (published)
en3 English purplesyringa 2024-02-26 08:21:16 980 (saved to drafts)
ru2 Russian purplesyringa 2024-02-26 08:19:25 0 (опубликовано)
en2 English purplesyringa 2024-02-26 08:18:51 0 (published)
ru1 Russian purplesyringa 2024-02-26 08:18:32 9847 Первая редакция перевода на Русский (сохранено в черновиках)
en1 English purplesyringa 2024-02-26 08:18:03 9539 Initial revision (saved to drafts)