Does anyone know why the compiler generates 64-bit scalar operations for __int128 instead of SIMD instructions?
In this demo, the compiler emits 64-bit instructions, but replacing __int128 with any smaller type (int, long long, etc.) enables vectorization and unrolling.
Is using 64-bit operations simply faster than moving values into and out of the 128-bit SIMD registers?
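
For anyone who wants to reproduce the comparison, here is a rough sketch of the kind of loop I mean. It is not the demo itself: the names `sum`, `sum_i128`, and `sum_i64` are hypothetical, and it assumes GCC or Clang on x86-64, since `__int128` is a compiler extension rather than standard C++.

```cpp
#include <cstddef>

// Hypothetical reduction, templated so the same loop can be compiled
// for __int128 and for a narrower integer type.
template <typename T>
T sum(const T* data, std::size_t n) {
    T acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += data[i];  // the addition whose codegen is being compared
    }
    return acc;
}

// Non-template wrappers so each instantiation shows up as its own
// symbol in the generated assembly.
__int128 sum_i128(const __int128* data, std::size_t n) {
    return sum(data, n);
}

long long sum_i64(const long long* data, std::size_t n) {
    return sum(data, n);
}
```

Building this with something like `g++ -O3 -S` and comparing the two function bodies should show the difference: the `long long` version is a candidate for packed SIMD adds, while the `__int128` version is typically lowered to pairs of 64-bit add and add-with-carry instructions.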