https://algorithmica.org/en/eytzinger
This will probably be an hour-long read about the CPU memory model that explains 10 lines of code at the very end.
The following code is 4 to 5 times faster than calling std::lower_bound over a sorted array:
#pragma GCC optimize("O3")
#include <bits/stdc++.h>
using namespace std;
const int n = (1<<20);
const int block_size = 16;
alignas(64) int a[n], b[n+1];
int eytzinger(int i = 0, int k = 1) {
if (k <= n) {
i = eytzinger(i, 2 * k);
b[k] = a[i++];
i = eytzinger(i, 2 * k + 1);
}
return i;
}
int search(int x) {
int k = 1;
while (k <= n) {
__builtin_prefetch(b + k * block_size);
k = 2 * k + (b[k] < x);
}
k >>= __builtin_ffs(~k);
return k;
}
tl;dr: it uses a segment tree-like cache-friendly array reordering that allows trading off memory bandwidth for latency via prefetching. Because of this, its running time could be volatile on some online judges due to "noisy neighbor" effects (see comments below).
This is technically not a drop-in replacement, since it requires some preprocessing, but I can't recall a lot of scenarios where you obtain a sorted array but can't spend linear time on preprocessing.
This method could also be used to speedup heaps, segment trees, and other static binary tree structures.
I want to test this method, but maybe I am doing it wrong? Link. Got different check value and got that
std::lower_bound
is faster. Can you help to fix?The check value is wrong because it returns element's index in its internal numeration, not the original one. I should probably put that in bold in the red somewhere, though it's possible to restore it by spending a few more instructions.
As of running time, I am not sure, but I think that it probably has something to do with the fact that 32-bit random numbers are used to query an array with 20-bit items, so that most searches are only hitting the path to the largest element and the whole caching trick stops working.
I use variations of this code for benchmarking.
Idk what is
k >>= __builtin_ffs(~k);
but it should bek - n + 1
.https://ideone.com/IeWTJ1
Upd: nah, this works only for
n = power of 2
No problem, we can extend array to power of two by
std::numeric_limits<int>::max()
.Also I tested eytzinger in comparison with recursive template. Link on ideone
Guys I don't think that benchmarking on ideone is a good idea, and I wouldn't benchmark memory-bound algorithms on non-dedicated shared-memory machines in general.
The key trick is to trade off (almost all of) memory bandwidth for latency, which is not going to work well when you have noisy neighbours possibly competing for it.
I am not sure if major online judges provide decent memory isolation, but public ideone definitely doesn't.
Memory isolation is an OS feature that's required for security and if you have to switch processes or move pages around too often, performance just tanks. This overhead is possible and could be an explanation, but I'm 99% certain that when you run your code on ideone, it doesn't share physical memory.
Yes they don't share the memory, but they do share the memory bus, and hence the bandwidth. By "memory isolation" I meant that ideally they shouldn't.
Try running this bad boi in the background on a spare core on your laptop:
My running time on the benchmarks went 3x.
Ok, so that's what you meant. I'm not sure how it works for servers compared to regular PC CPUs (if there are attempts to mitigate it), but it's a factor.