7x faster drop-in replacement for std::lower_bound

№	Пользователь	Рейтинг
1	tourist	4009
2	jiangly	3823
3	Benq	3738
4	Radewoosh	3633
5	jqdai0815	3620
6	orzdevinwang	3529
7	ecnerwala	3446
8	Um_nik	3396
9	ksun48	3390
10	gamegame	3386

№	Пользователь	Вклад
1	cry	165
2	maomao90	163
2	Um_nik	163
4	atcoder_official	161
5	adamant	160
6	-is-this-fft-	158
7	awoo	157
8	TheScrasse	154
9	nor	153
9	Dominater069	153

http://algorithmica.org/en/b-tree

Hi, it's me again. A couple of days ago I published a post about speeding up binary search by rearranging memory in a way that allows prefetching, which is basically a way of trading off memory bandwidth for latency.

As noted in the comments, it had a noisy neighbors issue when tested on online judges, as different solutions evaluating on the same machine could compete for the same memory bandwidth. The speedup was like 2x-3x (and very volatile) on Codeforces while on my laptop and my dedicated server it was 4-5x, depending on the array sizes.

And so I rewrited it using B-tree layout instead, and it worked, because it bluntly requires 4x less memory reads. It is still quite volatile on CF—compare this (6.6x) and this (5.1x)—but I've never seen it drop below 3x on adequate array sizing (yes, the title is a bit clickbaity; I think the average speedup is around 5x).

There are two implementations:

One is straightforward and uses nothing fancy
The other is compute-optimized with AVX2 instructions and is ~30% faster, but doesn't work with older hardware (notably ideone and Yandex.Contest servers)

For more details, check the article.

Also, it would be really cool if someone could figure out a way to make compiler produce an autovectorized version of the faster one, so that it could be more easily extendable. It's weird and annoying that the compiler can't produce these 3 lines of code.

Комментарии (2)

Написать комментарий?

dmkz

3 года назад, # |

← Rev. 2 →

+26

Actually, it isn't full replacement of lower_bound, because, for finding index in this approach we need to store pair<int,int>, where index is second argument of pair.

In AVX2 approach we need to divide one block in vector [x0 x1 x2 x3 ... | i0 i1 i2 i3 ...], where first half is values, second half is indexes.

SSE4.2 can be used at the Yandex.Contest with 128-bit registers.

UPD. Maybe we can calculate index in original sorted array based on the position in Btree?

→ Ответить

sslotin

3 года назад, # ^ |

+60

I wrote a new version that is much faster and also doesn't require any additional work for finding indices (in fact, it is a few cycles faster if you only need the index).

I've actually made several variants, all based on structures I'm going to call S-trees and S+ trees, which are modifications of B-trees and B+ trees. The fastest one achieves 12-15x speedup over std::lower_bound on small arrays (that fit in cache) and 7-8x speedup on large arrays (>1e6). I am working on a version for Windows / GCC so that it can run on CodeForces, but speedups are not that large yet (~10x / ~6x, because GCC sucks).

This comment is sort of a teaser. I will publish an updated article sometime next week, after a few more minor optimizations I have in mind.

Блог пользователя sslotin