Fast Binary Search Template instead std::lower_bound for special cases

→ Pay attention

Before contest
Codeforces Round 1006 (Div. 3)
3 days
Register now »

→ Top rated

#	User	Rating
1	tourist	3856
2	jiangly	3747
3	orzdevinwang	3706
4	jqdai0815	3682
5	ksun48	3591
6	gamegame	3477
7	Benq	3468
8	Radewoosh	3462
9	ecnerwala	3451
10	heuristica	3431

Countries | Cities | Organizations

View all →

→ Top contributors

#	User	Contrib.
1	cry	167
2	-is-this-fft-	162
3	Dominater069	160
4	Um_nik	158
5	atcoder_official	157
6	Qingyu	155
7	djm03178	151
7	adamant	151
9	luogu_official	150
10	awoo	147

View all →

→ Find user

→ Recent actions

Detailed →

dmkozyrev's blog

Fast Binary Search Template instead std::lower_bound for special cases

By dmkozyrev, history, 5 years ago, In English

Hello everybody!

If you want to use binary search over array with known at compile time size, and this size can be extended to power of two, you can use next simple template:

template<int n, typename T>
int lowerBound(const T* __restrict arr, const T val) {
    if (n == 1) { return val > arr[0]; } 
    else {
        constexpr int mid = n / 2;
        if (val > arr[mid]) { return mid+lowerBound<n-mid, T>(arr+mid, val); }
        else { return lowerBound<mid, T>(arr,val); }
    }
};

Testing on ideone shows, that it is faster in 4 times in comparison with std::lower_bound.

Thanks for reading this.

UPD. Testing for doubles, in 3.5 times faster.

#binary search

dmkozyrev
5 years ago
8

Comments (8)

Write comment?

pajenegod

5 years ago, # |

+22

There is an issue with how you do your testing.

std::vector<int> queries(1 << 9);
for (auto &it : queries) { it = dist(gen); }
...
for (int i = 0; i < (1 << 15); i++) {
    for (auto q : queries) {
        res += f<1<<20>(&arr[0],q);
    }
}

The compiler could make use of the fact that the inner loop doesn't depend on i. I tried removing the i for loop and switching to $$$2 ^ {23}$$$ queries, see here. The times are now much more similar. 1.56 s vs 1.19 s.

→ Reply

dmkozyrev

5 years ago, # ^ |

← Rev. 2 →

This testing method was implemented with goal to reduce cost of loading queries from memory to cache. If compiler will use fact, that inner loop doesn't depend on i, then only one iteration will be executed, because there is not JIT-compilation in GCC. We can see, that runtime is a lot, so, I think, that $$$2^{15}$$$ iterations were executed.

Another test with xorshift32 random generator shows 1.68 s vs 1.08 s.

→ Reply

Clone3

5 years ago, # |

← Rev. 3 →

+15

First of all: Your results are very much skewed towards lowerBound. This could be because of several reasons

1) You first execute std::lower_bound thus filling data into caches perhaps allowing better execution for the following iteration. When trying to prove that something is slower than something, isolate things. Probably execute on different CPU cores and such too. I'd suggest making two separate binaries to avoid such effects and run them multiple times

2) Welp, you know this could be because your code invokes undefined behavior? You pretty much attempt to store $$$2^{20} \cdot 2^{23}$$$ in int res. You can test it by adding #pragma GCC optimize ("trapv") so that your code exits on int overflow (but don't use it when comparing performance results)

3) By initializing static variables and other stuff you skew the results towards the compile-based solution. This is pretty easy to isolate in this case by writing a generator that pipes exactly the same data into your program whilst not being compile-time available.

Actually

→ Reply

MrDindows

5 years ago, # ^ |

← Rev. 6 →

+28

couldn't understand any of the optimizations

I will tell you.

The first one and the most impressive one you can see on the image below:

Spoiler with image

Here compiler copies all 255 array elements, which can be accessed in the first 8 steps of binary search, to the stack memory, so they may be accessed without cache misses. And it's wow! I didn't expect such deep level of optimization logic from the any of compilers!

But the sad part of this story is that this happened only because many iterations of the same test are performed, and the compiler knows this in advance. In most of real problems, such an optimization cannot be expected. So the test itself is really not so good.

The second optimization is quite simple and obvious: compiler groups several subsequent calls of template function, so size and indices are known for these steps in compile time, and checks of binary search look like sequence of cmp, jmp, cmp, jmp without any indexing math.

Spoiler with second image

PS. My test: https://ideone.com/AL5ZNd

→ Reply

Clone3

5 years ago, # ^ |

← Rev. 2 →

Ah, so it is caching after all.

Think practical implementation of such binary search would look like this (but this particular optimization should cache last N calls instead if first ones) :https://codeforces.net/blog/entry/65278#comment-492839

But I couldn't prove the idea of cache locality via valgrind, so I didn't think in that direction. Can you elaborate on how to prove that such optimization improves L1 hits by some cpu counting tool?

→ Reply

Xellos

5 years ago, # ^ |

to the stack memory, so they may be accessed without cache misses

You're right that there's caching, but it's not caused by allocating and copying to the stack, but by the fact that just accessing this memory puts it to L1 cache. The stack is good because it's often faster to compute addresses, memory access speed only depends on locality of the data you're accessing.

→ Reply

MrDindows

5 years ago, # ^ |

You are right and a little bit wrong in the same time =) Accessing the memory really puts it to L1 cache, but there is no guarantee that it will remain there. There is an optimization here exactly because of locality of copied data, so instead of accessing so sparsely located elements of 8 steps of binsearch, we do that on copied data, which is located densely. And it really doesn’t matter to us whether it is a stack memory or not, I just stated the fact of this case.

Actually we can use such optimization of binary search by ourselves, and it was described in this blog (only RU) 9 years ago. There, the author used three levels of data and got 60 percent speedup.

→ Reply

Xellos

5 years ago, # ^ |

There is an optimization here exactly because of locality of copied data
memory access speed only depends on locality of the data you're accessing

Yes, we're saying the same thing then. It just sounded like you said that the stack helps caching.

(Technically, data doesn't have to be densely packed to fit in cache. It just has to be packed on distinct cache lines — in practice, this is some requirement on lower bits of their addresses. This is architecture-dependent, so I guess the compiler makes sure it's not a concern.)

→ Reply