Okay guys, I know it sounds like Top 10 optimizations 2017- (collectors edition) but hear me out
Hi everyone!
Recently, a problem ordered set has been added to Library Checker.
Essentially, the problem asks you to maintain a set $$$\{a_1,\dots,a_n\}$$$ that supports the following operations:
- Insert an element;
- Erase an element;
- Find the $$$k$$$-th smallest element;
- Find the order index of an element;
- Find lower bound of an element;
- Find pre-upper bound of an element.
Of course, all these can be done with GNU policy-based data structures (pbds) in $$$O(\log n)$$$ per query, which leads to the shortest code. At the same time, it has a huge constant factor in practice, so a natural question arises. What would be the fastest way to solve this? One of my earliest blogs already addressed this: we can use a Fenwick tree over the bitmask of the set and then do a jumping binary search to get the $$$k$$$-th element. I implemented this in two separate structures: fenwick
and fenwick_set
.
The first one is the actual implementation of a Fenwick tree, while the second is a wrapper with set-like interface. Then, the fenwick
structure is standard and straightforward, including finding lower bound of a prefix sum:
Then, the replies to the queries would look like this:
Here, we use an additional bit vector present
to test if the element is actually present in the set. Submitting this solution, we find out that it works 213ms, while the top solution is around 140ms. How come? Isn't Fenwick so fast it's close to $$$O(1)$$$ in practice?
I brought this topic up in the Discord AC server, and chromate00 suggested quite a clever trick to enhance it further: Just integrate the Fenwick tree more with our favorite little mrmriend bitset! What it actually means is, since we represent a set, we can split its bitmask into blocks of 64 bits, each stored in a single uint64_t
, and then Fenwick tree will be used to approximately answer queries based on popcounts of these 64-sized blocks, while the last part would be covered by bit magic.
In other words, this allows us to reduce memory consumption of the Fenwick tree itself from $$$n$$$ to $$$\frac{n}{64}$$$ (should be slightly better for cache too), and correspondingly query performance from $$$\log n$$$ to $$$\log \frac{n}{64}$$$. Note that for this, we also need to quickly find the $$$k$$$-th set bit in a 64-bit integer, and also find the number of set bits before the $$$k$$$-th bit. The second one is straightforward with std::popcount
:
size_t order_of_bit(uint64_t x, size_t k) {
return k ? std::popcount(x << (64 - k)) : 0;
}
The first one is less so, but it is also doable "quickly" with bmi2 intrinsic:
size_t kth_set_bit(uint64_t x, size_t k) {
return std::countr_zero(_pdep_u64(1ULL << k, x));
}
Here, _pdep_u64
takes a bitmask as the first argument and applies it to only set bits of the second argument. In other words, _pdep_u64(1 << k, x)
is 1 << t
, where $$$t$$$ is the $$$k$$$-th bit of $$$x$$$. Doing all this, and also adding fast io on top, I've managed to outperform the top-1 solution by 4 ms 😃
Note: You might want to do #pragma GCC target("bmi2,popcount")
for this to work properly.
Bonus
Do you like me hate coordinate compression? Well, you're in the right place! From now on, you can use
auto compress_coords(auto &coords) {
std::vector<int> original;
original.reserve(size(coords));
std::ranges::sort(coords);
int idx = -1, prev = -1;
for(auto &x: coords) {
if(x != prev) {
idx++;
prev = x;
original.push_back(x);
}
x.get() = idx;
}
return original;
}
The code above takes a range of reference_wrapper<int>
, then automatically assigns all concerned values to their order index in the sorted array, and returns the sorted vector of all distinct original values. This way, you no longer need to disrupt your main function with stupid things such as a[i] = lower_bound(begin(srt), end(srt), a[i]) - begin(srt)
, you just collect all your references in the array, and give the array to compress_coords
! See how simple it is:
vector<reference_wrapper<int>> coords;
for(auto &it: a) {
cin >> it;
coords.push_back(ref(it));
}
vector queries(q, pair{0, 0});
for(auto &[t, x]: queries) {
cin >> t >> x;
if(t != 2) {
coords.push_back(ref(x));
}
}
auto values = compress_coords(coords);
After that, you can treat a[i]
and q[i]
as if they are already compressed, and you can always put them as an index in values
if you want to recover their original value. Simple and universal, am I right?
P.S. All these are conveniently implemented and structured here in CP-Algorithms library, so you might check it out if you want.
P.P.S. Are there any more efficient yet simple structures for Ordered Set problem?
A pretty cool data structures which supports operations 1,2,5 and 6 is the van Embde Boas tree. The complexity per operation is actually $$$O( \log \log M)$$$ where $$$M$$$ is the largest element that can be stored inside the tree. I remember coding it and doing a few benchmarks, it was unfortunately slower than std:set due to having quite a large constant (and bad implementation :( ). It would be really interesting to see how far you can push it!
Yes, we have considered the use of vEB-trees, but yet it was not applicable to the new "Ordered Set" problem on yosupo due to its inability of maintaining $$$k$$$-th order statistics. :(((((
Also, it is important to note that vEB trees take $$$O(M)$$$ memory, and for all usages where such a memory usage is fine, there is a tough contender called the $$$64$$$-ary trie (basically the trie, but each intermediate node has $$$64$$$ children and one bitmask). The $$$64$$$-ary trie has $$$O(\log_{64} M)$$$ time complexity for each operation, which is not asymptotically better, but it still does dominate for all $$$M<10^8$$$.
Well vEB trees don't take up $$$O(M)$$$ memory if you implement them with hash tabels, although that does take quite a toll on the constant. If I recall correctly, this makes the complexity "with high probably" and not deterministic.
This is the implementation I was benchmarking on ints.
was it created by the same company that made Fenwick Tree
You can simulate a set using $$$O(\sqrt{n})$$$ deques, and it can be made faster than PBDS (for $$$n,q\leq5*10^5$$$) while also being very memory efficient.
Two very nice and interesting articles in the span of 3 hours. A lot of respect for this contribution!
I love bitset