Fenwick bitset - Codeforces

Okay guys, I know it sounds like Top 10 optimizations 2017- (collectors edition) but hear me out

Hi everyone!

Recently, the Ordered Set problem was added to Library Checker.

Essentially, the problem asks you to maintain a set $$$\{a_1,\dots,a_n\}$$$ that supports the following operations:

Insert an element;
Erase an element;
Find the $$$k$$$-th smallest element;
Find the order index of an element;
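Find the lower bound of an element (the smallest element of the set not less than it);
Find the pre-upper bound of an element (the largest element of the set not greater than it).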

Of course, all these can be done with GNU policy-based data structures (pbds) in $$$O(\log n)$$$ per query, which leads to the shortest code. At the same time, it has a huge constant factor in practice, so a natural question arises. What would be the fastest way to solve this? One of my earliest blogs already addressed this: we can use a Fenwick tree over the bitmask of the set and then do a jumping binary search to get the $$$k$$$-th element. I implemented this in two separate structures: fenwick and fenwick_set.

The first one is the actual implementation of a Fenwick tree, while the second is a wrapper with a set-like interface. The fenwick structure itself is standard and straightforward, including finding the lower bound of a prefix sum:

    void add(size_t x, T const& v) {
        for(++x; x <= n; x += x & -x) {
            data[x] += v;
        }
    }
    // sum of [0, r)
    T prefix_sum(size_t r) const {
        assert(r <= n);
        T res = 0;
        for(; r; r -= r & -r) {
            res += data[r];
        }
        return res;
    }
    // Last x s.t. k = prefix_sum(x) + r for r > 0
    // Assumes data[x] >= 0 for all x, returns [x, r]
    auto prefix_lower_bound(T k) const {
        int x = 0;
        for(size_t i = std::bit_floor(n); i; i /= 2) {
            if(x + i <= n && data[x + i] < k) {
                k -= data[x + i];
                x += i;
            }
        }
        return std::pair{x, k};
    }
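
To see the descent in prefix_lower_bound at work, here is a minimal self-contained sketch. The name CountFenwick and the fixed int value type are assumptions made for illustration, and the power-of-two loop is only a C++17 stand-in for std::bit_floor; this is not the library's actual class.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal Fenwick tree over non-negative counts (1-indexed inside).
struct CountFenwick {
    std::size_t n;
    std::vector<int> data;
    explicit CountFenwick(std::size_t n) : n(n), data(n + 1, 0) {}

    // Add v at position x (0-indexed).
    void add(std::size_t x, int v) {
        for (++x; x <= n; x += x & -x) data[x] += v;
    }
    // Largest x with (sum of [0, x)) < k, together with the remainder
    // k - (sum of [0, x)). Assumes all stored counts are non-negative.
    std::pair<std::size_t, int> prefix_lower_bound(int k) const {
        std::size_t top = 1;
        while (top * 2 <= n) top *= 2;  // largest power of two <= n
        std::size_t x = 0;
        for (std::size_t i = top; i; i /= 2) {
            // data[x + i] covers [x, x + i); take the jump only if the
            // whole range still lies before the k-th unit.
            if (x + i <= n && data[x + i] < k) {
                k -= data[x + i];
                x += i;
            }
        }
        return {x, k};
    }
};
```

With the set {1, 4, 6} stored as unit counts in a universe of size 8, prefix_lower_bound(2).first is 4: exactly one element lies in [0, 4), so position 4 holds the second smallest element.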

Then, the replies to the queries would look like this:

    void insert(size_t x) {
        if(present[x]) return;
        present[x] = 1;
        sz++;
        Base::add(x, 1);
    }
    void erase(size_t x) {
        if(!present[x]) return;
        present[x] = 0;
        sz--;
        Base::add(x, -1);
    }
    size_t order_of_key(size_t x) const {
        return Base::prefix_sum(x);
    }
    size_t find_by_order(size_t order) const {
        // prefix_lower_bound returns a pair [x, r]; the element itself is x
        return order < sz ? Base::prefix_lower_bound(order + 1).first : -1;
    }
    size_t lower_bound(size_t x) const {
        if(present[x]) {return x;}
        auto order = order_of_key(x);
        return order < sz ? find_by_order(order) : -1;
    }
    size_t pre_upper_bound(size_t x) const {
        if(present[x]) {return x;}
        auto order = order_of_key(x);
        return order ? find_by_order(order - 1) : -1;
    }

Here, we use an additional bit vector present to test whether an element is actually in the set. Submitting this solution, we find that it runs in 213 ms, while the top solution takes around 140 ms. How come? Isn't Fenwick so fast it's close to $$$O(1)$$$ in practice?

I brought this topic up in the Discord AC server, and chromate00 suggested quite a clever trick to enhance it further: just integrate the Fenwick tree more tightly with our favorite little friend, the bitset! What this actually means is that, since we represent a set, we can split its bitmask into blocks of 64 bits, each stored in a single uint64_t. The Fenwick tree is then built over the popcounts of these 64-bit blocks and answers queries up to the granularity of a block, while the remaining part inside a block is handled by bit magic.

In other words, this allows us to reduce memory consumption of the Fenwick tree itself from $$$n$$$ to $$$\frac{n}{64}$$$ (should be slightly better for cache too), and correspondingly query performance from $$$\log n$$$ to $$$\log \frac{n}{64}$$$. Note that for this, we also need to quickly find the $$$k$$$-th set bit in a 64-bit integer, and also find the number of set bits before the $$$k$$$-th bit. The second one is straightforward with std::popcount:

    size_t order_of_bit(uint64_t x, size_t k) {
        return k ? std::popcount(x << (64 - k)) : 0;
    }

The first one is less so, but it is also doable "quickly" with bmi2 intrinsic:

    size_t kth_set_bit(uint64_t x, size_t k) {
        return std::countr_zero(_pdep_u64(1ULL << k, x));
    }

Here, _pdep_u64 treats its first argument as source bits and deposits them, lowest first, into the positions of the set bits of its second argument (the mask). In other words, _pdep_u64(1ULL << k, x) is 1ULL << t, where $$$t$$$ is the position of the $$$k$$$-th (0-indexed) set bit of $$$x$$$. Doing all this, and also adding fast io on top, I've managed to outperform the top-1 solution by 4 ms 😃
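
If the semantics of the deposit operation feel opaque, a bit-by-bit software model may help. The sketch below is not the intrinsic: pdep_soft and kth_set_bit_soft are hypothetical names, and the GCC/Clang builtins __builtin_ctzll and __builtin_popcountll stand in for std::countr_zero and std::popcount so that it also compiles before C++20.

```cpp
#include <cassert>
#include <cstdint>

// Software model of PDEP: scatter the low bits of src, lowest first,
// into the positions of the set bits of mask.
std::uint64_t pdep_soft(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t res = 0;
    for (std::uint64_t bit = 1; mask; bit <<= 1) {
        std::uint64_t lowest = mask & (~mask + 1);  // lowest set bit of mask
        if (src & bit) res |= lowest;
        mask &= mask - 1;                           // drop that bit
    }
    return res;
}

// Position of the k-th (0-indexed, k < 64) set bit of x,
// or 64 if x has at most k set bits.
int kth_set_bit_soft(std::uint64_t x, int k) {
    std::uint64_t one_hot = pdep_soft(1ULL << k, x);
    return one_hot ? __builtin_ctzll(one_hot) : 64;
}

// Number of set bits of x strictly below position k (0 <= k <= 64).
int order_of_bit(std::uint64_t x, int k) {
    return k ? __builtin_popcountll(x << (64 - k)) : 0;
}
```

For x = 0xB4 (bits 2, 4, 5 and 7 set), kth_set_bit_soft(x, 2) is 5 and order_of_bit(x, 5) is 2, so the two functions act as inverses of each other on the set bits.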

Note: You might want to do #pragma GCC target("bmi2,popcnt") and include <immintrin.h> for this to work properly.
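
Putting the pieces together, the whole idea might be sketched roughly as follows. This is an illustration under assumptions rather than the library code: BlockFenwickSet is a hypothetical name, the k-th bit inside a word is found by a portable loop instead of _pdep_u64, and the GCC/Clang builtins replace std::popcount and std::countr_zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Set over [0, n): one bit per element, Fenwick tree over per-word popcounts.
struct BlockFenwickSet {
    std::size_t blocks, sz = 0;
    std::vector<std::uint64_t> bits;  // bits[b] holds elements 64*b .. 64*b + 63
    std::vector<int> fen;             // 1-indexed Fenwick over popcount(bits[b])

    explicit BlockFenwickSet(std::size_t n)
        : blocks((n + 63) / 64), bits(blocks, 0), fen(blocks + 1, 0) {}

    void add_block(std::size_t b, int v) {
        for (++b; b <= blocks; b += b & -b) fen[b] += v;
    }
    bool contains(std::size_t x) const { return (bits[x / 64] >> (x % 64)) & 1; }
    void insert(std::size_t x) {
        if (contains(x)) return;
        bits[x / 64] |= 1ULL << (x % 64);
        ++sz;
        add_block(x / 64, 1);
    }
    void erase(std::size_t x) {
        if (!contains(x)) return;
        bits[x / 64] &= ~(1ULL << (x % 64));
        --sz;
        add_block(x / 64, -1);
    }
    // Number of elements strictly less than x (0 <= x <= n).
    std::size_t order_of_key(std::size_t x) const {
        std::size_t res = 0;
        for (std::size_t b = x / 64; b; b -= b & -b) res += fen[b];
        std::size_t k = x % 64;  // bits below position k in the last word
        if (k) res += __builtin_popcountll(bits[x / 64] << (64 - k));
        return res;
    }
    // k-th smallest element (0-indexed), or size_t(-1) if k >= size.
    std::size_t find_by_order(std::size_t k) const {
        if (k >= sz) return static_cast<std::size_t>(-1);
        int need = static_cast<int>(k) + 1;
        std::size_t b = 0, top = 1;
        while (top * 2 <= blocks) top *= 2;
        for (std::size_t i = top; i; i /= 2) {  // Fenwick descent over blocks
            if (b + i <= blocks && fen[b + i] < need) {
                need -= fen[b + i];
                b += i;
            }
        }
        std::uint64_t w = bits[b];
        for (int i = 1; i < need; ++i) w &= w - 1;  // clear need-1 lowest set bits
        return 64 * b + __builtin_ctzll(w);
    }
};
```

The inner loop that clears bits costs up to 63 steps per query, which is exactly the part the _pdep_u64 version replaces with a couple of instructions.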

Bonus

Do you, like me, hate coordinate compression? Well, you're in the right place! From now on, you can use

    auto compress_coords(auto &coords) {
        std::vector<int> original;
        original.reserve(size(coords));
        std::ranges::sort(coords);
        int idx = -1, prev = -1;
        for(auto &x: coords) {
            // idx < 0 guards the first element, so a smallest value of -1 is handled too
            if(idx < 0 || x != prev) {
                idx++;
                prev = x;
                original.push_back(x);
            }
            x.get() = idx;
        }
        return original;
    }

The code above takes a range of reference_wrapper<int>, then automatically assigns all concerned values to their order index in the sorted array, and returns the sorted vector of all distinct original values. This way, you no longer need to disrupt your main function with stupid things such as a[i] = lower_bound(begin(srt), end(srt), a[i]) - begin(srt), you just collect all your references in the array, and give the array to compress_coords! See how simple it is:

    vector<reference_wrapper<int>> coords;
    for(auto &it: a) {
        cin >> it;
        coords.push_back(ref(it));
    }
    vector queries(q, pair{0, 0});
    for(auto &[t, x]: queries) {
        cin >> t >> x;
        if(t != 2) {
            coords.push_back(ref(x));
        }
    }
    auto values = compress_coords(coords);

After that, you can treat a[i] and q[i] as if they are already compressed, and you can always put them as an index in values if you want to recover their original value. Simple and universal, am I right?
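
For a concrete feel of what happens to the referenced variables, here is a small self-contained variant. It is a sketch, not the library function: compress_coords_demo is a hypothetical name, and it uses std::sort with an explicit comparator instead of std::ranges::sort so that it compiles as C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Replaces every referenced int by the rank of its value among the distinct
// values and returns the sorted distinct original values.
std::vector<int> compress_coords_demo(std::vector<std::reference_wrapper<int>>& coords) {
    // Sorting reorders (rebinds) the wrappers; the referenced ints stay in place.
    std::sort(coords.begin(), coords.end(), [](int a, int b) { return a < b; });
    std::vector<int> original;
    original.reserve(coords.size());
    for (auto& x : coords) {
        if (original.empty() || x.get() != original.back()) {
            original.push_back(x.get());
        }
        x.get() = static_cast<int>(original.size()) - 1;  // write the rank back
    }
    return original;
}
```

Given a = {50, 10, 50, 30} and one extra variable x = 30 collected into coords, the call returns {10, 30, 50} and leaves a = {2, 0, 2, 1} and x = 1, so values[a[0]] recovers 50. Each variable should be collected only once, since a value that has already been overwritten by its rank would be compressed a second time.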

P.S. All these are conveniently implemented and structured here in CP-Algorithms library, so you might check it out if you want.

P.P.S. Are there any more efficient yet simple structures for Ordered Set problem?

    vector<int> scale(vector<int*> to_scale) {
        vector<int> original;
        sort(to_scale.begin(), to_scale.end(), [](int* a, int* b) { return *a < *b; });
        int curr_val = (-1), pr = INF;
        for(int i = 0; i < int(to_scale.size()); i++) {
            if(*to_scale[i] != pr) {
                pr = *to_scale[i];
                original.pb(pr);
                ++curr_val;
            }
            *to_scale[i] = curr_val;
        }
        return original;
    }

Comments (15)

ppavic

3 months ago, # |

+14

A pretty cool data structure which supports operations 1, 2, 5 and 6 is the van Emde Boas tree. The complexity per operation is actually $$$O(\log \log M)$$$, where $$$M$$$ is the largest element that can be stored inside the tree. I remember coding it and doing a few benchmarks; it was unfortunately slower than std::set due to having quite a large constant (and bad implementation :( ). It would be really interesting to see how far you can push it!

chromate00

3 months ago, # ^ |

Yes, we have considered the use of vEB trees, but they were not applicable to the new "Ordered Set" problem on yosupo because they cannot maintain $$$k$$$-th order statistics. :(((((

Also, it is important to note that vEB trees take $$$O(M)$$$ memory, and for all usages where such a memory usage is fine, there is a tough contender called the $$$64$$$-ary trie (basically the trie, but each intermediate node has $$$64$$$ children and one bitmask). The $$$64$$$-ary trie has $$$O(\log_{64} M)$$$ time complexity for each operation, which is not asymptotically better, but it still does dominate for all $$$M<10^8$$$.

Well, vEB trees don't take up $$$O(M)$$$ memory if you implement them with hash tables, although that does take quite a toll on the constant. If I recall correctly, this makes the complexity hold "with high probability" rather than deterministically.

This is the implementation I was benchmarking on ints.

bestial-42-centroids

I'm not sure if it's what you meant, but Y-fast tries can improve the memory complexity to $$$O(\text{#stored values})$$$. Most reasonable implementations would rely on cuckoo hashing though.

123gjweq2

was it created by the same company that made Fenwick Tree

yoshi_likes_e5

+13

You can simulate a set using $$$O(\sqrt{n})$$$ deques, and it can be made faster than PBDS (for $$$n, q \leq 5 \cdot 10^5$$$) while also being very memory efficient.

Update: the worst case is now faster than PBDS, it also was much faster than the previous code. On average it's now 2x faster than PBDS.

Even for $$$n$$$ up to $$$10^7$$$, the performance can be nearly equal to std::set.

19.30

Impressive. I got a runtime of 416 ms with my AVL tree and a memory usage of 32 MB. Here it is: https://judge.yosupo.jp/submission/249418

Wielomian

Two very nice and interesting articles in the span of 3 hours. A lot of respect for this contribution!

bitset

+25

I love bitset

Could you please provide the full code for this DS?

adamant

You can click on "bundled" here.

tgp07

wow

Mukundan314

Using a similar technique of bitset leaves, but on a wide segment tree, is actually slightly faster: https://judge.yosupo.jp/submission/254249. This still does a naive $$$\mathcal{O}(n \log n)$$$ initialization, and I haven't tuned the constants yet, so there is probably room for improvement.

hajder

2 months ago, # |

Instead of using reference wrappers, here is what I've been using for a long time: (pseudocode written here, not guaranteed to compile (; )

And then:

vector<int*> to_scale;
for(int i = 0; i < n; i++) { to_scale.pb(&some_data[i]); }
scale(move(to_scale));
