https://en.algorithmica.org/hpc/data-structures/segment-trees/
I wrote a SIMD-friendly segment tree ("wide segment tree") that is up to 10x faster than the Fenwick tree and the bottom-up segment tree:
The article also explains in detail how all the other popular segment tree implementations work and how to optimize them (fun fact: the Fenwick tree can be made 3x faster on large arrays if you insert "holes" in the array layout and make it ~0.1% larger). The article is long but hopefully accessible to beginners.
I only focused on the prefix sum case (partially to make the comparison with the Fenwick tree fair). While I have some ideas on generalizing it to more complex operations, and I will probably add I've added a separate section on implementing other operations, I highly encourage the community to try to efficiently implement mass assignment, RMQ, lazy propagation, persistency, and other common segment tree operations using wide segment trees.
In terms of generalizations, atcoder's interface is very flexible and popular: https://atcoder.github.io/ac-library/master/document_en/segtree.html It would be great if you can make your wide segment tree a drop in replacement targeting it.
In particular, it doesn't assume the operation is commutative, so it must accumulate front and back in the right order: https://github.com/atcoder/ac-library/blob/master/atcoder/segtree.hpp#L45 I think this ruins some of the optimizations you made but there should still be benefits to using the higher branching factor?
I don't think you can make an entirely drop-in replacement of the atcoder's segment tree as it requires users to write their own (scalar) reducers, while most wide segment tree operations need to be implemented with SIMD intrinsics.
That said, maybe it is possible to make a partial specialization on
std::plus
,std::bit_xor
,std::multiplies
,std::min
, and other common binary operations, but the library users would have to actually pass them instead of writing their own code.P. S. Added a section with ideas on how to implement RMQ, general reductions, and lazy propagation.
I agree. However, I just wanted to point out that such an approach might be simplified by doing what eve does (not the platform-agnostic parts, but the abstraction parts). For instance, templating based on block sizes (different block sizes would be needed for different int types, for instance) would be nice.
just wanted to say, your book is amazing, I've not been able to read much, but whatever few chapters I read were really orz
orz
Never thought that fenwick tree is slower than bottom-up segment tree. Thank you for your work!
May I ask that are you a researcher and focus on the field of data structure and computation complexity, or general computer science?