Hello, Codeforces!
In late December and early January, I wrote a proof-of-concept of a separate service in C++, to move heavy data processing out of the Codeforces Java code into C++. I had not written C++ in a long time, and experienced a funny sense of immersion into another world.
I was surprised to find no open-addressing hash map in the C++ standard library, nor in Boost or other popular libraries. This is strange, because open addressing is often better than separate chaining in both time and memory, especially for small objects, which is exactly what I intend to store.
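To make the idea concrete, here is a minimal sketch of an open-addressing lookup with linear probing. The Slot layout, the power-of-two capacity, and the find_linear name are my illustrative assumptions, not code from the prototype:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative slot layout: key, value, and an occupancy flag.
template <typename Key, typename Value>
struct Slot {
    Key key;
    Value value;
    bool occupied = false;
};

// Look up a key with linear probing. Assumes slots.size() is a power of
// two and the table always keeps at least one empty slot (load factor < 1),
// so the probe loop terminates.
template <typename Key, typename Value>
const Value* find_linear(const std::vector<Slot<Key, Value>>& slots, const Key& key) {
    std::size_t mask = slots.size() - 1;
    std::size_t i = std::hash<Key>{}(key) & mask;  // home bucket
    while (slots[i].occupied) {
        if (slots[i].key == key)
            return &slots[i].value;
        i = (i + 1) & mask;  // the "+1" (linear) probe step
    }
    return nullptr;  // reached an empty slot: the key is absent
}
```

All probed slots are adjacent in memory, so a whole probe run typically fits in one or two cache lines; this locality is the usual argument for open addressing over chaining when the stored objects are small.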
I quickly sketched a prototype which shows that open addressing is in fact 2-3 times faster than the standard std::unordered_map. Here is the benchmark output on my laptop:
std::map takes 15779 ms
std::unordered_map takes 4698 ms
oaht::hash_map takes 1473 ms
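For context, numbers like these come from a bulk insert/lookup loop. Below is a minimal sketch of such a timing harness; the operation count and the key-scrambling constant are assumptions for illustration, not the actual benchmark.cpp from the repository:

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

// Times bulk inserts and lookups on any map-like container.
// The 10^7 operation count and key-scrambling constant are illustrative
// assumptions, not the exact benchmark.cpp from the repository.
template <typename Map>
void test(const std::string& name) {
    auto start = std::chrono::steady_clock::now();
    Map map;
    int64_t checksum = 0;
    for (int64_t i = 0; i < 10000000; i++) {
        // Scatter keys so the test is not a best case for any probing scheme.
        int64_t key = static_cast<int64_t>(
            static_cast<uint64_t>(i) * 0x9E3779B97F4A7C15ULL);
        map[key] = i;
        checksum += map[key];  // read back so the work cannot be optimized out
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << name << " takes " << ms << " ms (checksum " << checksum << ")\n";
}

int main() {
    test<std::unordered_map<int64_t, int64_t>>("std::unordered_map");
}
```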
As far as I know, there is no good implementation of such a container in STL style with support for C++11 (move semantics). Maybe someone will point one out, but I have not found any.
Here is my prototype on GitHub: https://github.com/MikeMirzayanov/open_addressing_hash_table. Unfortunately, I am not a top-notch C++ expert, and I have no time to implement it properly.
On the other hand, there have been many C++ discussions on Codeforces, and it seems there are many users here who understand modern C++. Besides, algorithms are the water and air of Codeforces! I hope that the collective mind of Codeforces and someone's skilled hands will help bring such a container to life. Of course, we want maximum compatibility with std::unordered_map (the API and semantics should be equivalent wherever possible). Here is a sample list of features I want to see (a sketch of the resulting template signature follows the list):
- One more template type parameter: class Pred = std::equal_to<Key>
- Member function at
- Member function clear
- Member function empty
- Member function insert
- Member function equal_range
- Member function swap
- Member function operator=
- Member functions load_factor, max_load_factor, rehash, reserve
- Constant members
- Member types: difference_type, hasher, key_type, mapped_type, reference, size_type, value_type, pointer, const_reference, key_equal
- Iterators: member types iterator, const_iterator; member functions begin, end, cbegin, cend
- Member function emplace
- Member function emplace_hint
- Relational operators
- Non-member function overload: swap
- Move semantics
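To make the target concrete, here is a rough sketch of the declaration such a container could have, based only on the list above. The member set shown is partial and this is not the actual oaht.h:

```cpp
#include <cstddef>
#include <functional>
#include <utility>

namespace oaht {

// Sketch of the desired interface, mirroring std::unordered_map.
// Declarations only; not the actual oaht.h from the repository.
template <class Key,
          class T,
          class Hash = std::hash<Key>,
          class Pred = std::equal_to<Key>>   // the extra template parameter
class hash_map {
public:
    using key_type        = Key;
    using mapped_type     = T;
    using value_type      = std::pair<const Key, T>;
    using hasher          = Hash;
    using key_equal       = Pred;
    using size_type       = std::size_t;
    using difference_type = std::ptrdiff_t;
    using reference       = value_type&;
    using const_reference = const value_type&;
    using pointer         = value_type*;

    T& at(const Key& key);
    void clear();
    bool empty() const;
    float load_factor() const;
    void max_load_factor(float z);
    void rehash(size_type n);
    void reserve(size_type n);
    void swap(hash_map& other);
    hash_map& operator=(hash_map&& other);  // move semantics
    // ... plus insert, equal_range, emplace, emplace_hint, iterators, etc.
};

} // namespace oaht
```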
In addition, I want it to keep working on popular C++11 compilers: g++, Visual Studio 2013+, and clang. Please try to follow the code style of oaht.h.
I will be happy to delegate the management of this project to someone who has enthusiasm, knowledge of C++, and proven experience with such problems. I am looking for a volunteer! A Codeforces t-shirt and +5 to your honor are guaranteed.
P.S. In the end we want a good implementation, so we would like to accept only correct and well-written code into the project.
Will we be able to use oaht::hash_map h; after #include "codeforces.h" on Codeforces in the future? It sounds very interesting :)

Have you tried SGI hash_map?
Unfortunately, I couldn't find the collision resolution method it uses, but I tried your benchmark.cpp on my laptop, and it shows better performance than std::unordered_map. So you might want to check it out. Header file: <ext/hash_map>. Namespace: __gnu_cxx.
Edit 1: It's not SGI's hash_map; I misinterpreted a comment in <ext/hash_map>.
Edit 2: After some digging into the header files, I found that it uses the closed addressing (separate chaining) method for collision resolution. You can find that in the header file <backward/hashtable.h>, in the function insert_equal_noresize(...).
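For contrast with open addressing, here is a minimal sketch of a separate-chaining insert, the method used in <backward/hashtable.h>. The Node layout and the insert_chained name are my illustration, not the libstdc++ code:

```cpp
#include <cstdint>
#include <functional>
#include <list>
#include <vector>

// Closed addressing (separate chaining): each bucket holds a linked list
// of all entries that hash to it, so every collision costs an extra
// pointer dereference. Illustrative sketch, not <backward/hashtable.h>.
struct Node {
    int64_t key;
    int64_t value;
};

// Assumes buckets is non-empty; rehashing on high load factor is omitted.
void insert_chained(std::vector<std::list<Node>>& buckets, int64_t key, int64_t value) {
    std::size_t i = std::hash<int64_t>{}(key) % buckets.size();
    for (Node& node : buckets[i])
        if (node.key == key) { node.value = value; return; }  // update in place
    buckets[i].push_back({key, value});                       // new chain node
}
```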
It seems it performs worse than the open-addressing hash map proof-of-concept (oaht::hash_map vs __gnu_cxx::hash_map).
Have you tried Google's dense hash map? Though it might not match your specs 100%, it is a lot faster than the STL containers and really nicely written.
https://code.google.com/p/sparsehash/
Looks promising, thanks. But I tried
test<GOOGLE_NAMESPACE::sparse_hash_map<int64_t,int64_t>>("GOOGLE_NAMESPACE::sparse_hash_map");
and it runs in ~2000 ms, while the open-addressing hash map proof-of-concept takes ~400 ms. Right now I can't use it with MinGW (it doesn't compile), so I'm using MS VS 2013.

I think dense_hash_map is more time-efficient, while sparse_hash_map is more space-efficient. Can you try dense_hash_map?
I checked, and the default dense_hash_map is about 2x slower on my machine.
I wonder how much of this difference is due to the fact that your proof of concept supports few operations, namely only inserts but not deletes.
Did you use the benchmark code from my GitHub repository? Then dense_hash_map wasn't tested on erases either. I do not see why supporting more functionality in oaht would reduce performance.
Also, in my case I will not use erases often.
How do I use dense_hash_map? I got: Assertion failed: settings.use_empty(), file google\sparsehash/internal/densehashtable.h, line 476.
Yes, I cloned your GitHub code.
I also had that error with dense_hash_map, but it is explained in the readme: you must call set_empty_key() before any inserts (the reserved empty key is how it marks unused buckets). So I created a new test with the empty key set, as sketched below.
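A minimal sketch of such a test, assuming the <sparsehash/dense_hash_map> header path (older releases shipped it as <google/dense_hash_map>):

```cpp
#include <cstdint>
#include <iostream>
#include <sparsehash/dense_hash_map>  // header path assumed; may be <google/dense_hash_map>

int main() {
    google::dense_hash_map<int64_t, int64_t> map;
    // dense_hash_map reserves one key value to mark empty buckets; it must
    // be set before any insert and must never be used as a real key.
    map.set_empty_key(-1);
    // Similarly required before any erase:
    // map.set_deleted_key(-2);
    for (int64_t i = 0; i < 1000; i++)
        map[i] = i * i;
    std::cout << map[10] << std::endl;  // prints 100
}
```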
Sorry for the multiple replies. I just noticed that dense_hash_map also uses open addressing, with quadratic probing for collision resolution.
So I think these efficiency differences are more about the particular data set or hash function used than about the hash map implementation.
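For reference, a common quadratic probing scheme uses triangular-number offsets, which visit every slot when the capacity is a power of two. This sketch of the probe index is my illustration, not necessarily the exact scheme dense_hash_map uses:

```cpp
#include <cstddef>

// k-th probe for quadratic probing with triangular offsets:
// i_k = (home + k*(k+1)/2) mod capacity, where capacity = mask + 1
// is a power of two. With this choice, the probe sequence is a
// permutation of all slots, so a lookup never cycles early.
std::size_t quadratic_probe(std::size_t home, std::size_t k, std::size_t mask) {
    return (home + k * (k + 1) / 2) & mask;
}
```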
I believe the +1 strategy (linear probing) could be better than quadratic probing or double hashing because of better CPU cache usage.
I think it depends on the data set. Linear probing has the risk of clumping: too many adjacent buckets becoming occupied.
If you feel this is unlikely to occur in the data you'll use, then it's probably better.
My feeling is that libraries like dense_hash_map are designed to minimize the probability of an adversarial input, which may not be the best solution for your case.
It depends on the load factor. Linear probing has a worse expected cluster length as a function of load factor. Because of hashing, it does not depend much on the input (except for anti-hash cases).
Currently, I'm looking into Walrus's implementation at https://code.google.com/p/morzhovets/source/browse. It looks like a good trade-off between time and memory, much better than std::unordered_map.
I remember that a long time ago I tried to implement my own OAHT, and it was also considerably faster than dense_hash_map on all benchmarks. I was using double hashing to resolve collisions (sketched below).
For some reason, in the case of hash maps, it seems that the more generic the code is, the less performant it becomes. Maybe it has something to do with working well against most data sets, or maybe it really is just a load factor issue, I don't know.
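A minimal sketch of a double-hashing probe sequence; the second hash and the prime table size are my illustrative assumptions, not the commenter's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Double hashing: the probe step depends on the key, so two keys that
// collide on the home bucket usually follow different probe sequences,
// which avoids the clustering of linear probing.
struct Entry {
    int64_t key;
    int64_t value;
    bool occupied = false;
};

// Assumes table.size() is prime, so a step in [1, n-1] cycles through
// every slot and the loop below inspects each slot at most once.
const int64_t* find_double_hashed(const std::vector<Entry>& table, int64_t key) {
    std::size_t n = table.size();
    std::size_t h1 = static_cast<std::size_t>(static_cast<uint64_t>(key) % n);
    std::size_t h2 = 1 + static_cast<std::size_t>(static_cast<uint64_t>(key) % (n - 1));
    for (std::size_t k = 0; k < n; k++) {
        std::size_t i = (h1 + k * h2) % n;
        if (!table[i].occupied)
            return nullptr;            // empty slot: the key is absent
        if (table[i].key == key)
            return &table[i].value;
    }
    return nullptr;                    // table full, key absent
}
```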
How do I use it on Codeforces? Which compiler should be chosen, and which file(s) should I #include? I got lots of CE...