I have a tuple T, consisting of
- An integer id in range [1, 70 × 10^6]
- A double score1 in range [0.0, 1.0]
- A double score2 in range [0.0, 1.0]
- An integer x in range [0, 2]
What is the most efficient way to store such a tuple T = <id, score1, score2, x>, given that only 5 digits after the decimal point are needed for score1 and score2?
Currently, I am storing this tuple as a long, which takes 8 bytes of memory: 27 bits for id, 17 bits for score1, 17 bits for score2 (enough for values with 5 decimal digits), and 2 bits for x, 63 bits overall.
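In code, the packing looks roughly like this (a sketch, not my exact code; the field widths follow the ranges above):

#include <cstdint>
#include <cmath>

// Pack <id, score1, score2, x> into one 64-bit word:
// 27 bits id | 17 bits score1 | 17 bits score2 | 2 bits x = 63 bits.
uint64_t pack(uint32_t id, double s1, double s2, uint32_t x) {
    uint64_t a = id;                                   // 1 .. 70e6 fits in 27 bits
    uint64_t b = (uint64_t)std::llround(s1 * 100000);  // 0 .. 100000 fits in 17 bits
    uint64_t c = (uint64_t)std::llround(s2 * 100000);
    return (a << 36) | (b << 19) | (c << 2) | x;
}

void unpack(uint64_t t, uint32_t &id, double &s1, double &s2, uint32_t &x) {
    id = (uint32_t)(t >> 36);
    s1 = ((t >> 19) & 0x1FFFF) / 100000.0;             // 0x1FFFF = 17-bit mask
    s2 = ((t >> 2)  & 0x1FFFF) / 100000.0;
    x  = (uint32_t)(t & 3);
}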
I'm trying to optimize the memory usage, as there are millions of such tuples in an array. Also, if your approach can store more than 5 decimal places, please mention it.
Is there a way to store the tuple T using float or some other data type, given that if the tuple is encoded into a float, I should be able to recover id and x exactly, and score1 and score2 without much error?
Thanks for your time, have a nice day.
This might not exactly answer your question, but you should look into bit fields.
https://en.wikipedia.org/wiki/Bit_field#C_programming_language
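For example, a minimal sketch with the widths from the question (exact layout and padding are implementation-defined, but common compilers pack this into a single 8-byte word):

#include <cstdint>

struct Tuple {
    uint64_t id     : 27;  // 1 .. 70e6
    uint64_t score1 : 17;  // score scaled by 1e5, 0 .. 100000
    uint64_t score2 : 17;
    uint64_t x      : 2;   // 0 .. 2
};
// On typical compilers, sizeof(Tuple) == 8.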
24 bits aren't enough for 70·10^6; one of those numbers (or your code) is wrong.
You have at least 70·10^6 · 10^5 · 10^5 · 3 = 21·10^17 possible tuples, which means more than 60 bits per tuple are necessary and 61 are sufficient. The obvious way is to interpret the floats as ints < 10^5 and do id·(10^5·10^5·3) + score1·(10^5·3) + score2·3 + x.
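A rough sketch of that encoding (I use 100001 as the factor rather than 10^5 so that the scaled value 100000, i.e. a score of exactly 1.0, also fits):

#include <cstdint>
#include <cmath>

const uint64_t S = 100001;                     // number of distinct scaled scores 0..100000

uint64_t encode(uint64_t id, double s1, double s2, uint64_t x) {
    uint64_t a = std::llround(s1 * 100000);    // scores as ints
    uint64_t b = std::llround(s2 * 100000);
    return ((id * S + a) * S + b) * 3 + x;     // < 2.2e18, fits in 61 bits
}

void decode(uint64_t t, uint64_t &id, double &s1, double &s2, uint64_t &x) {
    x = t % 3;                t /= 3;
    s2 = (t % S) / 100000.0;  t /= S;
    s1 = (t % S) / 100000.0;  id = t / S;
}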
There's sorcery like run-length encoding on e.g. blocks of 1000 tuples, but at that point the upper bound may be worse. Anyway, you're not going to get a very high compression ratio. Also, keep in mind that the more complex the compression method, the slower it is. That's especially a problem if you're storing a lot of things as floats: if the data is this large, the time limit is going to be pretty tight too, so it's better to work with the floats as ints, not the other way around, as long as it doesn't cause a precision loss.
As for more than 5 decimal digits: if you have 61 bits per tuple, you can add 3 more without any cost and store score1 and score2 as ints up to approx. 300000 each. That's something at least. The other way around, you can concatenate the tuples' bitwise representations to actually get 61 bits per tuple, not 64 with padding.
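A sketch of the concatenation idea (the caller keeps track of how many 61-bit values were appended):

#include <cstdint>
#include <vector>

struct BitStream {
    static const int WIDTH = 61;              // bits per encoded tuple
    std::vector<uint64_t> words;
    size_t bits = 0;                          // bits written so far

    void append(uint64_t v) {                 // v must fit in WIDTH bits
        size_t word = bits / 64, off = bits % 64;
        if (words.size() <= word + 1) words.resize(word + 2, 0);
        words[word] |= v << off;
        if (off + WIDTH > 64)                 // value crosses into the next word
            words[word + 1] |= v >> (64 - off);
        bits += WIDTH;
    }

    uint64_t get(size_t i) const {            // read the i-th encoded tuple
        size_t pos = i * WIDTH, word = pos / 64, off = pos % 64;
        uint64_t v = words[word] >> off;
        if (off + WIDTH > 64)
            v |= words[word + 1] << (64 - off);
        return v & ((uint64_t(1) << WIDTH) - 1);
    }
};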
"if you have 61 bits per tuple, you can add 3 more without any cost and you have score12 up to approx. 300000. That's something at least."
Can you explain how we can store values up to 300000?
I meant that currently, you can store them as ints up to 100000 (multiply by 100000 and round). You could store them as ints up to approx. 300000 in the same way, just replace the number 100000 in each formula by 300000. It's basically storing ~0.4 more decimal digits.
It wouldn't help much, since I only need up to 5 decimal digits of precision. Is there a way to store the tuple in a float, given that id and x need to be exact, while the scores can have some error, let's say up to 10^-4?
Not without further precision loss or some bizarro compression magic. Think about it: floating point numbers are good for dealing with arithmetic over a wide range when you don't mind small precision errors; applying them to a completely different task will likely make your life more difficult, not less. It's like trying to hammer a nail in with a screwdriver when you have a perfectly working hammer.
I agree with Xellos that 64 bits is a good choice for the data-tuple size. The following is another alternative for representing 64-bit data tuples and their corresponding scores using C++ bit-field and union features.
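(The code isn't reproduced here; presumably it looked something like the sketch below. Note that reading a union member other than the one last written is technically type punning in C++, though it works on common compilers.)

#include <cstdint>

union TupleRep {
    uint64_t raw;              // view as one integer: hashing, sorting, serialization
    struct {
        uint64_t x      : 2;
        uint64_t score2 : 17;  // scores scaled by 1e5
        uint64_t score1 : 17;
        uint64_t id     : 27;
    } f;                       // view as named fields
};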
Chances are that 2 bits for variable x are completely unnecessary. Since there are only a few options, you could consider dropping it in favor of polymorphism.
Something like this (these aren't "legit" int/doubles, but you get the idea):
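(The original snippet isn't shown; a guess at its shape: encode x in the type instead of storing it, and keep one container per type.)

struct TupleBase {             // "not legit" field widths, as said above
    int id;
    double score1, score2;
};
struct TupleX0 : TupleBase {}; // x == 0
struct TupleX1 : TupleBase {}; // x == 1
struct TupleX2 : TupleBase {}; // x == 2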
Although this has some cons, sometimes it's significantly more transparent in usage (usually in cases where x is some sort of enum). Other than that, there isn't much you can do here, since you'll need no less than 60 bits to store your first three fields. You could consider memory compression, but it's very unlikely to lead to a significant improvement in memory usage.
One thing to note: if you are storing many tuples and the ids are unique, you could consider dropping id in favor of something like the following:
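(Again, the original snippet isn't shown; my own sketch of the idea, assuming the ids are dense.)

#include <vector>

struct Payload {                        // everything except id
    float score1, score2;               // or the scaled 17-bit ints from above
    unsigned char x;
};

std::vector<Payload> data(70000000);    // data[i] belongs to id i (or i + 1)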
Then calculate the id from the tuple's position in the array. But yeah, all of that highly depends on usage, and some information about your task would be useful.
The task is not from an algorithmic contest. There are around 800 million such tuples, and storing them in memory takes around 6 GB of RAM. That's why I'm trying to optimize the memory usage: I am trying to fit each tuple in less than 8 bytes.
Considering that 800 million is significantly higher than 70 × 10^6, you should really consider something along the lines of map<id, vector<Tuple>>, because 50-ish% of your tuple's memory is wasted on id, which could be avoided by creating such a map. I'd also argue for an array; it would look even better than a map, in my opinion, if you know in advance that the number of tuples is going to be huge.
Nice idea! However, not map, but vector or a plain C array, since complex structures take more memory. A map is good in the opposite situation, when the number of ids that actually appear in the data is much smaller than the range of ids.
Just wanted to clarify: map is not really necessary here, but vector might be fine. If you call vector::shrink_to_fit, it should be quite fine, letting you avoid the huge pain of manual memory allocations.
The following is a suggested implementation of the map-based representation, in which the id and x fields of each data tuple are combined into a 32-bit integer key, and the associated (score1, score2) pair is combined into a 32-bit integer score value. All the 32-bit integer score values associated with the same key are stored in the same entry of the map.
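(The implementation itself isn't reproduced above; a guess at what it looked like, with each score scaled to 16 bits, i.e. a resolution of about 1/65535, slightly less than 5 decimal digits.)

#include <cstdint>
#include <cmath>
#include <map>
#include <vector>

uint32_t makeKey(uint32_t id, uint32_t x) {        // id < 2^30, x < 4
    return (id << 2) | x;
}

uint32_t makeScore(double s1, double s2) {         // two 16-bit scaled scores
    uint32_t a = (uint32_t)std::lround(s1 * 65535);
    uint32_t b = (uint32_t)std::lround(s2 * 65535);
    return (a << 16) | b;
}

std::map<uint32_t, std::vector<uint32_t>> table;   // key -> all scores for that (id, x)

void insertTuple(uint32_t id, double s1, double s2, uint32_t x) {
    table[makeKey(id, x)].push_back(makeScore(s1, s2));
}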