I see this article and at the bottom, it says that modulus operators are expensive so they implemented a slightly faster version of Euclidean Algorithm. Why not make a more efficient mod?
int mod(int a, int b) { // computes a % b;
return (a - b * (a / b));
}
afaik / and % are expensive compared to + and *
Cause / is approximately as expensive as %.
Actually on x86 division instruction DIV (or IDIV for signed integers) computes both quotient and reminder, just stores them in different registers. So obviously your "more efficient" version can't work faster than a % b;
I liked your blog! As explained in the comments, apparently this would not be faster, but I agree with you that mod operator are famously said to be the ones that are slow, I never stopped to think about the division operator, that should be as slow too! I’m sorry that your blog got downvoted. It made me have a better understanding of predicting the run time of a code
Your welcome. :D
Intel wants to know your location.......
On a serious note, it will be helpful to see not them as O(1) operations on 32 bit but asymptotic complexity on variable number of bits. Then you'll appreciate that modulus isn't much of a different problem than division.
You can do a binary lifting/binary search mod operation. I really don’t know whether it’s faster or not.
https://godbolt.org/z/7W35Me
You could get a speed up sometimes by doing this:
a = a >= b ? a % b : a;
The more versatile option that always works is, write assembly instructions to perform the modulus operation. It gives quite a bit of speed up.
If this always works and gives "quite a bit of speed up", why doesn't the C++ compiler just do that too?
Not sure. But the key is to always do a benchmark. We cannot trust compilers to do magic all the time can we? Anyway, I have personally tried using the assembly trick before and it worked pretty well.
Can you share your results? Would be nice to see the methodology and the final numbers to understand better.
bruh your shirt is orange again
When will your shirt become red?
I don't paint my shirts
Disclaimer: Please try not to bash me for the sample size. I don't have time to record data for different problems (but I have tried the trick on several other problems before). Hence, I have only presented one below. You may perform your own benchmarks too. Lastly, the assembly code does not belong to me, I shamelessly peeled it off Kaist's online ICPC team notebook).
Firstly, here is a sample problem from CF.
To give some context, I read about a failing submission for this problem from Petr's blog due to a large number of modulo operations. In particular, this sentence:
maroonrk's C failed due to trying to squeeze $$$10^8$$$ modulo operations in the time limit
The optimization(s)
I only modified this line of code in the main function (which was identified by Petr to be causing the TLE):
Note that I have used 2 optimizations (I call them optimization 1 and optimization 2 below):
Optimization 1 refers to writing modulo in the form a = a >= b ? a % b : a;
Optimization 2 refers to writing modulo in assembly code.
The benchmarks
The original code which uses modulo operation naively. Result: TLE on test 49 (>1000ms).
Using optimization 1 with C++11: Code. Result: TLE on test 51 (>1000ms)
Using optimization 1 + 2 with C++11: Code. Result: TLE on test 51 (>1000ms)
Using optimization 1 with C++17 (32-bit): Code. Result: TLE on test 51 (>1000ms)
Using optimization 1 + 2 with C++17 (32-bit): Code. Result: AC (982ms). Due to the closeness of the runtime to the time limit, I submitted twice to be sure. Both submissions yielded the same runtime.
Using optimization 1 with C++17 (64-bit): Code. Result: TLE on test 51 (>1000ms)
Using optimization 1 + 2 with C++17 (64-bit): Code. Result: AC (545ms)
Oh, yes. Here is the modulo subroutine in case you are interested to test it out yourself:
Because you waste time on comparison and branching. Similarly, it isn't easy to say if
sort()
should first check if the sequence is already sorted and then finish in $$$O(n)$$$.You are talking about the conditional
if a >= b
version that LanceTheDragonTrainer said sometimes works. I was asking about the assembly version that LTDT said always works.right, sorry
The compiler has to make sure to produce code that works correctly for all possible int values, we don't. In particular for this case I believe there are some odd corner cases if you allow numbers to be negative.
Just tell the compiler that you are modding unsigned integers, and you get a code that runs at around the same speed (slightly faster, even) than Lance's assembly version: 88112311
FYI Branches are much more expensive than integer division/modulo operators.
Implementing addition of two values $$$a, b \in [0, P-1]$$$ as
return a+b<P ? a+b : a+b-P;
is actually faster than(a+b)%P
.Benchmarks:
14700593 CPU ticks with a+b<P ? ... : ...
11126168 CPU ticks with just (a+b) % P
Your solution spends most time on computing
random() % P
and that includes computing that random value. Running your program multiple times gave me inconsistent results but the x?y:z version was faster by a few percents usually.The x?y:z version is more than twice faster if it's really a bottleneck of a solution https://ideone.com/8m0qWb (0.56s vs. 1.46s)
Well, your solution spends most time on data access :)
Actually it does not matter where most time is spent if it is the same for both versions because you always can subtract it from total times and compare the rests.
BTW, I think we need another test.
Kamil, unfortunately your test code cannot be used for benchmarking branch-misses (details under spoilers).
Branches are not expensive if there's a pattern that branch predictor can learn. While implementing addition most of the time the result will not overflow so a predictor which outputs false would be good enough for you.
There are lot of things at play like speculative execution and other low level CPU stuff. Modern CPU are quite complicated to make a rule of thumb.
this doesn't always produce correct result
I don't know if this works, but it might help. It's a fast way to reduce a%b under some loose constraints (Barret Reduction).
So smart your link is!
slightly faster than origin
%
: 827ms vs 643ms in my computerWon't faster... sorry
Since init1() and init2() generate different results, there is no sense to measure runtime I think.
Thanks ~
Barrett Reduction
Well, I see a quite different scenario with python.
I used IPython's
%timeit
to calculate the time difference and foundmod(a, b)
to be more expensive than%
.Can you check the runtime of this
BTW, this was proposed by LTDT
This is usually quite unhelpful because most of the time in the worst case mods are required at every step, and if you are at the point where this is the difference between AC and TLE, then it is probably better to remove unnecessary mods (i. e. just mod once after several additions, rather than after each one).
This also has overhead caused by the potential branch (which, in general, should probably always be used). The article linked by sys. provides a small speedup, but for most cases is probably not necessary.