If mod (%) is so expensive, why not write your own modulus?

4 years ago

If this always works and gives "quite a bit of speed up", why doesn't the C++ compiler just do that too?

Because you waste time on comparison and branching. Similarly, it isn't easy to say if sort() should first check if the sequence is already sorted and then finish in $$$O(n)$$$.


AnandOza

4 years ago

You are talking about the conditional if a >= b version that LanceTheDragonTrainer said sometimes works. I was asking about the assembly version that LTDT said always works.

4 years ago

+14

right, sorry


ffao

4 years ago

+19

The compiler has to produce code that works correctly for all possible int values; we don't. In this case in particular, I believe there are some odd corner cases if you allow the numbers to be negative.

Just tell the compiler that you are modding unsigned integers, and you get code that runs at around the same speed as (slightly faster than, even) Lance's assembly version: 88112311
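To illustrate the corner case with negative numbers: since C++11 the `%` operator on signed integers truncates toward zero, so a negative dividend gives a non-positive remainder, and the compiler has to preserve that behavior for `int`. The sketch below is only an illustration; `MOD` and the function names are made up for this example and are not taken from the linked submission.

```cpp
#include <cassert>
#include <cstdint>

constexpr int MOD = 1000000007;  // illustrative modulus, not from the thread

// Signed %: the result takes the sign of the dividend (C++11 and later),
// so the compiler must emit code that is also correct for negative a.
constexpr int mod_signed(int a) { return a % MOD; }

// Unsigned %: no negative case exists, which leaves the compiler free to
// use a simpler multiply-and-shift sequence for a constant modulus.
constexpr std::uint32_t mod_unsigned(std::uint32_t a) {
    return a % static_cast<std::uint32_t>(MOD);
}

// If negative inputs are possible, normalize the remainder into [0, MOD).
constexpr int mod_normalized(int a) {
    return a % MOD < 0 ? a % MOD + MOD : a % MOD;
}
```

If all values are known to be non-negative, storing them in unsigned types gives the compiler that information for free.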

4 years ago

-9

FYI Branches are much more expensive than integer division/modulo operators.

4 years ago

+17

Implementing addition of two values $$$a, b \in [0, P-1]$$$ as return a+b<P ? a+b : a+b-P; is actually faster than (a+b)%P.
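A minimal sketch of that idea, extended to subtraction. It assumes both inputs are already reduced to $$$[0, P-1]$$$ and that $$$P < 2^{31}$$$, so `a + b` cannot wrap around in 32 bits; the modulus value and the names `add_mod` and `sub_mod` are chosen for illustration only.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t P = 1000000007;  // example modulus, assumed < 2^31

// (a + b) mod P for a, b in [0, P-1]: the sum is below 2P,
// so at most one subtraction of P is needed.
inline std::uint32_t add_mod(std::uint32_t a, std::uint32_t b) {
    assert(a < P && b < P);
    const std::uint32_t s = a + b;
    return s < P ? s : s - P;
}

// (a - b) mod P for a, b in [0, P-1]: add P back when a < b
// so the intermediate value never goes below zero.
inline std::uint32_t sub_mod(std::uint32_t a, std::uint32_t b) {
    assert(a < P && b < P);
    return a >= b ? a - b : a + P - b;
}
```

Depending on the compiler and flags, the ternary may be compiled to a conditional move rather than an actual branch, so it is worth checking the generated assembly before drawing conclusions about branch cost.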

4 years ago

+13

Benchmarks:
14700593 CPU ticks with a+b<P ? ... : ...
11126168 CPU ticks with just (a+b) % P

Code

#include <bits/stdc++.h>

constexpr unsigned P = 1e8;

unsigned f(unsigned a, unsigned b)
{
    //return (a + b) % P;
    return a + b < P ? a + b : a + b - P;
}

int main()
{
    srandom(0);

    const auto t0 = clock();

    unsigned s = 0;
    for (size_t i = 0; i < 1000000000; ++i) {
        s += f(random() % P, 1 + random() % P);
    }

    const auto t1 = clock();

    std::cout << s << '\n';
    std::cout << (t1 - t0) << '\n';

    return 0;
}

4 years ago

+26

Your solution spends most of its time computing random() % P, and that includes generating the random value itself. Running your program multiple times gave me inconsistent results, but the x?y:z version was usually faster by a few percent.

The x?y:z version is more than twice as fast when the modular addition really is the bottleneck of a solution: https://ideone.com/8m0qWb (0.56 s vs. 1.46 s)

4 years ago

-15

Well, your solution spends most of its time on data access :)

Actually, it does not matter where most of the time is spent as long as that part is the same for both versions, because you can always subtract it from the total times and compare what remains.

BTW, I think we need another test.

4 years ago

-14

Kamil, unfortunately your test code cannot be used for benchmarking branch-misses (details under spoilers).

(a+b)%P

$ perf stat ./a.out 
979631356

 Performance counter stats for './a.out':

          1,231.31 msec task-clock                #    1.000 CPUs utilized          
                 2      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             1,091      page-faults               #    0.886 K/sec                  
     5,087,658,741      cycles                    #    4.132 GHz                    
     6,370,937,337      instructions              #    1.25  insn per cycle         
       471,001,344      branches                  #  382.520 M/sec                  
           204,400      branch-misses             #    0.04% of all branches

(a+b)<P

$ perf stat ./a.out 
979631356

 Performance counter stats for './a.out':

            333.17 msec task-clock                #    0.999 CPUs utilized          
                 5      context-switches          #    0.015 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             1,088      page-faults               #    0.003 M/sec                  
     1,502,169,022      cycles                    #    4.509 GHz                    
     3,675,878,460      instructions              #    2.45  insn per cycle         
       470,814,463      branches                  # 1413.130 M/sec                  
           164,599      branch-misses             #    0.03% of all branches


...._.._

4 years ago

Branches are not expensive if there is a pattern that the branch predictor can learn. When implementing addition, the result will not overflow most of the time, so a predictor that always outputs false would be good enough for you.

There are a lot of things at play, like speculative execution and other low-level CPU details. Modern CPUs are too complicated for a simple rule of thumb.


brave-kid

7 months ago

This doesn't always produce the correct result.


vkgainz

4 years ago

I don't know if this works, but it might help. It's a fast way to reduce a%b under some loose constraints (Barrett reduction).
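For readers without the link, here is a hedged sketch of one common 64-bit variant of Barrett reduction. It is not necessarily the version the link described. It precomputes $$$m = \lfloor (2^{64}-1)/b \rfloor$$$ once, then estimates the quotient with a 128-bit multiplication; the estimate is either exact or one too small, so a single conditional subtraction finishes the job. The struct name `BarrettMod` is hypothetical, and `unsigned __int128` is a GCC/Clang extension, which is an assumption about the compiler.

```cpp
#include <cassert>
#include <cstdint>

// Barrett-style reduction for a modulus that is fixed at run time.
struct BarrettMod {
    std::uint64_t b;  // the modulus, must be nonzero
    std::uint64_t m;  // floor((2^64 - 1) / b), computed once

    explicit BarrettMod(std::uint64_t mod) : b(mod), m(0) {
        assert(mod != 0);
        m = ~0ULL / mod;
    }

    // Returns a % b for any 64-bit a, without a division instruction.
    std::uint64_t reduce(std::uint64_t a) const {
        // High 64 bits of m * a approximate floor(a / b); the estimate
        // is either exact or too small by one.
        const std::uint64_t q = static_cast<std::uint64_t>(
            (static_cast<unsigned __int128>(m) * a) >> 64);
        const std::uint64_t r = a - q * b;  // r lies in [0, 2b)
        return r >= b ? r - b : r;
    }
};
```

A typical use would be `BarrettMod bm(1000000007); x = bm.reduce(x * y);` for `x, y` below the modulus. When the modulus is a compile-time constant, the compiler already applies a similar trick to plain `%`, so the gain is mostly for moduli known only at run time.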

4 years ago

So smart your link is!

4 years ago

+10

Slightly faster than the original %: 827 ms vs. 643 ms on my computer.

#include <bits/stdc++.h>
#define watch(x) std::cout << (#x) << " is " << (x) << std::endl
using LL = long long;
constexpr LL M  = 1e9 + 7;
constexpr int  k = std::__lg(M) + 2;
constexpr LL m = (1LL << k) / M;

const int N = 1e8 + 2;
LL fac[N];
void init1(){
	fac[0] = 1;
	for (int i = 1; i < N; ++i) fac[i] = fac[i - 1] * i % M;
}
void init2() {
	auto mod = [&](LL &a) {
		LL r = a - ((a * m) >> k) * M;
		if (r >= M) r -= M;
	};
	fac[0] = 1;
	for (int i = 1; i < N; ++i) mod(fac[i] = fac[i - 1] * i);
}
int main() {
	//freopen("in","r",stdin);
	std::ios::sync_with_stdio(false);
	std::cin.tie(nullptr);

	auto start1 = std::chrono::high_resolution_clock::now();
	init1();
	auto end1 = std::chrono::high_resolution_clock::now();
	std::cout << "Time used: " << std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count() << " (ms)" << std::endl;

	auto start2 = std::chrono::high_resolution_clock::now();
	init2();
	auto end2 = std::chrono::high_resolution_clock::now();
	std::cout << "Time used: " << std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count() << " (ms)" << std::endl;

	return 0;
}

4 years ago

It isn't actually faster... sorry


dmitry.dolgopolov

4 years ago

Since init1() and init2() generate different results, I don't think there is any point in measuring the runtime.
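As far as I can tell from the posted code, there are two reasons for the mismatch. The lambda computes `r` in a local variable and never writes it back to `a`, so `fac[i]` is stored unreduced. Also, with `k = std::__lg(M) + 2 = 31` the constant `m` is only 2, which makes the quotient estimate far too coarse for large inputs. The sketch below reproduces the same arithmetic in a function that returns its result (the name `posted_reduce` is made up here) so the second problem can be checked directly.

```cpp
#include <cassert>
#include <cstdint>

using LL = long long;
constexpr LL M = 1000000007;
constexpr int k = 31;             // std::__lg(M) + 2 for this M
constexpr LL m = (1LL << k) / M;  // evaluates to 2

// Same arithmetic as the lambda in init2(), but returning r
// instead of discarding it.
inline LL posted_reduce(LL a) {
    LL r = a - ((a * m) >> k) * M;
    if (r >= M) r -= M;
    return r;
}
```

For `a = 100 * M` the true remainder is 0, but the quotient estimate is 93 instead of 100, so `posted_reduce` returns `6 * M`. One conditional subtraction is not enough with such a small `k`.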
