platelet's blog

By platelet, 17 months ago, In English

1842A - Tenzing and Tsondu

Tutorial
Code

1842B - Tenzing and Books

Tutorial
Alternative Solution
Code
Alternative Code

1842C - Tenzing and Balls

Tutorial
Code

1842D - Tenzing and His Animal Friends

Tutorial
Code

1842E - Tenzing and Triangle

Tutorial
Alternative Solution
Code
Alternative Code O(n alpha n)
Alternative Code O(n)

1842F - Tenzing and Tree

Tutorial
Code

1842G - Tenzing and Random Operations

Tutorial
Code

1842H - Tenzing and Random Real Numbers

Hint 1
Hint 2
Tutorial
Code

1842I - Tenzing and Necklace

Hint 1
Hint 2
Tutorial
Code

Full text and comments »

  • Vote: I like it
  • +573
  • Vote: I do not like it

By platelet, 17 months ago, In English

Note the unusual start time of the round.

Hello, Codeforces!

Now that Gaokao is over, we are very glad to invite you to participate in CodeTON Round 5 (Div. 1 + Div. 2, Rated, Prizes!), which will start at Jun/24/2023 17:05 (Moscow time). You will be given 9 problems and 3 hours to solve them. The round will be rated for everyone.

All problems are written and prepared by Gary2005, Asuka, Crying, sjcsjcsjc, MonkeyKing, DerekFeng, KbltQaQ, ShmilyTY and me.

Statements and editorials will be available in Chinese (Simplified) after the contest.

We would like to give our sincere thanks to:

The main character of the problems will be Tenzing Tsondu.

We hope that everyone can enjoy the round! As this round is sponsored, everyone will have an opportunity to win some prizes!

Good luck!

UPD1: Here is the score distribution:

250 — 500 — 1000 — 1500 — 2000 — 2500 — 3000 — 3750 — 5000

UPD2: Tutorial is available.

UPD3: Congratulations to the winners.

  1. tourist
  2. maroonrk
  3. hos.lyric
  4. cnnfls_csy
  5. Um_nik
  6. DearMargaret
  7. jiangly
  8. potato167
  9. kotatsugame
  10. noimi

UPD4: Congratulations to the first solver of each problem.

A: alexwice
B: ksun48
C: PinkieRabbit
D: Um_nik
E: Qingyu
F: amenotiomoi
G: tourist
H: rain_boy
I: maroonrk (after contest)

UPD5: Chinese statements

UPD6: Chinese editorials

Some information from our title sponsor:

Hello, Codeforces!

We, the Ton Foundation team, are pleased to support CodeTON Round 5.

The Open Network (TON) is a fully decentralized layer-1 blockchain designed to onboard billions of users to Web3.

Since July 2022, we have been supporting Codeforces as a title sponsor. This round is another way for us to contribute to the development of the community.

The winners of CodeTON Round 5 will receive valuable prizes.

The first 1,023 participants will receive prizes in TON cryptocurrency:

  • 1st place: 1,024 TON
  • 2–3 places: 512 TON each
  • 4–7 places: 256 TON each
  • 8–15 places: 128 TON each
  • 512–1,023 places: 2 TON each

We wish you good luck at CodeTON Round 5 and hope you enjoy the contest!

Full text and comments »

  • Vote: I like it
  • +1345
  • Vote: I do not like it

By platelet, history, 22 months ago, In English

Wrong answer submission and Accepted submission

You can compare them, the only difference is that the Wrong answer submission has an extra cin.tie(0)->sync_with_stdio(0);

I didn't use cin and cout, just fread and fwrite.

Can someone tell me if there is something wrong with my usage or a compiler bug.

Full text and comments »

  • Vote: I like it
  • +28
  • Vote: I do not like it

By platelet, history, 22 months ago, In English

Given $$$k,m$$$ and $$$n$$$ numbers $$$a_i$$$, compute each $$$a_i\times k\bmod m$$$

I came up with a method that is almost twice as fast as the compiler implementation (when $$$m$$$ is a constant), which can effectively speed up the NTT and some DPs.

First of all

$$$ a_i\times k\bmod m=\{a_i\times \frac km\}\times m\tag1 $$$

where $$$\{x\}=x-\lfloor x\rfloor$$$ is the fractional part function. The principle of equation $$$(1)$$$ is that $$$a_i\times k\bmod m$$$ is only related to the fractional part of $$$\frac{a_i\times k}m$$$.

Let $$$\frac km\approx\frac p{2^{64}}$$$, $$$p$$$ is either equal to $$$\lfloor\frac km\times2^{64}\rfloor$$$ or $$$\lceil\frac km\times2^{64}\rceil$$$, which one to choose, we will decide later, then

$$$ \begin{aligned} a_i\times k\bmod m&\approx\{a_i\times \frac p{2^{64}}\}\times m\\ &=\frac{a_ip\bmod 2^{64}}{2^{64}}\times m\\ &=\frac{a_ip\bmod 2^{64}\times m}{2^{64}} \end{aligned} \tag2 $$$

There are two place we need to choose whether round up or down:

  • $$$p=\lfloor\frac km\times2^{64}\rfloor$$$ or $$$p=\lceil\frac km\times2^{64}\rceil$$$.
  • The final formula in (2) isn't always an integer, so we need to consider whether we round it up or down.

We choose $$$p=\lceil\frac km\times2^{64}\rceil$$$, which will slightly enlarge the answer, while the final formula in (2) rounded down, which will slightly lessen the answer. We'll prove that this set of choice can give us correct answer just if $$$a_i\le\frac{2^{64}}m$$$.

Proof
const int P = 998244353;

void calc(int n, int k, int a[]) {
    unsigned long long p = ((unsigned __int128)k << 64) + P - 1) / P;
    for(int i = 0; i < n; i++)
    	a[i] = (unsigned)a[i] * p * (unsigned __int128)P >> 64;
}

A few notes.

  • The code uses unsigned __int128 so it can only be used in a 64-bit architecture.
  • (unsigned)a[i] * p will automatically be modulo $$$2^{64}$$$.
  • * (unsigned __int128)P >> 64 is 64-bit multiplication retaining only the high 64 bits (in the rdx register), the same speed as multiplying two unsigned long longs together.
  • Converting a[i] to unsigned is because int to unsigned and then to unsigned long long is faster than going directly to unsigned long long, which requires sign extension.

Speed test.

code

It contains two parts.

  • The first part is the reciprocal throughput, the time taken by the CPU to be highly parallel (modern CPUs can be parallelized at instruction level on a single core), containing a total of $$$50000\times50000$$$ modulo multiplications.
  • The second part is the Latency, which is the time taken for each modulo multiplication to be performed sequentially without parallelism, containing a total of $$$50000\times25000$$$ modulo multiplications.

Possible output:

Throughput test(50000 * 50000):
Compiler's signed modulo:   1954.83 ms
Compiler's unsigned modulo: 1746.73 ms
My modulo:                  1160.47 ms

Latency test(50000 * 25000):
Compiler's signed modulo:   4329.33 ms
Compiler's unsigned modulo: 3945.29 ms
My modulo:                  2397.97 ms

By the way, a few general facts.

  • Constant modulo multiplication is almost 4 times faster in parallel than serial (as is modulo multiplication of my invention).
  • int to unsigned then to long long is faster than long long, but negative numbers will be wrong.
  • unsigned long long modulo constants is faster than long long.

Comparison with other methods

Comparison with Barrett reduction and Montgomery multiplication:

  • The purpose of my method is to compute $$$a\times b\bmod m$$$ for fixed $$$m$$$ and $$$b$$$, while Barrett reduction and Montgomery multiplication compute $$$a\times b\bmod m$$$ for fixed m. But my method is faster than the other two methods.

  • The derivation of my method is similar to Barrett reduction. So They both work when $$$m < 2^{32}$$$, while Montgomery multiplication works when $$$m < 2^{64}$$$ and $$$m$$$ is an odd number.

Extensions

It is also possible to compute $$$(a_1b_1+a_2b_2+\cdots+a_nb_n)\bmod m$$$, but $$$\sum a_i$$$ cannot exceed $$$\frac{2^{64}}m$$$.

Let $$$p_i=\lceil\frac{b_i}m\times2^{64}\rceil$$$.

$$$ (\sum a_ib_i)\bmod m=\lfloor\frac{(\sum a_ip_i)\bmod 2^{64}\times m}{2^{64}}\rfloor $$$

Full text and comments »

  • Vote: I like it
  • +311
  • Vote: I do not like it

By platelet, history, 22 months ago, In English

speed test code of this.

#include <bits/stdc++.h>

using namespace std;

const int N = 5e4, P = 998244353;

int a[N];

void ThroughputTest() {
    int checkSum1 = 0, checkSum2 = 0, checkSum3 = 0;

    auto start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2)
        for(int j = 0; j < N; j++) {
            checkSum1 ^= (int64_t)a[i] * a[j] % P;
            checkSum1 ^= (int64_t)a[i + 1] * a[j] % P;
        }
    auto end = std::chrono::steady_clock::now();
    cout << "Compiler's signed modulo:   " << (end - start).count() * 1e-6 << " ms" << endl;

    start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2)
        for(int j = 0; j < N; j++) {
            checkSum2 ^= (uint64_t)(uint32_t)a[i] * (uint32_t)a[j] % P;
            checkSum2 ^= (uint64_t)(uint32_t)a[i + 1] * (uint32_t)a[j] % P;
        }
    end = std::chrono::steady_clock::now();
    cout << "Compiler's unsigned modulo: " << (end - start).count() * 1e-6 << " ms" << endl;

    start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2) {
        uint64_t x = (((__uint128_t)a[i] << 64) + P - 1) / P;
        uint64_t y = (((__uint128_t)a[i + 1] << 64) + P - 1) / P;
        for(int j = 0; j < N; j++) {
            checkSum3 ^= (uint32_t)a[j] * x * (__uint128_t)P >> 64;
            checkSum3 ^= (uint32_t)a[j] * y * (__uint128_t)P >> 64;
        }
    }
    end = std::chrono::steady_clock::now();
    cout << "My modulo:                  " << (end - start).count() * 1e-6 << " ms" << endl;

    assert(checkSum1 == checkSum2 && checkSum2 == checkSum3);
}
void LatencyTest() {
    int checkSum1 = 0, checkSum2 = 0, checkSum3 = 0;

    auto start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2)
        for(int j = 0; j < N / 2; j++) {
            checkSum1 = (int64_t)a[i] * (a[j] ^ checkSum1) % P;
            checkSum1 = (int64_t)a[i + 1] * (a[j] ^ checkSum1) % P;
        }
    auto end = std::chrono::steady_clock::now();
    cout << "Compiler's signed modulo:   " << (end - start).count() * 1e-6 << " ms" << endl;

    start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2)
        for(int j = 0; j < N / 2; j++) {
            checkSum2 = (uint64_t)(uint32_t)a[i] * (uint32_t)(a[j] ^ checkSum2) % P;
            checkSum2 = (uint64_t)(uint32_t)a[i + 1] * (uint32_t)(a[j] ^ checkSum2) % P;
        }
    end = std::chrono::steady_clock::now();
    cout << "Compiler's unsigned modulo: " << (end - start).count() * 1e-6 << " ms" << endl;

    start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2) {
        uint64_t x = (((__uint128_t)a[i] << 64) + P - 1) / P;
        uint64_t y = (((__uint128_t)a[i + 1] << 64) + P - 1) / P;
        for(int j = 0; j < N / 2; j++) {
            checkSum3 = (uint32_t)(a[j] ^ checkSum3) * x * (__uint128_t)P >> 64;
            checkSum3 = (uint32_t)(a[j] ^ checkSum3) * y * (__uint128_t)P >> 64;
        }
    }
    end = std::chrono::steady_clock::now();
    cout << "My modulo:                  " << (end - start).count() * 1e-6 << " ms" << endl;

    assert(checkSum1 == checkSum2 && checkSum2 == checkSum3);
}
int main() {
    mt19937 gen;
    for(int i = 0; i < N; i++) a[i] = gen() % P;
    cout << "Throughput test (50000 * 50000):" << endl;
    ThroughputTest();
    cout << endl;
    cout << "Latency test (50000 * 25000):" << endl;
    LatencyTest();
}

Possible output:

Throughput test(50000 * 50000):
Compiler's signed modulo:   1954.83 ms
Compiler's unsigned modulo: 1746.73 ms
My modulo:                  1160.47 ms

Latency test(50000 * 25000):
Compiler's signed modulo:   4329.33 ms
Compiler's unsigned modulo: 3945.29 ms
My modulo:                  2397.97 ms

Full text and comments »

  • Vote: I like it
  • +67
  • Vote: I do not like it

By platelet, history, 22 months ago, In English

proof of this

Theorem: Let $$$p=\lceil\frac km\times2^{64}\rceil$$$, when $$$a_i\le \frac{2^{64}}m$$$, the computation of the lower rounding is exact.

Proof: Let $$$\frac p{2^{64}}=\frac km+\epsilon$$$, where $$$\epsilon\in[0,\frac1{2^{64}})$$$.

$$$ \begin{aligned} \lfloor\{a_i\times \frac p{2^{64}}\}\times m\rfloor&=\lfloor\{a_i\times (\frac km+\epsilon)\}\times m\rfloor\\ &=\lfloor\{a_i\times \frac km+a_i\epsilon\}\times m\rfloor \end{aligned} $$$

Here if $$$\lfloor a_i\times\frac km+a_i\epsilon\rfloor>\lfloor a_i\times \frac km\rfloor$$$ must be wrong, $$$a_i\times \frac km$$$ is at least $$$\frac1m$$$ away from $$$\lfloor a_i\times \frac km\rfloor+1$$$, so as long as $$$a_i\epsilon<\frac1m$$$ it's OK, let's continue the derivation.

$$$ \begin{aligned} &=\lfloor\{a_i\times \frac km+a_i\epsilon\}\times m\rfloor\\ &=\big\lfloor\left(\{a_i\times \frac km\}+a_i\epsilon\right)m\big\rfloor\\ &=\lfloor\{a_i\times \frac km\}\times m+a_im\epsilon\rfloor \end{aligned} $$$

Since $$$\{a_i\times\frac km\}\times m$$$ is an integer, the result is exact as long as $$$a_im\epsilon<1$$$.

The $$$a_i\epsilon<\frac1m$$$ and $$$a_im\epsilon<1$$$ are the same, and then the condition can be rewritten as $$$a_i\le\frac{2^{64}}m$$$ according to $$$\epsilon<\frac1{2^{64}}$$$.

Full text and comments »

  • Vote: I like it
  • +43
  • Vote: I do not like it