Multiplying 64-bit numbers modulo a 64-bit modulus with inline assembly in GCC

The GCC compiler supports inline assembly. This can be useful, for example, for multiplying two 64-bit numbers modulo a 64-bit modulus.

The point is that when the processor multiplies two 64-bit registers (with the one-operand form of the instruction, one factor is taken from rax), it stores the 128-bit result in a pair of registers: rdx (the upper half) and rax (the lower half). Division works in a similar way: the dividend is taken from rdx and rax, after which the quotient is stored in rax and the remainder in rdx.
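To make the register layout concrete, here is a small sketch in plain C++ (no assembly) that mimics it with GCC's unsigned __int128. The product is split into two 64-bit halves the way rdx and rax would hold it, and the pair is then divided by a modulus. The names Product, DivResult, split_product and divide_pair are made up for this illustration.

```cpp
#include <cstdint>

using u64 = std::uint64_t;
using u128 = unsigned __int128;  // GCC/Clang extension

// After a 64x64-bit multiplication, hi is what rdx would hold and lo is rax.
struct Product { u64 hi, lo; };

Product split_product(u64 a, u64 b) {
    u128 p = (u128)a * b;
    return { (u64)(p >> 64), (u64)p };
}

// After a division, the quotient would be in rax and the remainder in rdx.
struct DivResult { u64 quotient, remainder; };

// Divides the 128-bit value hi:lo by m.
// Note: the hardware instruction faults when the quotient does not fit in
// 64 bits (that is, when hi >= m); this sketch just truncates it instead.
DivResult divide_pair(Product p, u64 m) {
    u128 v = ((u128)p.hi << 64) | p.lo;
    return { (u64)(v / m), (u64)(v % m) };
}
```

For example, divide_pair(split_product(a, b), m).remainder is the value that the assembly routine is supposed to leave in rdx.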

Using this knowledge, we can implement an equivalent of the following function:

inline long long mul(long long a, long long b) {
	return (__int128)a * b % 1000000014018503;
}

Like this:

inline long long mul(long long a, long long b) {
	long long res;
	asm(
		"mov %1, %%rax\n"
		"mov %2, %%rbx\n"
		"imul %%rbx\n"
		"mov $1000000014018503, %%rbx\n"
		"idiv %%rbx\n"
		"mov %%rdx, %0\n"
		:"=res"(res)
		:"a"(a), "b"(b)
	);
	return res;
}

We declare the variable res as written and the variables a and b as read. They are referred to as %0, %1 and %2 respectively. The instructions are written in standard AT&T syntax.
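The same operand lists are also where GCC learns which registers the block overwrites. As a hedged sketch (not the code from the post), here is a variant that keeps the %0, %1, %2 numbering but takes a and b in compiler-chosen registers and lists every register it writes in the clobber section. The name mul_declared, the use of rcx for the modulus and the __int128 fallback for non-x86-64 targets are assumptions of this example. It assumes |a| and |b| are below the modulus, so the quotient fits in 64 bits and idiv cannot fault.

```cpp
// Signed (a * b) % 1000000014018503 with all clobbered registers declared.
inline long long mul_declared(long long a, long long b) {
#if defined(__x86_64__) && defined(__GNUC__)
    long long res;
    asm(
        "mov %1, %%rax\n\t"                       // rax = a
        "imul %2\n\t"                             // rdx:rax = rax * b
        "movabs $1000000014018503, %%rcx\n\t"     // rcx = modulus (64-bit immediate)
        "idiv %%rcx\n\t"                          // rax = quotient, rdx = remainder
        "mov %%rdx, %0"                           // res = remainder
        : "=r"(res)                               // %0: any free register
        : "r"(a), "r"(b)                          // %1, %2: any free registers
        : "rax", "rcx", "rdx", "cc"               // registers and flags we overwrite
    );
    return res;
#else
    // Portable fallback so the sketch still compiles on other targets.
    return (long long)((__int128)a * b % 1000000014018503LL);
#endif
}
```

Because rax, rcx and rdx appear in the clobber list, the compiler will not place res, a or b in them and will not keep live values there across the block.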

Now you know how to write hashes modulo a 64-bit modulus, which is comparable in strength to using a pair of hashes modulo 32-bit moduli, without using __int128.
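As an illustration of the hashing use case, here is a minimal polynomial string hash. For portability it is built on the __int128 reference version of mul from the post; an assembly version with the same signature would be a drop-in replacement. The base 131 and the name poly_hash are arbitrary choices for this sketch.

```cpp
#include <string>

const long long MOD = 1000000014018503LL;  // the modulus used in the post

// Reference implementation of mul from the post.
inline long long mul(long long a, long long b) {
    return (long long)((__int128)a * b % MOD);
}

// hash(s) = s[0]*B^(n-1) + s[1]*B^(n-2) + ... + s[n-1]  (mod MOD)
long long poly_hash(const std::string& s, long long base = 131) {
    long long h = 0;
    for (unsigned char c : s) {
        // mul(...) < MOD and c < 256, so the sum cannot overflow long long.
        h = (mul(h, base) + c) % MOD;
    }
    return h;
}
```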

Comments (10)

clyring

+32

You haven't declared any of the fixed registers you clobber with this code, so it's terrible undefined behavior: If the compiler was using rax for anything you are toast. Also, 64-bit idiv is very slow on some systems: You may find a floating-point-based method much faster. (And for hashing applications you can probably use Montgomery reduction instead of "ordinary" modmul for even better performance.)
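For readers unfamiliar with the Montgomery reduction mentioned here, a minimal sketch for an odd 64-bit modulus follows. It is one possible formulation, not code from this thread: the struct name Montgomery and its members are made up for illustration, and unsigned __int128 is used only for the 64x64 multiplications and a one-time setup division.

```cpp
#include <cstdint>

using u64 = std::uint64_t;
using u128 = unsigned __int128;  // GCC/Clang extension

// Montgomery arithmetic for an odd modulus n < 2^64 (illustrative sketch).
struct Montgomery {
    u64 n;     // the modulus, must be odd
    u64 ninv;  // n^{-1} modulo 2^64
    u64 r2;    // 2^128 mod n, used to convert into Montgomery form

    explicit Montgomery(u64 mod) : n(mod), ninv(mod) {
        // Newton iteration: n*n = 1 (mod 8) for odd n, and each step
        // doubles the number of correct low bits (3, 6, 12, 24, 48, 96).
        for (int i = 0; i < 5; ++i) ninv *= 2 - n * ninv;
        u64 r = (0 - n) % n;          // 2^64 mod n
        r2 = (u64)((u128)r * r % n);  // one-time division during setup
    }

    // Returns x / 2^64 mod n for x < n * 2^64, without any division.
    u64 reduce(u128 x) const {
        u64 q = (u64)x * ninv;               // q * n has the same low 64 bits as x
        u64 m = (u64)(((u128)q * n) >> 64);  // high half of q * n
        u64 hi = (u64)(x >> 64);
        return hi >= m ? hi - m : hi - m + n;
    }

    u64 to_mont(u64 a) const { return reduce((u128)a * r2); }    // a * 2^64 mod n
    u64 from_mont(u64 a) const { return reduce(a); }             // back to a plain residue
    u64 mul(u64 a, u64 b) const { return reduce((u128)a * b); }  // operands in Montgomery form
};
```

The conversions cost extra multiplications, so this pays off when many multiplications are chained on values kept in Montgomery form, as in a rolling hash; a one-off product would be from_mont(mul(to_mont(a), to_mont(b))).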

ToxicPie9

+16

I see a few problems with this code:

- Codeforces runs on Windows, so rbx should be preserved (source); otherwise it may cause trouble when combined with GCC-generated code.
- The assembly causes many values to be moved around, as you can see in the compiled code, which makes it inefficient.
  - If you want to write an entire function in asm, I suggest using GCC's __attribute__((naked)) (source).
- Integer division instructions are very slow, and division by a constant can be optimized a lot. You can find many resources on fast division on Codeforces (like this blog or this).

That being said, using x86 instructions directly is significantly faster than running __int128 division (which calls a large, slower function __modti3) when you only need 64 bits of modulus and output.

Here is my version of the function in assembly (Windows calling convention):

__attribute__((naked)) long long modmul(long long, long long, long long) {
    asm(R"(
        mov %rcx, %rax
        imul %rdx
        idiv %r8
        mov %rdx, %rax
        ret
    )");
}

For System V (sysv) users:

__attribute__((naked)) long long modmul(long long, long long, long long) {
    asm(R"(
        mov %rdi, %rax
        mov %rdx, %rcx
        imul %rsi
        idiv %rcx
        mov %rdx, %rax
        ret
    )");
}

Cuellius

Codeforces has a gym called "Fast modular multiplication", where I tested how fast the inline assembly is.

Inline assembly: ~1326 ms

//                                         RCX,      RDX,       R8 
__attribute__((naked)) uint64_t mulmod64(uint64_t, uint64_t, uint64_t) 
{
    asm(R"(
        .intel_syntax noprefix
        mov rax, rcx
        mul rdx
        div r8
        mov rax, rdx
        ret
        .att_syntax prefix
    )");
}
 
uint64_t prod(const test& t)
{
    return mulmod64(t.x, t.y, t.modulo);
}

unsigned __int128 multiplication: ~1482 ms

uint64_t prod(const test& t)
{
    return uint64_t((unsigned __int128)t.x * t.y % t.modulo);
}

So the inline assembly is slightly faster, but not significantly so.

ssvb

2 years ago

It's also possible to implement an inline assembly version without any function call overhead:

inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t c) {
	uint64_t res;
	asm(
		"mul %2\n"
		"div %3\n"
		:"=&d"(res), "+a"(a)
		:"r"(b), "r"(c)
		:"cc"
	);
	return res;
}

But almost all CPU cycles are spent on executing the super slow division instruction in all code variants. The __int128 variant without inline assembly is also using the same division instruction (after some extra checks to ensure that division overflows won't be triggered).

Yes, and it is faster: ~1248 ms. Your function can also be inlined, which removes one jmp and two mov instructions, even though on Linux the standalone function compiles to the same assembly as mine (up to instruction order and the ud2 opcode): https://godbolt.org/z/r5MeMxdbM

On Windows your function produces one extra mov, but inlining removes one jmp and one mov.

Dump (on Windows)

0000000000000000 <mulmod64(unsigned long long, unsigned long long, unsigned long long)>:
   0:	48 89 c8             	mov    rax,rcx
   3:	48 f7 e2             	mul    rdx
   6:	49 f7 f0             	div    r8
   9:	48 89 d0             	mov    rax,rdx
   c:	c3                   	ret    
   d:	0f 0b                	ud2    
   f:	90                   	nop

0000000000000010 <mulmod64_v2(unsigned long long, unsigned long long, unsigned long long)>:
  10:	49 89 d2             	mov    r10,rdx
  13:	48 89 c8             	mov    rax,rcx
  16:	49 f7 e2             	mul    r10
  19:	49 f7 f0             	div    r8
  1c:	48 89 d0             	mov    rax,rdx
  1f:	c3                   	ret    

0000000000000020 <prod(test const&)>:
  20:	48 8b 51 08          	mov    rdx,QWORD PTR [rcx+0x8]
  24:	4c 8b 41 10          	mov    r8,QWORD PTR [rcx+0x10]
  28:	48 8b 09             	mov    rcx,QWORD PTR [rcx]
  2b:	eb d3                	jmp    0 <mulmod64(unsigned long long, unsigned long long, unsigned long long)>
  2d:	0f 1f 00             	nop    DWORD PTR [rax]

0000000000000030 <prod_v2(test const&)>:
  30:	4c 8b 41 08          	mov    r8,QWORD PTR [rcx+0x8]
  34:	4c 8b 49 10          	mov    r9,QWORD PTR [rcx+0x10]
  38:	48 8b 01             	mov    rax,QWORD PTR [rcx]
  3b:	49 f7 e0             	mul    r8
  3e:	49 f7 f1             	div    r9
  41:	48 89 d0             	mov    rax,rdx
  44:	c3                   	ret

Death_on_2_Legs

21 months ago

If you really need 64-bit modular multiplication, consider this variant instead:

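using ll = long long;
using uint64 = unsigned long long;
// Computes a * b % M using a long double estimate of the quotient.
uint64 modmul(uint64 a, uint64 b, uint64 M) {
	// a * b and M * q both wrap modulo 2^64; their difference is the remainder, off by at most one M
	ll ret = a * b - M * uint64(1.L / M * a * b);
	// fix up an estimate that came out one too high or one too low
	return ret + M * (ret < 0) - M * (ret >= (ll)M);
}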

There is a proof that it works here: https://github.com/kth-competitive-programming/kactl/blob/main/doc/modmul-proof.pdf

This is much faster. It gives wrong results only for some very large 64-bit values, which are almost certainly much bigger than any modulus you would choose (KACTL states the bound as 0 <= a, b <= M <= 7.2 * 10^18).
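A quick way to gain some confidence in this variant (it is no substitute for the linked proof) is to compare it with the exact __int128 result on random inputs. The sketch below assumes GCC on x86-64, where long double is the 80-bit extended type; count_mismatches, the fixed seed and the iteration count are arbitrary choices for this example.

```cpp
#include <random>

using ll = long long;
using uint64 = unsigned long long;

// The floating-point variant from the comment above.
uint64 modmul(uint64 a, uint64 b, uint64 M) {
    ll ret = a * b - M * uint64(1.L / M * a * b);
    return ret + M * (ret < 0) - M * (ret >= (ll)M);
}

// Counts how often modmul disagrees with the exact 128-bit computation
// for random a, b in [0, M).
int count_mismatches(uint64 M, int iterations) {
    std::mt19937_64 rng(12345);  // fixed seed for reproducibility
    int bad = 0;
    for (int i = 0; i < iterations; ++i) {
        uint64 a = rng() % M, b = rng() % M;
        uint64 exact = (uint64)((unsigned __int128)a * b % M);
        if (modmul(a, b, M) != exact) ++bad;
    }
    return bad;
}
```

Calling count_mismatches(1000000014018503ULL, 100000) exercises the modulus from the original post.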

21 months ago

Thanks!

bashkort

All hail our emperor Stas

Gornak40's blog