Curiosity about SSE Instructions

Hi, today I bring to you a question about something different, and perhaps completely useless anyways... :P

After user logicmachine's brilliant SSE solution to some problem (Link), I wonder how much SSE instructions really help a program's running time. From the comments in that thread, they cut the running time to approximately 1/4, but is it really true?

I personally am not familiar with these things, but if they really cut a program's running time that much, that would be something curious to look into (maybe fit an O(N^2) solution into the TL for N=1e5 ;))

Now you may think: just get better, who needs these magic tricks, they're not even fair :P. But there are always cases where I have a program that is barely over the TL, and micro-optimizations like these would (maybe?) bring it under the TL. There has even been a case where I changed all the ints to shorts just to fit the TL... xD

So just for fun, if anyone can shed some light on SSE instructions (do they really help?), that would be pretty cool!

Thanks,

minimario

#include <immintrin.h>

/* AVX2 Hamming distance; expects len to be a multiple of 64. */
int dist2(const char* s, const char* t, unsigned int len) {
    __m256i b, c, b2, c2, x, x2, y, y2;
    int i = 0, d = len;
    for (; i < len;) {
        int j = 0;
        /* Byte-wise accumulators: every equal byte pair adds -1 (0xFF). */
        x = _mm256_set1_epi32(0);
        x2 = _mm256_set1_epi32(0);
        /* At most 120 iterations per block, so the signed bytes cannot overflow. */
        int lim = len, lim2 = i + 64 * 120;
        if (lim2 < lim) lim = lim2;
        for (; i < lim; i += 64) {
            b = _mm256_loadu_si256((void*)(s + i));
            c = _mm256_loadu_si256((void*)(t + i));
            y = _mm256_cmpeq_epi8(b, c);
            b2 = _mm256_loadu_si256((void*)(s + i + 32));
            c2 = _mm256_loadu_si256((void*)(t + i + 32));
            y2 = _mm256_cmpeq_epi8(b2, c2);
            x = _mm256_add_epi8(x, y);
            x2 = _mm256_add_epi8(x2, y2);
        }
        signed char z[64];
        _mm256_storeu_si256((void*) z, x);
        _mm256_storeu_si256((void*) (z + 32), x2);
        /* d started at len; adding the negative counters subtracts the equal bytes. */
        for (j = 0; j < 64; j++) {
            d += z[j];
        }
    }
    return d;
}

int dist(const char* s, const char* t, unsigned int len) {
    int d = 0;
    /* Handle the tail in scalar code until len is a multiple of 64. */
    while (len % 64) {
        len--;
        d += s[len] != t[len];
    }
    d += dist2(s, t, len);
    return d;
}

Comments (3)


gongy

8 years ago

I'm really interested in this, too. It seems like pretty awesome knowledge to have (even if not always useful), but I'm not sure how to go about learning it.


slycelote

If a large part of your algorithm can be vectorized, then you can indeed get a 4x speedup. Source: I recently fixed a bug at my job that led to a non-SSE matrix multiplication being used at one point.
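One way to see where a factor of about 4 could come from: a 128-bit SSE register holds four 32-bit floats, so one instruction handles four elements. The sketch below is only an illustration, not code from this thread; the function name `mul_arrays_sse` is made up, and the actual gain depends on memory bandwidth and on whether the compiler already auto-vectorizes the plain loop.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* Hypothetical example: out[i] = a[i] * b[i], four floats per instruction. */
void mul_arrays_sse(const float* a, const float* b, float* out, size_t n) {
    size_t i = 0;
    /* Main loop: one unaligned load per input, one multiply, one store. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        out[i] = a[i] * b[i];
    }
}
```

The unaligned load/store variants are used so the arrays need no special alignment.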

microtony


To compute the Hamming distance of two strings, we can use this function:

int dist(const char* s, const char* t, unsigned int len) {
    int i, d = 0;
    for (i = 0; i < len; i++) {
        d += s[i] != t[i];
    }
    return d;
}

If we make 200000 calls with len = 200000, it takes 8.196 seconds.

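This AVX2 implementation (the `#include <immintrin.h>` listing shown above, before the comments) takes only 0.538 seconds on the same workload, which is about 15 times faster. (Edit: this uses AVX2, not plain AVX.)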
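Since the question was about SSE, here is a rough sketch of the same idea using only SSE2, which every x86-64 CPU supports. It is not the code that was timed above, and no timing is claimed for it. The name `dist_sse2` is made up, and `__builtin_popcount` assumes GCC or Clang.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: counts positions where s[i] != t[i]. */
int dist_sse2(const char* s, const char* t, size_t len) {
    size_t i = 0;
    int d = 0;
    /* Compare 16 bytes per iteration. */
    for (; i + 16 <= len; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i*)(s + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(t + i));
        __m128i eq = _mm_cmpeq_epi8(a, b);   /* 0xFF in every equal byte */
        int mask = _mm_movemask_epi8(eq);    /* one bit per byte, 1 = equal */
        /* GCC/Clang builtin (assumption): count equal bytes, keep the rest. */
        d += 16 - __builtin_popcount((unsigned)mask);
    }
    /* Scalar tail for the last len % 16 bytes. */
    for (; i < len; i++) {
        d += s[i] != t[i];
    }
    return d;
}
```

This version turns each comparison into a bit mask right away, which is simpler but does more work per block than the AVX2 code, which accumulates byte counters and sums them only once per 120 iterations.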

minimario's blog