Curiosity about SSE Instructions

Hi, today I bring to you a question about something different, and perhaps completely useless anyways... :P

After user logicmachine's brilliant SSE solution to some problem (Link), I wonder how much SSE instructions really improve a program's running time. From the comments in that thread, they cut the running time to roughly 1/4, but is that really true?

I personally am not familiar with these things, but if they really cut a program's running time that much, that would be something curious to look into (maybe fit the TL with O(N^2) for N=1e5 ;))

Now you think, just get better, who needs these magic tricks, they're not even fair :P. But there are always some cases where I have some program, and it's barely over TL, and microoptimizations like these would (maybe?) bring it down to the TL. There have been cases where I've changed all the ints to shorts just to fit the TL... xD

So just for fun, if anyone can shed some light on SSE instructions (do they really help?), that would be pretty cool!

Thanks,

minimario

/* AVX2 Hamming distance; this appears to be the implementation that
   microtony's comment below refers to. */
#include <immintrin.h>

/* Counts differing bytes; len must be a multiple of 64. */
int dist2(const char* s, const char* t, unsigned int len) {
    __m256i b, c, x, x2, y, y2, b2, c2;
    int i = 0, d = len;
    for (; i < len;) {
        int j = 0;
        x = _mm256_set1_epi32(0);
        x2 = _mm256_set1_epi32(0);
        /* At most 120 iterations per block, so the 8-bit counters cannot overflow. */
        int lim = len, lim2 = i + 64 * 120;
        if (lim2 < lim) lim = lim2;
        for (; i < lim; i += 64) {
            b = _mm256_loadu_si256((const __m256i*)(s + i));
            c = _mm256_loadu_si256((const __m256i*)(t + i));
            y = _mm256_cmpeq_epi8(b, c);    /* 0xFF (-1) in every equal byte */
            b2 = _mm256_loadu_si256((const __m256i*)(s + i + 32));
            c2 = _mm256_loadu_si256((const __m256i*)(t + i + 32));
            y2 = _mm256_cmpeq_epi8(b2, c2);
            x = _mm256_add_epi8(x, y);
            x2 = _mm256_add_epi8(x2, y2);
        }
        signed char z[64];
        _mm256_storeu_si256((__m256i*)z, x);
        _mm256_storeu_si256((__m256i*)(z + 32), x2);
        for (j = 0; j < 64; j++) {
            d += z[j];    /* z[j] is minus the number of matches in lane j */
        }
    }
    return d;
}

int dist(const char* s, const char* t, unsigned int len) {
    int d = 0;
    /* Handle the tail byte by byte until len is a multiple of 64. */
    while (len % 64) {
        len--;
        d += s[len] != t[len];
    }
    d += dist2(s, t, len);
    return d;
}

Comments (3)

gongy

8 years ago

I'm really interested in this, too. It seems like pretty awesome knowledge to have (even if not always useful) but I'm not sure how to go about learning it.

slycelote

If a large part of your algorithm can be vectorized, then you can indeed get a 4x speedup. Source: I recently fixed a bug at my job that led to a non-SSE matrix multiplication at one point.
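To make the "4 times" figure concrete: a 128-bit SSE register holds four 32-bit floats, so one instruction can work on four elements at once. Below is a minimal sketch of a dot product (the inner loop of a matrix multiplication) written with SSE intrinsics. The function name dot_sse is made up for illustration and is not from this thread, and the scalar fallback is only there so the sketch still compiles on machines without SSE. The actual speedup depends on memory bandwidth and on what the compiler already auto-vectorizes, so treat 4x as an upper bound rather than a guarantee.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper (not from the thread): dot product of two float arrays. */
float dot_sse(const float* a, const float* b, size_t n) {
    size_t i = 0;
    float total = 0.0f;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();              /* four partial sums */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);        /* load 4 floats, unaligned */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar loop for the remaining elements (or everything without SSE). */
    for (; i < n; i++) {
        total += a[i] * b[i];
    }
    return total;
}
```

Note that the vector version adds the terms in a different order than a plain loop, so for general floating-point inputs the result can differ in the last bits.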

microtony

To compute the Hamming distance of two strings we can use this function:

int dist(const char* s, const char* t, unsigned int len) {
    unsigned int i;
    int d = 0;
    for (i = 0; i < len; i++) {
        d += s[i] != t[i];
    }
    return d;
}

If we make 200000 calls with len = 200000, it takes 8.196 seconds.

This AVX2 implementation (the dist/dist2 listing shown after the original post on this page) takes only 0.538 seconds for the same workload, about 15 times faster.
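Since the original question was about SSE specifically, here is a rough sketch of the same Hamming distance idea using only 128-bit SSE2, which processes 16 bytes per step instead of 32. This is not code from the thread: the name dist_sse2 is hypothetical, and instead of capping the inner loop to avoid overflowing 8-bit counters, it uses _mm_sad_epu8 to fold the per-byte matches into 64-bit sums on every iteration. I have not timed it, so no speed claim is attached.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not from the thread): Hamming distance with SSE2. */
int dist_sse2(const char* s, const char* t, unsigned int len) {
    unsigned int i = 0;
    int d = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i total = _mm_setzero_si128();        /* two 64-bit match counters */
    for (; i + 16 <= len; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i*)(s + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(t + i));
        __m128i eq = _mm_cmpeq_epi8(a, b);      /* 0xFF in every equal byte */
        __m128i ones = _mm_sub_epi8(zero, eq);  /* 0 - 0xFF = 1 in every equal byte */
        /* Sum each group of 8 bytes into one 64-bit lane, then accumulate. */
        total = _mm_add_epi64(total, _mm_sad_epu8(ones, zero));
    }
    unsigned long long lanes[2];
    _mm_storeu_si128((__m128i*)lanes, total);
    /* i bytes were compared so far; the differing ones are those that did not match. */
    d = (int)i - (int)(lanes[0] + lanes[1]);
#endif
    /* Scalar loop for the last len % 16 bytes (or everything without SSE2). */
    for (; i < len; i++) {
        d += s[i] != t[i];
    }
    return d;
}
```

On x86-64, SSE2 is part of the baseline instruction set, so this should build without extra flags, whereas the AVX2 version needs something like -mavx2 on GCC or Clang and a CPU that supports it.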