Блог пользователя minimario

Автор minimario, история, 8 лет назад, По-английски

Hi, today I bring to you a question about something different, and perhaps completely useless anyways... :P

After user logicmachine's brilliant SSE solution to some problem (Link), I wonder how much SSE instructions really help program run speed. From the comments in that thread, it helps cut the speed by a factor of approximately 1/4, but is it really true?

I personally am not familiar with these things, but if they will really cut the speed of a program, that would be something curious to look into (maybe fit the TL with O(N^2) for N=1e5 ;))

Now you think, just get better, who needs these magic tricks, they're not even fair :P. But there are always some cases where I have some program, and it's barely over TL, and microoptimizations like these would (maybe?) bring it down to the TL. There's been some case where I've changed all the ints to shorts just to fit the TL... xD

So just for fun, if anyone can provide some light on SSE instructions (do they really help?), that would be pretty cool!

Thanks,

minimario

  • Проголосовать: нравится
  • +33
  • Проголосовать: не нравится

»
8 лет назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

I'm really interested in this, too. It seems like pretty awesome knowledge to have (even if not always useful) but I"m not sure how to go about learning it.

»
8 лет назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

If a large part of your algorithm can be vectorized, then you can indeed get a 4 times speedup. Source: recently fixed a bug at my job that lead to a non-SSE matrix multiplication at one point.

»
8 лет назад, # |
Rev. 2   Проголосовать: нравится +10 Проголосовать: не нравится

To compute the Hamming Distance of two strings we can use this function:

If we make 200000 calls with len = 200000, it takes 8.196 seconds.

int dist(const char* s, const char* t, unsigned int len) {
    int i, d = 0;
    for (i = 0; i < len; i++) {
        d += s[i] != t[i];
    }
    return d;
}

This AVX implementation only takes 0.538 seconds: (Edit: this uses AVX2)

#include <immintrin.h>

int dist2(const char* s, const char* t, unsigned int len) {
  __m256i a, b, c, e, f, g, h, x, x2, y, y2, b2, c2;
  int i = 0, d = len;
  for (; i < len;) {
    int j = 0;
    x = _mm256_set1_epi32(0);
    x2 = _mm256_set1_epi32(0);
    int lim = len, lim2 = i + 64 * 120;
    if (lim2 < lim) lim = lim2;
    for (; i < lim; i += 64) {
      b = _mm256_loadu_si256((void*)(s + i));
      c = _mm256_loadu_si256((void*)(t + i));
      y =  _mm256_cmpeq_epi8(b, c);
      b2 = _mm256_loadu_si256((void*)(s + i + 32));
      c2 = _mm256_loadu_si256((void*)(t + i + 32));
      y2 =  _mm256_cmpeq_epi8(b2, c2);
      x = _mm256_add_epi8(x, y);
      x2 = _mm256_add_epi8(x2, y2);
    }
    signed char z[64];
    _mm256_storeu_si256((void*) z, x);
    _mm256_storeu_si256((void*) (z + 32), x2);
    for (j = 0; j < 64; j++) {
      d += z[j];
    }
  }
  return d;
}

int dist(const char* s, const char* t, unsigned int len) {
  int i = 0, d = 0;
  while (len % 64) {
    len--;
    d += s[len] != t[len];
  }
  d += dist2(s, t, len);
  return d;
}