尽快比较缓冲区

发布于 2024-11-09 23:44:43 字数 891 浏览 0 评论 0原文

我需要逐块比较两个缓冲区的相等性。我不需要有关两个缓冲区关系的信息，只要每两个块是否相等即可。我的 intel 机器最高支持 SSE4.2

天真的方法是：

const size_t CHUNK_SIZE = 16; //128bit for SSE2 integer registers
const int ARRAY_SIZE = 200000000;

char* array_1 = (char*)_aligned_malloc(ARRAY_SIZE, 16);
char* array_2 = (char*)_aligned_malloc(ARRAY_SIZE, 16);

for (size_t i = 0; i < ARRAY_SIZE; )
{
    volatile bool result = memcmp(array_1+i, array_2+i, CHUNK_SIZE);
    i += CHUNK_SIZE;
}

与我第一次尝试使用 SSE 相比：

union U
{
    __m128i m;
    volatile int i[4];
} res;

for (size_t i = 0; i < ARRAY_SIZE; )
{
    __m128i* pa1 = (__m128i*)(array_1+i);
    __m128i* pa2 = (__m128i*)(array_2+i);
    res.m = _mm_cmpeq_epi32(*pa1, *pa2);
    volatile bool result =  ( (res.i[0]==0) || (res.i[1]==0) || (res.i[2]==0) || (res.i[3]==0) );
    i += CHUNK_SIZE;
}

速度增益约为 33%。我还能做得更好吗？

原文

I need to compare two buffers chunk-wise for equality. I don't need information about the relation of the two buffers, just if each two chunks are equal or not. My intel machine supports up to SSE4.2

The naive approach is:

const size_t CHUNK_SIZE = 16; //128bit for SSE2 integer registers
const int ARRAY_SIZE = 200000000;

char* array_1 = (char*)_aligned_malloc(ARRAY_SIZE, 16);
char* array_2 = (char*)_aligned_malloc(ARRAY_SIZE, 16);

for (size_t i = 0; i < ARRAY_SIZE; )
{
    volatile bool result = memcmp(array_1+i, array_2+i, CHUNK_SIZE);
    i += CHUNK_SIZE;
}

Compared to my first try using SSE ever:

union U
{
    __m128i m;
    volatile int i[4];
} res;

for (size_t i = 0; i < ARRAY_SIZE; )
{
    __m128i* pa1 = (__m128i*)(array_1+i);
    __m128i* pa2 = (__m128i*)(array_2+i);
    res.m = _mm_cmpeq_epi32(*pa1, *pa2);
    volatile bool result =  ( (res.i[0]==0) || (res.i[1]==0) || (res.i[2]==0) || (res.i[3]==0) );
    i += CHUNK_SIZE;
}

The gain in speed is about 33%. Could I do any better?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

呆头 2024-11-16 23:44:43

您确实不应该使用标量代码和联合来测试所有单独的向量元素 - 相反，请执行以下操作：

for (size_t i = 0; i < ARRAY_SIZE; i += CHUNK_SIZE)
{
    const __m128i a1 = _mm_load_si128(array_1 + i);
    const __m128i a2 = _mm_load_si128(array_2 + i);
    const __m128i vcmp = _mm_cmpeq_epi32(a1, a2);
    const int vmask = _mm_movemask_epi8(vcmp);
    const bool result = (vmask == 0xffff);
    // you probably want to break here if you get a mismatch ???
}

You really shouldn't be using scalar code and unions to test all the individual vector elements - do something like this instead:

for (size_t i = 0; i < ARRAY_SIZE; i += CHUNK_SIZE)
{
    const __m128i a1 = _mm_load_si128(array_1 + i);
    const __m128i a2 = _mm_load_si128(array_2 + i);
    const __m128i vcmp = _mm_cmpeq_epi32(a1, a2);
    const int vmask = _mm_movemask_epi8(vcmp);
    const bool result = (vmask == 0xffff);
    // you probably want to break here if you get a mismatch ???
}

回复收藏 0 原文

隔纱相望 2024-11-16 23:44:43

由于您可以使用 SSE 4.1，因此还有另一种可能更快的替代方案：

for (size_t i = 0; i < ARRAY_SIZE; i += CHUNK_SIZE;)
{
    __m128i* pa1 = (__m128i*)(array_1+i);
    __m128i* pa2 = (__m128i*)(array_2+i);
    __m128i temp = _mm_xor_si128(*pa1, *pa2);
    bool result = (bool)_mm_testz_si128(temp, temp);
}

如果 a & 是，则 _mm_testz_si128(a, b) 返回 0 b != 0 并且如果 a & 则返回 1 b == 0 。优点是您也可以将此版本与新的 AVX 指令一起使用，其中块大小为 32 字节。

Since you can use SSE 4.1, there is another alternative that might be faster:

for (size_t i = 0; i < ARRAY_SIZE; i += CHUNK_SIZE;)
{
    __m128i* pa1 = (__m128i*)(array_1+i);
    __m128i* pa2 = (__m128i*)(array_2+i);
    __m128i temp = _mm_xor_si128(*pa1, *pa2);
    bool result = (bool)_mm_testz_si128(temp, temp);
}

_mm_testz_si128(a, b) returns 0 if a & b != 0 and it returns 1 if a & b == 0. The advantage is that you can use this version with the new AVX instructions as well, where the chunk size is 32 bytes.

回复收藏 0 原文

~没有更多了~