调整麻省理工学院的比特计数算法来并行计算单词数?

发布于 2024-11-16 10:19:44 字数 1737 浏览 7 评论 0 原文

我想使用众所周知的 MIT 位计数算法的一个版本,使用 SSE2 指令来计算 Conway 生命游戏中的邻居。

这是 c 中的 MIT 位计数,扩展为 count bitcounts > 63 位。

int bitCount(unsigned long long n)
{
unsigned long long uCount;

uCount = n – ((n >> 1) & 0×7777777777777777)
           - ((n >> 2) & 0×3333333333333333)
           - ((n >> 3) & 0×1111111111111111);
return ((uCount + (uCount >> 4))
& 0x0F0F0F0F0F0F0F0F) % 255;
}

这是 Pascal 的一个版本,

function bitcount(n: uint64): cardinal;
var ucount: uint64;
begin
  ucount:= n - ((n shr 1) and $7777777777777777)
             - ((n shr 2) and $3333333333333333) 
             - ((n shr 3) and $1111111111111111);
  Result:= ((ucount + (count shr 4)) 
           and $0F0F0F0F0F0F0F0F) mod 255;
end;

我希望并行计算该结构中的位。

  32-bit word where the pixels are laid out as follows.
  lo-byte         lo-byte neighbor
  0 4 8 C  048C   0 4 8 C 
   +---------------+
  1|5 9 D  159D   1|5 9 D 
   |               |
  2|6 A E  26AE   2|6 A E  
   +---------------+
  3 7 B F  37BF   3 7 B F 
 |-------------|            << slice A
   |---------------|        << slice B
     |---------------|      << slice C

请注意该结构中间有 16 位需要查找。 我想使用 SSE2 计算中间 16 位中每一位的邻居计数。 为了做到这一点,我将切片 A 放入 XMM0 低位双字中,将切片 B 放入 XXM0-dword1 等中。
我将 XMM0 复制到 XMM1,并屏蔽 XMM0 低位字中的位 5 的位 012-456-89A,对 XMM0 的 word1 执行相同操作,依此类推。不同的切片和掩码,以确保 XMM0 和 XMM1 中的每个单词保存不同像素的邻居。

问题
如何调整 MIT 位数以最终得到每个 XMM 字中每个字/像素的位数?

备注
我不想使用查找表,因为我已经有了这种方法,并且我想 测试 SSE2 是否会通过不需要对查找表进行内存访问来加速该过程。

使用 SSE 汇编的答案将是最佳的,因为我在 Delphi 中对此进行编程,因此我使用 x86+SSE2 汇编代码。

I want to use a version of the well known MIT bitcount algorithm to count neighbors in Conway's game of life using SSE2 instructions.

Here's the MIT bitcount in c, extended to count bitcounts > 63 bits.

int bitCount(unsigned long long n)
{
unsigned long long uCount;

uCount = n – ((n >> 1) & 0×7777777777777777)
           - ((n >> 2) & 0×3333333333333333)
           - ((n >> 3) & 0×1111111111111111);
return ((uCount + (uCount >> 4))
& 0x0F0F0F0F0F0F0F0F) % 255;
}

Here's a version in Pascal

function bitcount(n: uint64): cardinal;
var ucount: uint64;
begin
  ucount:= n - ((n shr 1) and $7777777777777777)
             - ((n shr 2) and $3333333333333333) 
             - ((n shr 3) and $1111111111111111);
  Result:= ((ucount + (count shr 4)) 
           and $0F0F0F0F0F0F0F0F) mod 255;
end;

I'm looking to count the bits in this structure in parallel.

  32-bit word where the pixels are laid out as follows.
  lo-byte         lo-byte neighbor
  0 4 8 C  048C   0 4 8 C 
   +---------------+
  1|5 9 D  159D   1|5 9 D 
   |               |
  2|6 A E  26AE   2|6 A E  
   +---------------+
  3 7 B F  37BF   3 7 B F 
 |-------------|            << slice A
   |---------------|        << slice B
     |---------------|      << slice C

Notice how this structure has 16 bits in the middle that need to be looked up.
I want to calculate neighbor counts for each of the 16 bits in the middle using SSE2.
In order to do this I put slice A in XMM0 low-dword, slice B in XXM0-dword1 etc.
I copy XMM0 to XMM1 and I mask off bits 012-456-89A for bit 5 in the low word of XMM0, do the same for word1 of XMM0, etc. using different slices and masks to make sure each word in XMM0 and XMM1 holds the neighbors for a different pixel.

Question
How do I tweak the MIT-bitcount to end up with a bitcount per word/pixel in each XMM word?

Remarks
I don't want to use a lookup table, because I already have that approach and I want to
test to see if SSE2 will speed up the process by not requiring memory accesses to the lookup table.

An answer using SSE assembly would be optimal, because I'm programming this in Delphi and I'm thus using x86+SSE2 assembly code.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

凉风有信 2024-11-23 10:19:44

MIT 算法很难在 SSE2 中实现,因为没有可用于最终 ... % 255 表达式的整数模指令。在各种 popcnt 方法中,最适合 SSE 的方法可能是 Henry S. Warren 的“Hackers Delight”,我使用 SSE 内在函数在 C 中实现了它:

#include <stdio.h>
#include <emmintrin.h>

__m128i _mm_popcnt_epi16(__m128i v)
{
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x5555)), _mm_and_si128(_mm_srli_epi16(v, 1), _mm_set1_epi16(0x5555)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x3333)), _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi16(0x3333)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x0f0f)), _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi16(0x0f0f)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x00ff)), _mm_and_si128(_mm_srli_epi16(v, 8), _mm_set1_epi16(0x00ff)));
    return v;
}

int main(void)
{
    __m128i v0 = _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0);
    __m128i v1;

    v1 = _mm_popcnt_epi16(v0);

    printf("v0 = %vhd\n", v0);
    printf("v1 = %vhd\n", v1);

    return 0;
}

编译和测试如下:

$ gcc -Wall -msse2 _mm_popcnt_epi16.c -o _mm_popcnt_epi16
$ ./_mm_popcnt_epi16 
v0 = 0 1 2 3 4 5 6 7
v1 = 0 1 1 2 1 2 2 3
$ 

看起来就像大约 16 条算术/逻辑指令一样,因此它应该以每点大约 16 / 8 = 2 个时钟运行。

如果需要,您可以轻松地将其转换为原始汇编程序 - 每个内在函数映射到单个指令。

The MIT algorithm would be tough to implement in SSE2, since there is no integer modulus instruction which could be used for the final ... % 255 expression. Of the various popcnt methods out there, the one that lends itself to SSE most easily and efficiently is probably the first one in Chapter 5 of "Hackers Delight" by Henry S. Warren, which I have implemented here in C using SSE intrinsics:

#include <stdio.h>
#include <emmintrin.h>

__m128i _mm_popcnt_epi16(__m128i v)
{
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x5555)), _mm_and_si128(_mm_srli_epi16(v, 1), _mm_set1_epi16(0x5555)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x3333)), _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi16(0x3333)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x0f0f)), _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi16(0x0f0f)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x00ff)), _mm_and_si128(_mm_srli_epi16(v, 8), _mm_set1_epi16(0x00ff)));
    return v;
}

int main(void)
{
    __m128i v0 = _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0);
    __m128i v1;

    v1 = _mm_popcnt_epi16(v0);

    printf("v0 = %vhd\n", v0);
    printf("v1 = %vhd\n", v1);

    return 0;
}

Compile and test as follows:

$ gcc -Wall -msse2 _mm_popcnt_epi16.c -o _mm_popcnt_epi16
$ ./_mm_popcnt_epi16 
v0 = 0 1 2 3 4 5 6 7
v1 = 0 1 1 2 1 2 2 3
$ 

It looks like around 16 arithmetic/logical instructions so it should run at around 16 / 8 = 2 clocks per point.

You can easily convert this to raw assembler if you need to - each intrinsic maps to a single instruction.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文