如何在 Sandy Bridge 上将一系列整数中的位快速计数到单独的容器中?

发布于 2024-12-10 11:39:34 字数 505 浏览 3 评论 0原文

更新:请阅读代码,它不是关于计算一个 int 中的位数

是否可以使用一些聪明的汇编器来提高以下代码的性能?

uint bit_counter[64];

void Count(uint64 bits) {
  bit_counter[0] += (bits >> 0) & 1;
  bit_counter[1] += (bits >> 1) & 1;
  // ..
  bit_counter[63] += (bits >> 63) & 1;
}

Count 位于我的算法的最内层循环中。

更新: 架构:x86-64、Sandy Bridge,因此可以使用 SSE4.2、AVX1 和较旧的技术,但不能使用 AVX2 或 BMI1/2。

bits 变量几乎具有随机位(接近一半零和一半一)

Update: Please read the code, it is NOT about counting bits in one int

Is it possible to improve performance of the following code with some clever assembler?

uint bit_counter[64];

void Count(uint64 bits) {
  bit_counter[0] += (bits >> 0) & 1;
  bit_counter[1] += (bits >> 1) & 1;
  // ..
  bit_counter[63] += (bits >> 63) & 1;
}

Count is in the inner-most loop of my algorithm.

Update:
Architecture: x86-64, Sandy Bridge, so SSE4.2, AVX1 and older tech can be used, but not AVX2 or BMI1/2.

bits variable has almost random bits (close to half zeros and half ones)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

高跟鞋的旋律 2024-12-17 11:39:34

您可以尝试使用 SSE 执行此操作,每次迭代增加 4 个元素。

警告:未经测试的代码如下...

#include <stdint.h>
#include <emmintrin.h>

uint32_t bit_counter[64] __attribute__ ((aligned(16)));
                     // make sure bit_counter array is 16 byte aligned for SSE

void Count_SSE(uint64 bits)
{
    const __m128i inc_table[16] = {
        _mm_set_epi32(0, 0, 0, 0),
        _mm_set_epi32(0, 0, 0, 1),
        _mm_set_epi32(0, 0, 1, 0),
        _mm_set_epi32(0, 0, 1, 1),
        _mm_set_epi32(0, 1, 0, 0),
        _mm_set_epi32(0, 1, 0, 1),
        _mm_set_epi32(0, 1, 1, 0),
        _mm_set_epi32(0, 1, 1, 1),
        _mm_set_epi32(1, 0, 0, 0),
        _mm_set_epi32(1, 0, 0, 1),
        _mm_set_epi32(1, 0, 1, 0),
        _mm_set_epi32(1, 0, 1, 1),
        _mm_set_epi32(1, 1, 0, 0),
        _mm_set_epi32(1, 1, 0, 1),
        _mm_set_epi32(1, 1, 1, 0),
        _mm_set_epi32(1, 1, 1, 1)
    };

    for (int i = 0; i < 64; i += 4)
    {
        __m128i vbit_counter = _mm_load_si128(&bit_counter[i]);
                                          // load 4 ints from bit_counter
        int index = (bits >> i) & 15;     // get next 4 bits
        __m128i vinc = inc_table[index];  // look up 4 increments from LUT
        vbit_counter = _mm_add_epi32(vbit_counter, vinc);
                                          // increment 4 elements of bit_counter
        _mm_store_si128(&bit_counter[i], vbit_counter);
    }                                     // store 4 updated ints
}

它是如何工作的:本质上我们在这里所做的就是对原始循环进行向量化,以便我们每次循环迭代处理 4 位而不是 1 位。因此,我们现在有 16 次循环迭代,而不是 64 次。对于每次迭代我们从 bits 加载 4 位,然后将它们用作 LUT 的索引,该 LUT 包含当前 4 位的 4 个增量的所有可能组合。然后我们将这 4 个增量添加到 bit_counter 的当前 4 个元素中。

加载、存储和添加的数量减少了 4 倍,但这将在一定程度上被 LUT 加载和其他内务处理所抵消。不过,您可能仍会看到 2 倍的速度提升。如果您决定尝试的话,我很想知道结果。

You could try doing it with SSE, incrementing 4 elements per iteration.

Warning: untested code follows...

#include <stdint.h>
#include <emmintrin.h>

uint32_t bit_counter[64] __attribute__ ((aligned(16)));
                     // make sure bit_counter array is 16 byte aligned for SSE

void Count_SSE(uint64 bits)
{
    const __m128i inc_table[16] = {
        _mm_set_epi32(0, 0, 0, 0),
        _mm_set_epi32(0, 0, 0, 1),
        _mm_set_epi32(0, 0, 1, 0),
        _mm_set_epi32(0, 0, 1, 1),
        _mm_set_epi32(0, 1, 0, 0),
        _mm_set_epi32(0, 1, 0, 1),
        _mm_set_epi32(0, 1, 1, 0),
        _mm_set_epi32(0, 1, 1, 1),
        _mm_set_epi32(1, 0, 0, 0),
        _mm_set_epi32(1, 0, 0, 1),
        _mm_set_epi32(1, 0, 1, 0),
        _mm_set_epi32(1, 0, 1, 1),
        _mm_set_epi32(1, 1, 0, 0),
        _mm_set_epi32(1, 1, 0, 1),
        _mm_set_epi32(1, 1, 1, 0),
        _mm_set_epi32(1, 1, 1, 1)
    };

    for (int i = 0; i < 64; i += 4)
    {
        __m128i vbit_counter = _mm_load_si128(&bit_counter[i]);
                                          // load 4 ints from bit_counter
        int index = (bits >> i) & 15;     // get next 4 bits
        __m128i vinc = inc_table[index];  // look up 4 increments from LUT
        vbit_counter = _mm_add_epi32(vbit_counter, vinc);
                                          // increment 4 elements of bit_counter
        _mm_store_si128(&bit_counter[i], vbit_counter);
    }                                     // store 4 updated ints
}

How it works: essentially all we are doing here is vectorizing the original loop so that we process 4 bits per loop iteration instead of 1. So we now have 16 loop iterations instead of 64. For each iteration we load 4 bits from bits, then use them as an index into a LUT which contains all possible combinations of 4 increments for the current 4 bits. We then add these 4 increments to the current 4 elements of bit_counter.

The number of loads and stores and adds is reduced by a factor of 4, but this will be offset somewhat by the LUT load and other housekeeping. You may still see a 2x speed up though. I'd be interested to know the result if you do decide to try it.

初见 2024-12-17 11:39:34

也许您可以一次执行 8 个操作,方法是将 8 个位间隔为 8 个,并保留 8 个 uint64 用于计数。不过,每个计数器只有 1 个字节,因此在解压这些 uint64 之前,您只能累积 255 次 count 调用。

Maybe you can do 8 at once, by taking 8 bits spaced 8 apart and keeping 8 uint64's for the counts. That's only 1 byte per single counter though, so you can only accumulate 255 invocations of count before you'd have to unpack those uint64's.

旧情别恋 2024-12-17 11:39:34

查看位调整黑客

编辑 至于“位位置桶累积”(bit_counter[]),我感觉这可能是一个很好的例子valarray +掩蔽。不过,这需要相当多的编码+测试+分析。如果您真的感兴趣,请告诉我。

如今,您可以使用绑定元组(TR1、boost 或 C++11)非常接近 valarray 行为;我有一种感觉,它会变得更容易阅读和更慢的编译。

Look at Bit Twiddling Hacks

Edit As for the 'bit position bucket accumulation' (bit_counter[]) I have a feeling that this might be a good case for valarrays + masking. That'd be a fair bit of coding+testing+profiling though. Let me know if you are really interested.

You could, these days, come very close to valarray behaviour using tied tuples (TR1, boost or C++11); I have a feeling it would come out being simpler to read and slower to compile.

源来凯始玺欢你 2024-12-17 11:39:34

显然,这可以通过“垂直计数器”快速完成。来自现已失效的 Bittrics 页面 (存档@steike

考虑一个普通的整数数组,我们在其中读取位
水平:

<前><代码> msb<-->lsb
x[0] 00000010 = 2
x[1] 00000001 = 1
x[2] 00000101 = 5

垂直计数器存储数字,顾名思义,
垂直;也就是说,一个 k 位计数器存储在 k 个字中,其中
每个字中有一位。

<前><代码> x[0] 00000110 lsb ↑
x[1] 00000001 |
x[2] 00000100 |
x[3] 00000000 |
x[4] 00000000 高位 ↓
第512章

有了这样存储的数字,我们可以使用按位运算来
一次增加它们的任何子集。

我们创建一个位图,其位置对应于 1 位
我们想要增加的计数器,并从 LSB 开始循环遍历数组,
随着我们的进展更新这些位。一项加法中的“携带”变为
数组下一个元素的输入。

 输入总和

-------------------------------------------------- ------------------------------------------
   ABCS
   0 0 0 0
   0 1 0 1 总和 = a ^ b
   1 0 0 1 进位 = a &乙
   1 1 1 1

  进位=输入;
  长 *p = 缓冲区;
  而(进位){
    a = *p; b = 进位;
    *p++ = a ^ b;
    进位 = a &乙;
  }

对于 64 位字,循环平均运行 6-7 次——迭代次数由最长进位链决定。

Apparently this can be done quickly with "vertical counters". From the now-defunct page on Bit tricks (archive) by @steike:

Consider a normal array of integers, where we read the bits
horizontally:

       msb<-->lsb
  x[0]  00000010  = 2
  x[1]  00000001  = 1
  x[2]  00000101  = 5

A vertical counter stores the numbers, as the name implies,
vertically; that is, a k-bit counter is stored across k words, with a
single bit in each word.

  x[0]  00000110   lsb ↑
  x[1]  00000001       |
  x[2]  00000100       |
  x[3]  00000000       |
  x[4]  00000000   msb ↓
             512

With the numbers stored like this, we can use bitwise operations to
increment any subset of them all at once.

We create a bitmap with a 1 bit in the positions corresponding to the
counters we want to increment, and loop through the array from LSB up,
updating the bits as we go. The "carries" from one addition becomes
the input for the next element of the array.

  input  sum

--------------------------------------------------------------------------------
   A B   C S
   0 0   0 0
   0 1   0 1      sum    = a ^ b
   1 0   0 1      carry  = a & b
   1 1   1 1

  carry = input;
  long *p = buffer;
  while (carry) {
    a = *p; b = carry;
    *p++ = a ^ b;
    carry = a & b;
  }

For 64-bit words the loop will run 6-7 times on average -- the number of iterations is determined by the longest chain of carries.

好倦 2024-12-17 11:39:34

你可以像这样展开你的函数。它可能比您的编译器可以执行的速度更快!

//   rax as 64 bit input
   xor  rcx, rcx                //clear addent

   add  rax, rax                //Copy 63th bit to carry flag
   adc  dword ptr [@bit_counter + 63 * 4], ecx    //Add carry bit to counter[64]

   add  rax, rax                //Copy 62th bit to carry flag
   adc  dword ptr [@bit_counter + 62 * 4], ecx    //Add carry bit to counter[63]

   add  rax, rax                //Copy 62th bit to carry flag
   adc  dword ptr [@bit_counter + 61 * 4], ecx    //Add carry bit to counter[62]
//   ...
   add  rax, rax                //Copy 1th bit to carry flag
   adc  dword ptr [@bit_counter + 1 * 4], ecx     //Add carry bit to counter[1]

   add  rax, rax                //Copy 0th bit to carry flag
   adc  dword ptr [@bit_counter], ecx             //Add carry bit to counter[0]

编辑:

您也可以尝试使用双增量,如下所示:

//   rax as 64 bit input
   xor  rcx, rcx                //clear addent
//
   add  rax, rax                //Copy 63th bit to carry flag
   rcl  rcx, 33                 //Mov carry to 32th bit as 0bit of second uint
   add  rax, rax                //Copy 62th bit to carry flag
   adc  qword ptr [@bit_counter + 62 * 8], rcx  //Add rcx to 63th and 62th counters

   add  rax, rax                //Copy 61th bit to carry flag
   rcl  rcx, 33                 //Mov carry to 32th bit as 0bit of second uint
   add  rax, rax                //Copy 60th bit to carry flag
   adc  qword ptr [@bit_counter + 60 * 8], rcx  //Add rcx to 61th and 60th counters
//...

You can unroll your function like this. It is probably faster than what your compiler can do!

//   rax as 64 bit input
   xor  rcx, rcx                //clear addent

   add  rax, rax                //Copy 63th bit to carry flag
   adc  dword ptr [@bit_counter + 63 * 4], ecx    //Add carry bit to counter[64]

   add  rax, rax                //Copy 62th bit to carry flag
   adc  dword ptr [@bit_counter + 62 * 4], ecx    //Add carry bit to counter[63]

   add  rax, rax                //Copy 62th bit to carry flag
   adc  dword ptr [@bit_counter + 61 * 4], ecx    //Add carry bit to counter[62]
//   ...
   add  rax, rax                //Copy 1th bit to carry flag
   adc  dword ptr [@bit_counter + 1 * 4], ecx     //Add carry bit to counter[1]

   add  rax, rax                //Copy 0th bit to carry flag
   adc  dword ptr [@bit_counter], ecx             //Add carry bit to counter[0]

EDIT:

You can try also with double increment like this:

//   rax as 64 bit input
   xor  rcx, rcx                //clear addent
//
   add  rax, rax                //Copy 63th bit to carry flag
   rcl  rcx, 33                 //Mov carry to 32th bit as 0bit of second uint
   add  rax, rax                //Copy 62th bit to carry flag
   adc  qword ptr [@bit_counter + 62 * 8], rcx  //Add rcx to 63th and 62th counters

   add  rax, rax                //Copy 61th bit to carry flag
   rcl  rcx, 33                 //Mov carry to 32th bit as 0bit of second uint
   add  rax, rax                //Copy 60th bit to carry flag
   adc  qword ptr [@bit_counter + 60 * 8], rcx  //Add rcx to 61th and 60th counters
//...
荒岛晴空 2024-12-17 11:39:34

您可以使用一组计数器,每个计数器的大小不同。首先在2位计数器中累加3个值,然后解包并更新4位计数器。当 15 个值准备好时,解包为字节大小的计数器,并在 255 个值后更新 bit_counter[]。

所有这些工作都可以在 128 位 SSE 寄存器中并行完成。在现代处理器上,只需一条指令即可将 1 位解包为 2 位。只需使用 PCLMULQDQ 指令将源四字与其自身相乘即可。这会将源位与零交织。同样的技巧可能有助于将 2 位解包为 4 位。并且可以通过洗牌、解包和简单的逻辑运算来完成 4 和 8 位的解包。

平均性能似乎不错,但代价是需要 120 字节的额外计数器和相当多的汇编代码。

You could use the set of counters, each of different size. Firstly accumulate 3 values in 2-bit counters, then unpack them and update 4-bit counters. When 15 values are ready, unpack to byte-sized counters, and after 255 values update bit_counter[].

All this work may be done in parallel in 128 bit SSE registers. On modern processors only one instruction is needed to unpack 1 bit to 2. Just multiply source quadword to itself with PCLMULQDQ instruction. This will interleave source bits with zeros. The same trick may help to unpack 2 bits to 4. And unpacking of 4 and 8 bits may be done with shuffles, unpacks and simple logical operations.

Average performance seems to be good, but the price is 120 bytes for additional counters and quite a lot of assembly code.

饮惑 2024-12-17 11:39:34

一般来说,没有办法回答这个问题;这一切都取决于编译器
以及底层架构。唯一真正知道的方法就是尝试
不同的解决方案并进行测量。 (例如,在某些机器上,
轮班可能会非常昂贵。对于其他人,没有。)对于初学者,我会使用
像这样:

uint64_t mask = 1;
int index = 0;
while ( mask != 0 ) {
    if ( (bits & mask) != 0 ) {
        ++ bit_counter[index];
    }
    ++ index;
    mask <<= 1;
}

完全展开循环可能会提高性能。
根据架构的不同,将 if 替换为:

bit_counter[index] += ((bits & mask) != 0);

可能会更好。或者更糟糕……不可能提前知道。它是
也有可能在某些机器上,系统地转移到
正如您所做的那样,低阶位和掩蔽将是最好的。

一些优化还取决于典型数据的样子。如果
大多数单词只有一两个位设置,您可能会通过
一次测试一个字节,或一次测试四个位,然后跳过这些
那完全是零。

There's no way to answer this in general; it all depends on the compiler
and the underlying architecture. The only real way to know is to try
different solutions, and measure. (On some machines, for example,
shifts can be very expensive. On others, no.) For starters, I'd use
something like:

uint64_t mask = 1;
int index = 0;
while ( mask != 0 ) {
    if ( (bits & mask) != 0 ) {
        ++ bit_counter[index];
    }
    ++ index;
    mask <<= 1;
}

Unrolling the loop completely will likely improve performance.
Depending on the architecture, replacing the if with:

bit_counter[index] += ((bits & mask) != 0);

might be better. Or worse... it's impossible to know in advance. It's
also possible that on some machines, systematically shifting into the
low order bit and masking, as you are doing, would be best.

Some optimizations will also depend on what typical data looks like. If
most of the words only have one or two bits set, you might gain by
testing a byte at at time, or four bits at a time, and skipping those
that are all zeros completely.

山田美奈子 2024-12-17 11:39:34

如果您计算每个半字节(16 种可能性)在每个偏移量(16 种可能性)出现的频率,您可以轻松地对结果求和。这 256 笔钱很容易保存:

unsigned long nibble_count[16][16]; // E.g. 0x000700B0 corresponds to [4][7] and [2][B]
unsigned long bitcount[64];

void CountNibbles(uint64 bits) {
  // Count nibbles
  for (int i = 0; i != 16; ++i) {
     nibble_count[i][bits&0xf]++;
     bits >>= 4;
  }
}
void SumNibbles() {
  for (int i = 0; i != 16; ++i) {
    for (int nibble = 0; nibble != 16; ++nibble) {
        for(int bitpos = 0; bitpos != 3; ++bitpos) {
           if (nibble & (1<<bitpos)) {
              bitcount[i*4 + bitpos] += nibble_count[i][nibble];
           }
        }
     }
   }
}

If you count how often each nibble (16 possibilities) occurs at each offset (16 possibilities), you can easily sum the results. And those 256 sums are easily kept:

unsigned long nibble_count[16][16]; // E.g. 0x000700B0 corresponds to [4][7] and [2][B]
unsigned long bitcount[64];

void CountNibbles(uint64 bits) {
  // Count nibbles
  for (int i = 0; i != 16; ++i) {
     nibble_count[i][bits&0xf]++;
     bits >>= 4;
  }
}
void SumNibbles() {
  for (int i = 0; i != 16; ++i) {
    for (int nibble = 0; nibble != 16; ++nibble) {
        for(int bitpos = 0; bitpos != 3; ++bitpos) {
           if (nibble & (1<<bitpos)) {
              bitcount[i*4 + bitpos] += nibble_count[i][nibble];
           }
        }
     }
   }
}
坏尐絯 2024-12-17 11:39:34

这相当快:

void count(uint_fast64_t bits){
    uint_fast64_t i64=ffs64(bits);
    while(i64){
        bit_counter[i64-1]++;
        bits=bits & 0xFFFFFFFFFFFFFFFF << i64;
        i64=ffs64(bits);
    }
}

您需要快速实现 ffs 对于 64 位。对于大多数编译器和 CPU 来说,这是一条指令。该循环针对字中的每个位执行一次,因此 bits=0 会非常快,而 1 的 64 位位会较慢。

我在 64 位 Ubuntu 下使用 GCC 对此进行了测试,它产生与您相同的数据输出:

void Count(uint64 bits) {
  bit_counter[0] += (bits >> 0) & 1;
  bit_counter[1] += (bits >> 1) & 1;
  // ..
  bit_counter[63] += (bits >> 63) & 1;
}

速度根据 64 位字中 1 位的数量而变化。

This is fairly fast:

void count(uint_fast64_t bits){
    uint_fast64_t i64=ffs64(bits);
    while(i64){
        bit_counter[i64-1]++;
        bits=bits & 0xFFFFFFFFFFFFFFFF << i64;
        i64=ffs64(bits);
    }
}

You need to have a fast implementation of ffs for 64 bits. For most compiler and CPU's this is a single instruction. The loop is executed once for each bit in the word, so bits=0 will be very fast and bits being 64 bits of 1 will be slower.

I tested this under 64 bit Ubuntu with GCC, and it produces the same data output as your:

void Count(uint64 bits) {
  bit_counter[0] += (bits >> 0) & 1;
  bit_counter[1] += (bits >> 1) & 1;
  // ..
  bit_counter[63] += (bits >> 63) & 1;
}

The speed is variable based on the number of 1 bits in the 64 bit word.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文