Count the number of set bits in a 32-bit integer
8 bits representing the number 7 look like this:
00000111
Three bits are set.
What are the algorithms to determine the number of set bits in a 32-bit integer?
C++20
std::popcount
The following proposal has been merged http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0553r4.html and should add it to a the
<bit>
header.I expect the usage to be like:
I'll give it a try when support arrives to GCC, GCC 9.1.0 with
g++-9 -std=c++2a
still doesn't support it.The proposal says:
and:
std::rotl
andstd::rotr
were also added to do circular bit rotations: Best practices for circular shift (rotate) operations in C++我大约在 1990 年为 RISC 机器编写了一个快速位计数宏。它不使用高级算术(乘法、除法、%)、内存获取(太慢)、分支(太慢),但它确实假设 CPU 有一个32 位桶式移位器(换句话说,>> 1 和>> 32 占用相同数量的周期。)它假设小常量(例如 6、12、24)不需要任何成本即可加载到寄存器中,或者存储在临时文件中并一遍又一遍地重复使用。
I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), or branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same number of cycles). It assumes that small constants (such as 6, 12, 24) cost nothing to load into registers, or are stored in temporaries and reused over and over again.
With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lowerbound.
Here is a secret to the first and most complex step:
so if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.
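The 8-row table didn't survive the copy, but the first step's identity can be checked exhaustively: for a 2-bit value AB, subtracting the high bit A (shifted down) leaves the bit count A+B. A small verification sketch (my illustration, not the original macro):

```cpp
// For every 2-bit value ab, (ab - (ab >> 1)) equals its number of set bits
bool check_two_bit_identity() {
    for (unsigned ab = 0; ab < 4; ++ab) {
        unsigned count = (ab & 1) + ((ab >> 1) & 1); // popcount by hand
        if (ab - (ab >> 1) != count)
            return false;
    }
    return true;
}
```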
There are many algorithms to count the set bits, but I think the best one is the fastest one!
You can see the details on this page:
Bit Twiddling Hacks
I suggest this one:
Counting bits set in 14, 24, or 32-bit words using 64-bit instructions
This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 10; and the third option takes 15.
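The third (32-bit) option from that page looks like the following, from memory; the magic constants should be checked against the page itself before relying on them:

```cpp
#include <cstdint>

// Bit Twiddling Hacks style: each 12-bit (or 8-bit) chunk is replicated by
// the multiply, masked to every 5th bit, and summed by % 31 (2^5 == 1 mod 31)
int popcount32_mod(uint32_t v) {
    uint64_t c;
    c  = ((v & 0xFFF) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1F;
    c += (((v & 0xFFF000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1F;
    c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1F;
    return static_cast<int>(c);
}
```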
I always use this in competitive programming, and it's easy to write and is efficient:
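The snippet didn't survive the copy; in competitive programming this is typically the one-line GCC/Clang builtin:

```cpp
// GCC/Clang builtin; compiles to a single popcnt instruction when available
unsigned count_bits(unsigned n) {
    return __builtin_popcount(n);
}
```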
I found an implementation of bit counting over an array using SIMD instructions (SSSE3 and AVX2). It performs 2-2.5 times better than using the __popcnt64 intrinsic.
SSSE3 version:
AVX2 version:
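Neither SIMD listing survived the copy. The heart of both is a 16-entry per-nibble lookup table that PSHUFB/VPSHUFB applies to 16/32 bytes in parallel; here is a plain scalar sketch of that same lookup idea (my illustration, not the original code):

```cpp
#include <cstdint>
#include <cstddef>

// Per-nibble popcount table; the SIMD versions hold this in a vector register
static const uint8_t kNibbleCount[16] =
    {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

uint64_t count_bits_array(const uint8_t* data, size_t len) {
    uint64_t total = 0;
    for (size_t i = 0; i < len; ++i)
        total += kNibbleCount[data[i] & 0x0F] + kNibbleCount[data[i] >> 4];
    return total;
}
```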
A fast C# solution using a pre-calculated table of Byte bit counts with branching on the input size.
Java JDK1.5
Integer.bitCount(n);
where n is the number whose 1's are to be counted.
check also,
I'm particularly fond of this example from the fortune file:
I like it best because it's so pretty!
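The snippet itself is missing here; the macro widely quoted from the fortune file (reproduced from memory, so verify before relying on it) is:

```cpp
// Per-nibble counts by repeated subtraction, then fold nibbles and sum
// the bytes with % 255 (since 256 == 1 mod 255)
#define BX_(x) ((x) - (((x) >> 1) & 0x77777777) \
                    - (((x) >> 2) & 0x33333333) \
                    - (((x) >> 3) & 0x11111111))
#define BITCOUNT(x) (((BX_(x) + (BX_(x) >> 4)) & 0x0F0F0F0F) % 255)
```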
You can do:
The logic behind this is that the bits of n-1 are inverted starting from the rightmost set bit of n.
If n=6, i.e., 110, then 5 is 101; the bits are inverted starting from the rightmost set bit of n.
So if we
&
these two, we will make the rightmost set bit 0 in every iteration and always move on to the next rightmost set bit, counting the set bits as we go. The worst-case time complexity will be O(log n), when every bit is set.
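The missing code is presumably the usual clear-the-lowest-set-bit loop:

```cpp
// Count set bits by repeatedly clearing the rightmost set bit
int count_set_bits(unsigned n) {
    int count = 0;
    while (n) {
        n &= n - 1;  // clears the rightmost set bit
        ++count;
    }
    return count;
}
```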
if you're using C++ another option is to use template metaprogramming:
usage would be:
you could of course further expand this template to use different types (even auto-detecting bit size) but I've kept it simple for clarity.
edit: forgot to mention this is good because it should work in any C++ compiler and it basically just unrolls your loop for you if a constant value is used for the bit count (in other words, I'm pretty sure it's the fastest general method you'll find)
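The template didn't survive the copy; one way such a template might look (names and structure are my guesses, not the original):

```cpp
// Recursive template: with a constant BITS the compiler unrolls the whole loop
template<int BITS>
struct BitCount {
    template<typename T>
    static int count(T value) {
        return (value & 1) + BitCount<BITS - 1>::count(value >> 1);
    }
};

template<>
struct BitCount<0> {
    template<typename T>
    static int count(T) { return 0; }
};
```

Usage would then be something like `BitCount<32>::count(x)`.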
Let me explain this algorithm.
This algorithm is based on the divide-and-conquer paradigm. Suppose there is an 8-bit integer, 213 (11010101 in binary); the algorithm works like this (each time merging two neighboring blocks):
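The merge steps can be traced on an 8-bit value; a sketch assuming the usual mask-and-add formulation. For 213 the intermediate values are 0b10010101 (pair counts 2,1,1,1), then 0b00110010 (nibble counts 3,2), then 5:

```cpp
#include <cstdint>

// Divide-and-conquer popcount on one byte, merging neighbor blocks each step
int popcount8(uint8_t b) {
    unsigned x = b;
    x = (x & 0x55) + ((x >> 1) & 0x55);  // 11 01 01 01 -> 10 01 01 01
    x = (x & 0x33) + ((x >> 2) & 0x33);  // -> 0011 0010
    x = (x & 0x0F) + ((x >> 4) & 0x0F);  // -> 00000101 = 5
    return static_cast<int>(x);
}
```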
From Hacker's Delight, p. 66, Figure 5-2
Executes in ~20-ish instructions (arch dependent), no branching.
Hacker's Delight is delightful! Highly recommended.
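The figure itself isn't reproduced here; as I recall it, the branch-free routine is the shift-and-add version (check the book for the exact figure):

```cpp
#include <cstdint>

// Branch-free popcount in the style of Hacker's Delight Fig. 5-2
// (reproduced from memory)
int pop(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x0000003F;
}
```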
What do you mean by "best algorithm"? The shortest code or the fastest code? Your code looks very elegant and has a constant execution time. The code is also very short.
But if speed is the major factor, and not the code size, then I think the following can be faster:
I think this will not be faster for a 64-bit value, but a 32-bit value can be faster.
I use the below code which is more intuitive.
Logic : n & (n-1) resets the last set bit of n.
P.S.: I know this is not an O(1) solution, albeit an interesting one.
A few open questions:
We can modify the algorithm to support negative numbers as follows:
Now, to overcome the second problem, we can write the algorithm like:
For a complete reference see:
http://goursaha.freeoda.com/Miscellaneous/IntegerBitCount.html
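The referenced snippets aren't shown here; handling a negative input usually just means counting on its unsigned (two's-complement) bit pattern, e.g. this sketch (not the linked code):

```cpp
// Count bits of a possibly negative int by viewing its two's-complement bits
int count_set_bits_signed(int n) {
    unsigned u = static_cast<unsigned>(n);  // well-defined conversion
    int count = 0;
    while (u) {
        u &= u - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}
```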
The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp. 11-12 (although there was a brief reference in Volume 2, 4.6.3-(7)).
The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960) Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., have a small number of ones) and one for the opposite case.
It's not the fastest or best solution, but I found the same question in my way, and I started to think and think. Finally I realized that it can be done like this if you approach the problem from the mathematical side and draw a graph; then you find that it's a function which has some periodic part, and then you realize the difference between the periods... so here you go:
I think Brian Kernighan's method will be useful too...
It goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.
The Hacker's Delight bit-twiddling becomes so much clearer when you write out the bit patterns.
The first step adds the even bits to the odd bits, producing a sum of bits in each two. The other steps add high-order chunks to low-order chunks, doubling the chunk size all the way up, until we have the final count taking up the entire int.
For a happy medium between a 2^32 lookup table and iterating through each bit individually:
From http://ctips.pbwiki.com/CountBits
This can be done in
O(k)
, where k
is the number of bits set.
Why not iteratively divide by 2?
I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue though that "best" should have an element of clarity
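The loop being described is presumably:

```cpp
// Count bits by repeatedly taking the low bit and halving
int bit_count(unsigned n) {
    int count = 0;
    while (n) {
        count += n % 2;  // low bit
        n /= 2;          // shift right by one
    }
    return count;
}
```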
This is one of those questions where it helps to know your micro-architecture. I just timed two variants under gcc 4.3.3 compiled with -O3 using C++ inlines to eliminate function call overhead, one billion iterations, keeping the running sum of all counts to ensure the compiler doesn't remove anything important, using rdtsc for timing (clock cycle precise).
The unmodified Hacker's Delight took 12.2 gigacycles. My parallel version (counting twice as many bits) runs in 13.0 gigacycles. 10.5s total elapsed for both together on a 2.4GHz Core Duo. 25 gigacycles = just over 10 seconds at this clock frequency, so I'm confident my timings are right.
This has to do with instruction dependency chains, which are very bad for this algorithm. I could nearly double the speed again by using a pair of 64-bit registers. In fact, if I was clever and added x+y a little sooner I could shave off some shifts. The 64-bit version with some small tweaks would come out about even, but count twice as many bits again.
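The code for the parallel version isn't preserved; a hypothetical reconstruction of the two-words-at-once idea, merging the two SWAR pipelines before the final reduction (my sketch, not the original):

```cpp
#include <cstdint>

// Count the set bits of two 64-bit words in one pass; the interleaved
// statements give the CPU two independent dependency chains
uint64_t pop2(uint64_t x, uint64_t y) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    y = y - ((y >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    y = (y & 0x3333333333333333ULL) + ((y >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    y = (y + (y >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    x = x + y;             // per-byte sums, max 16 per byte: no overflow
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x + (x >> 32);
    return x & 0xFF;       // total is at most 128
}
```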
With 128 bit SIMD registers, yet another factor of two, and the SSE instruction sets often have clever short-cuts, too.
There's no reason for the code to be especially transparent. The interface is simple, the algorithm can be referenced on-line in many places, and it's amenable to comprehensive unit test. The programmer who stumbles upon it might even learn something. These bit operations are extremely natural at the machine level.
OK, I decided to bench the tweaked 64-bit version. For this one sizeof(unsigned long) == 8
That looks about right (I'm not testing carefully, though). Now the timings come out at 10.70 gigacycles / 14.1 gigacycles. That later number summed 128 billion bits and corresponds to 5.9s elapsed on this machine. The non-parallel version speeds up a tiny bit because I'm running in 64-bit mode and it likes 64-bit registers slightly better than 32-bit registers.
Let's see if there's a bit more OOO pipelining to be had here. This was a bit more involved, so I actually tested a bit. Each term alone sums to 64, all combined sum to 256.
I was excited for a moment, but it turns out gcc is playing inline tricks with -O3 even though I'm not using the inline keyword in some tests. When I let gcc play tricks, a billion calls to pop4() takes 12.56 gigacycles, but I determined it was folding arguments as constant expressions. A more realistic number appears to be 19.6gc for another 30% speed-up. My test loop now looks like this, making sure each argument is different enough to stop gcc from playing tricks.
256 billion bits summed in 8.17s elapsed. Works out to 1.02s for 32 million bits as benchmarked in the 16-bit table lookup. Can't compare directly, because the other bench doesn't give a clock speed, but looks like I've slapped the snot out of the 64KB table edition, which is a tragic use of L1 cache in the first place.
Update: decided to do the obvious and create pop6() by adding four more duplicated lines. Came out to 22.8gc, 384 billion bits summed in 9.5s elapsed. So there's another 20%. Now at 800 ms for 32 billion bits.
If you happen to be using Java, the built-in method
Integer.bitCount
will do that.
I got bored, and timed a billion iterations of three approaches. Compiler is gcc -O3. CPU is whatever they put in the 1st gen Macbook Pro.
Fastest is the following, at 3.7 seconds:
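The winning code isn't preserved; from the description it looked up two 16-bit halfwords in a precomputed 64 KB table, along these lines (my sketch):

```cpp
#include <cstdint>

// 64K-entry table: popcount of every 16-bit value (the "2 halfwords" lookup)
static uint8_t g_table16[1 << 16];

void init_table16() {
    for (unsigned i = 1; i < (1u << 16); ++i)
        g_table16[i] = g_table16[i >> 1] + (i & 1);
}

int popcount_halfwords(uint32_t v) {
    return g_table16[v & 0xFFFF] + g_table16[v >> 16];
}
```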
Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.
Third place goes to the bit-twiddling 'sideways addition' approach, which took 8.6 seconds.
Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.
The counting one-bit-at-a-time approach was waaaay slower, and I got bored of waiting for it to complete.
So if you care about performance above all else then use the first approach. If you care, but not enough to spend 64Kb of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.
It's hard to think of a situation where you'd want to use the bit-twiddling approach.
Edit: Similar results here.
I think the fastest way—without using lookup tables and popcount—is the following. It counts the set bits with just 12 operations.
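The 12-operation sequence that the explanation below walks through is the standard SWAR formulation:

```cpp
#include <cstdint>

// 12 operations: pair sums, nibble sums, byte sums, then one multiply
int number_of_set_bits(uint32_t i) {
    i = i - ((i >> 1) & 0x55555555);                 // counts per 2 bits
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // counts per 4 bits
    i = (i + (i >> 4)) & 0x0F0F0F0F;                 // counts per byte
    return (i * 0x01010101) >> 24;                   // sum bytes into top byte
}
```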
It works because you can count the total number of set bits by dividing the word into two halves, counting the number of set bits in both halves, and then adding them up. This is also known as the Divide and Conquer paradigm. Let's get into detail...
The count of set bits in a two-bit field can be 0b00, 0b01 or 0b10. Let's try to work this out on 2 bits...
This is what was required: the last column shows the count of set bits in every two-bit pair. If the two-bit number is >= 2 (0b10), then the and produces 0b01, else it produces 0b00.
This statement should be easy to understand. After the first operation we have the count of set bits in every two bits; now we sum up that count in every 4 bits.
We then sum up the above result, giving us the total count of set bits in 4 bits. The last statement is the most tricky.
Let's break it down further...
It's similar to the second statement; we are counting the set bits in groups of 4 instead. We know—because of our previous operations—that every nibble holds the count of set bits in it. Let's look at an example. Suppose we have the byte 0b01000010: the first nibble holds the count 4 and the second one holds the count 2. Now we add those nibbles together.
That gives us the count of set bits in each byte, in the second nibble of 0b01000110, and therefore we mask out the high four bits of every byte in the number (discarding them).
Now every byte holds the count of set bits in it. We need to add them all together. The trick is to multiply the result by 0x01010101, which has an interesting property: if our number has four bytes A B C D, it will result in a new number with these bytes A+B+C+D, B+C+D, C+D, D. A 4-byte number can have a maximum of 32 bits set, which can be represented as 0b00100000.
All we need now is the first byte, which holds the sum of all the set bits in all the bytes, and we get it by >> 24. This algorithm was designed for 32-bit words but can be easily modified for 64-bit
words.在我看来,“最好”的解决方案是可以被另一位程序员(或两年后的原始程序员)阅读而无需大量注释的解决方案。 您可能很想要最快或最聪明的解决方案,有些已经提供了,但我更喜欢可读性而不是聪明。
如果您想要更快的速度(假设您很好地记录了它以帮助您的继任者),您可以使用表查找:
这些依赖于特定的数据类型大小,因此它们不那么可移植。 但是,由于许多性能优化无论如何都不可移植,因此这可能不是问题。 如果您想要可移植性,我会坚持使用可读的解决方案。
In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution which some have already provided but I prefer readability over cleverness any time.
If you want more speed (and assuming you document it well to help out your successors), you could use a table lookup:
These rely on specific data type sizes so they're not that portable. But, since many performance optimisations aren't portable anyway, that may not be an issue. If you want portability, I'd stick to the readable solution.
Some languages portably expose the operation in a way that can use efficient hardware support if available, otherwise some library fallback that's hopefully decent.
For example (from a table by language):
C++: std::bitset<>::count(), or C++20 std::popcount(T x)
Java: java.lang.Integer.bitCount() (also for Long or BigInteger)
C#: System.Numerics.BitOperations.PopCount()
Python: int.bit_count() (since 3.10)
Not all compilers / libraries actually manage to use HW support when it's available, though. (Notably MSVC, even with options that make std::popcount inline as x86 popcnt, its std::bitset::count still always uses a lookup table. This will hopefully change in future versions.)
Also consider the built-in functions of your compiler when the portable language doesn't have this basic bit operation. In GNU C for example:
In the worst case (no single-instruction HW support) the compiler will generate a call to a function (which in current GCC uses a shift/and bit-hack like this answer, at least for x86). In the best case the compiler will emit a cpu instruction to do the job. (Just like a
*
or /
operator - GCC will use a hardware multiply or divide instruction if available, otherwise will call a libgcc helper function.) Or even better, if the operand is a compile-time constant after inlining, it can do constant-propagation to get a compile-time-constant popcount result.
The GCC builtins even work across multiple platforms. Popcount has almost become mainstream in the x86 architecture, so it makes sense to start using the builtin now so you can recompile to let it inline a hardware instruction when you compile with
-mpopcnt
or something that includes that (example). Other architectures have had popcount for years, but in the x86 world there are still some ancient Core 2 and similar vintage AMD CPUs in use.
On x86, you can tell the compiler that it can assume support for the
popcnt
instruction with -mpopcnt
(also implied by -msse4.2
). See GCC x86 options.
-march=nehalem -mtune=skylake
(or -march=
whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.
To make binaries optimized for the machine you build them on, use
-march=native
(with gcc, clang, or ICC).
MSVC provides an intrinsic for the x86
popcnt
instruction, but unlike gcc it's really an intrinsic for the hardware instruction and requires hardware support.
Using
std::bitset<>::count()
instead of a built-in
In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++
std::bitset<>
. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.
For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a
std::bitset
that takes advantage of it when available. For example, MSVC has no way to enable popcnt
support at compile time, and its std::bitset<>::count
always uses a table lookup, even with /Ox /arch:AVX
(which implies SSE4.2, which in turn implies the popcnt feature). (Update: see below; that does get MSVC's C++20 std::popcount
to use x86 popcnt
, but still not its bitset<>::count. MSVC could fix that by updating their standard library headers to use std::popcount when available.)
But at least you get something portable that works everywhere, and with gcc/clang with the right target options, you get hardware popcount for architectures that support it.
See asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.
x86-64
gcc -O3 -std=gnu++11 -mpopcnt
emits this:
PowerPC64
gcc -O3 -std=gnu++11
emits (for the int arg version):
This source isn't x86-specific or GNU-specific at all, but only compiles well with gcc/clang/icc, at least when targeting x86 (including x86-64).
Also note that gcc's fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.
C++20 has
std::popcount(T)
Current libstdc++ headers unfortunately define it with a special case
if(x==0) return 0;
at the start, which clang doesn't optimize away when compiling for x86:
clang 11.0.1
-O3 -std=gnu++20 -march=nehalem
Live demoBut GCC compiles nicely:
Even MSVC does well with it, as long as you use
-arch:AVX
or later (and enable C++20 with -std:c++latest
). Live demo
This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.
Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Instructions like x86's
popcnt
(on CPUs where it's supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed - hardware popcount is normally fast if it exists at all).
The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.
Your compiler may know how to do something that's good for the specific CPU you're compiling for, e.g. C++20
std::popcount()
, or C++ std::bitset<32>::count()
, as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler's choice of fallback for target CPUs that don't have hardware popcnt might not be optimal for your use-case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.
Portable algorithms that don't need (or benefit from) any HW support
A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want popcount for a contiguous range of numbers, only the low byte is changing for groups of 256 numbers, making this very good.
If you know that your bytes will be mostly 0's or mostly 1's then there are efficient algorithms for these scenarios, e.g. clearing the lowest set bit with a bithack in a loop until the value becomes zero.
I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):
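The code block didn't survive the copy; the algorithm, as described and as it is widely published, is (here in C++ with uint32_t):

```cpp
#include <cstdint>

// 'Variable-precision SWAR' popcount; each step is explained in the text below
uint32_t numberOfSetBits(uint32_t i) {
    i = i - ((i >> 1) & 0x55555555);                 // 16x 2-bit accumulators
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // 8x 4-bit accumulators
    i = (i + (i >> 4)) & 0x0F0F0F0F;                 // 4x 8-bit accumulators
    return (i * 0x01010101) >> 24;                   // sum bytes into top byte
}
```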
GCC10 and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds. (Godbolt)
For JavaScript: coerce to integer with
|0
for performance: change the first line toi = (i|0) - ((i >> 1) & 0x55555555);
This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs where all integer operations including multiply are constant-time. It doesn't get any faster with "simple" inputs, but it's still pretty decent.)
References:
How this SWAR bithack works:
The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). Like
(i & 0x55555555) + ((i>>1) & 0x55555555)
.The next step takes the odd/even eight of those 16x 2-bit accumulators and adds again, producing 8x 4-bit sums. The
i - ...
optimization isn't possible this time, so it just masks before / after shifting. Using the same 0x33...
constant both times instead of 0xccc...
before shifting is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.The final shift-and-add step of
(i + (i >> 4)) & 0x0F0F0F0F
widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4
, if all 4 bits of the corresponding input bits were set. 4+4 = 8, which still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4)
. So far this is just fairly normal SIMD using SWAR techniques with a few clever optimizations. Continuing on with the same pattern for 2 more steps can widen to 2x 16-bit then 1x 32-bit counts. But there is a more efficient way on machines with fast hardware multiply:
Once we have few enough "elements", a multiply with a magic constant can sum all the elements into the top element. In this case byte elements. Multiply is done by left-shifting and adding, so a multiply of
x * 0x01010101
results in x + (x<<8) + (x<<16) + (x<<24)
. Our 8-bit elements are wide enough (and holding small enough counts) that this doesn't produce a carry into the top 8 bits.
A 64-bit version of this can do 8x 8-bit elements in a 64-bit integer with a 0x0101010101010101 multiplier, and extract the high byte with
>>56
. So it doesn't take any extra steps, just wider constants. This is what GCC uses for __builtin_popcountll
on x86 systems when the hardware popcnt
instruction isn't enabled. If you can use builtins or intrinsics for this, do so to give the compiler a chance to do target-specific optimizations.
With full SIMD for wider vectors (e.g. counting a whole array)
This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)
However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).
On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3
PSHUFB
bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel. (On CPUs with AVX-512, vpternlogd
makes Harley-Seal accumulation very good.)