查找位数组中设置的最高有效位(最左边)
我有一个位数组实现,其中第 0 个索引是数组中第一个字节的 MSB,第 8 个索引是第二个字节的 MSB,等等...
找到在此设置的第一个位的快速方法是什么位数组?我查找过的所有相关解决方案都找到了第一个最低有效位,但我需要第一个最高有效位。因此,给定 0x00A1,我想要 8(因为它是左起第 9 位)。
I have a bit array implementation where the 0th index is the MSB of the first byte in an array, the 8th index is the MSB of the second byte, etc...
What's a fast way to find the first bit that is set in this bit array? All the related solutions I have looked up find the first least significant bit, but I need the first most significant one. So, given 0x00A1, I want 8 (since it's the 9th bit from the left).
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(17)
GCC 有
__builtin_clz
转换为 x86/x64 上的 BSR、ARM 上的 CLZ 等,并在硬件未实现时模拟该指令。Visual C++ 2005 及更高版本具有
_BitScanReverse
。
GCC has
__builtin_clz
that translates to BSR on x86/x64, CLZ on ARM, etc. and emulates the instruction if the hardware does not implement it.Visual C++ 2005 and up has
_BitScanReverse
.tl:博士;对于 32 位,请使用 de Bruijn 乘法。
这是 “最快” 可移植算法。它比该线程中的所有其他可移植 32 位 MSB 算法更快、更正确。
当输入为零时,de Bruijn 算法也会返回正确的结果。 __builtin_clz 和 _BitScanReverse 指令返回不正确的结果输入为零。
在 Windows x86-64 上,de Bruijn 乘法的运行速度与等效(有缺陷的)Windows 函数相当,性能差异仅为 3% 左右。
这是代码。
该线程中的所有其他答案要么运行得比作者建议的要差得多,要么没有正确计算结果,或者两者兼而有之。让我们对它们进行基准测试,并验证它们是否按照它们声称的那样进行。
这是一个简单的 C++11 工具来测试所有这些实现。它在 Visual Studio 上编译干净,但应该适用于所有现代编译器。它允许您在性能模式 (bVerifyResults = false) 和检查模式 (bVerifyResults = true) 下运行基准测试。
以下是验证模式下的结果:
当输入为零时,“性能迷”和 Microsoft 本机实现会执行不同的操作。 msbPerformanceJunkie32 产生 -1,微软的 _BitScanReverse 产生一个随机数,与底层硬件指令一致。此外,msbPerformanceJunkie32 实现产生的结果与所有其他答案相差一。
以下是在我的 i7-4600 笔记本电脑上运行、在发布模式下编译的性能模式下的结果:
de Bruijn 版本完美地击败了其他实现,因为它是无分支的,因此它可以很好地应对以下输入:产生一组均匀分布的输出。由于现代 CPU 上分支错误预测的惩罚,所有其他版本对于任意输入都较慢。 smbFfs 函数会产生不正确的结果,因此可以忽略。
有些实现适用于 32 位输入,有些实现适用于 64 位输入。模板将帮助我们比较苹果与苹果,无论输入大小如何。
这是代码。如果您愿意,请自行下载并运行基准测试。
tl:dr; For 32 bits, use de Bruijn multiplication.
It's the "fastest" portable algorithm. It is substantially faster and more correct than all the other portable 32-bit MSB algorithms in this thread.
The de Bruijn algorithm also returns a correct result when the input is zero. The __builtin_clz and _BitScanReverse instructions return incorrect results when the input is zero.
On Windows x86-64, de Bruijn multiplication runs at a speed comparable to the equivalent (flawed) Windows function, with a performance difference of only around 3%.
Here's the code.
All the other answers in this thread either run much more poorly than their authors suggest, or don't calculate the result correctly, or both. Let's benchmark them all, and let's verify that they do what they claim to do.
Here's a simple C++11 harness to test all these implementations. It compiles clean on Visual Studio but should work on all modern compilers. It allows you to run the benchmark in performance mode (bVerifyResults = false) and in checking mode (bVerifyResults = true).
Here are the results in verification mode:
The "performance junkie" and the Microsoft native implementations do different things when the input is zero. msbPerformanceJunkie32 produces -1, and Microsoft's _BitScanReverse produces a random number, consistent with the underlying hardware instruction. Also the msbPerformanceJunkie32 implementation produces a result that is off by one from all the other answers.
Here are the results in performance mode, running on my i7-4600 laptop, compiled in release mode:
The de Bruijn version beats the other implementations soundly because it is branchless, and therefore it runs well against inputs that produce an evenly distributed set of outputs. All the other versions are slower against arbitrary inputs because of the penalties of branch misprediction on modern CPUs. The smbFfs function produces incorrect results so it can be ignored.
Some of the implementations work on 32 bit inputs, and some work on 64 bit inputs. A template will help us compare apples to apples, regardless of the input size.
Here's the code. Download and run the benchmarks yourself if you like.
作为一个性能迷,我尝试了很多 MSB 集的变体,以下是我遇到的最快的,
As a performance junkie I have tried a ton of variations for MSB set, the following is the fastest I have come across,
有多种方法可以做到这一点,并且不同实现的相对性能在某种程度上取决于机器(我碰巧出于类似目的在某种程度上对此进行了基准测试)。在某些机器上甚至有一个内置指令(如果可用并且可以处理可移植性,请使用一个指令)。
在此处查看一些实现(在“以 2 为底的整数对数”下)。如果您使用的是 GCC,请查看函数
__builtin_clz
和__builtin_clzl
(它们分别对非零无符号整数和无符号长整数执行此操作)。 “clz”代表“计算前导零”,这是描述同一问题的另一种方式。当然,如果您的位数组不适合合适的机器字,则需要迭代数组中的字以找到第一个非零字,然后仅对该字执行此计算。
There are multiple ways to do this, and the relative performance of the different implementations is somewhat machine-dependent (I happen to have benchmarked this to some extent for a similar purpose). On some machines there's even a built-in instruction for this (use one if available and portability can be dealt with).
Check out some implementations here (under “integer log base 2”). If you are using GCC, check out the functions
__builtin_clz
and__builtin_clzl
(which do this for non-zero unsigned ints and unsigned longs, respectively). The “clz” stands for “count leading zeros”, which is yet another way to describe the same problem.Of course, if your bit array does not fit into a suitable machine word, you need to iterate over words in the array to find the first non-zero word and then perform this calculation only on that word.
查找 BSR(位扫描反向)x86 asm 指令以获得执行此操作的最快方法。来自英特尔的文档:
在源操作数(第二个操作数)中搜索最高有效设置位(1 位)。
如果找到最高有效 1 位,则其位索引存储在目标操作数中
(第一个操作数)。
Look up the BSR (Bit scan reverse) x86 asm instruction for the fastest way to do this. From Intel's doc:
Searches the source operand (second operand) for the most significant set bit (1 bit).
If a most significant 1 bit is found, its bit index is stored in the destination operand
(first operand).
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious< /a>
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
我使用了许多函数来获取最高有效位,但在 32 和 64 位数字之间移动或在 x86_64 和 x86 框之间移动时通常会出现问题。函数
__builtin_clz
、__builtin_clzl
和__builtin_clzll
适用于 32/64 位数字以及跨 x86_64 和 x86 机器。然而,需要三个功能。我发现了一个简单的 MSB,它依赖于右移,可以处理所有正数情况。至少就我对它的使用而言,它在其他人失败的地方取得了成功:通过将输入指定为
unsigned long long
,它可以处理从unsigned char
到unsigned char
的所有数字类。 code>unsigned long long 并给出标准定义,它在 x86_64 和 x86 版本之间兼容。0
的大小写被定义为返回0
,但可以根据需要进行更改。一个简单的测试和输出是:输出:
注意:出于速度考虑,使用以
__builtin_clzll
为中心的单个函数来完成相同的事情仍然快大约 6 倍。I have worked with a number of functions to get the most significant bit, but problems generally arise moving between 32 and 64 bit numbers or moving between x86_64 and x86 boxes. The functions
__builtin_clz
,__builtin_clzl
and__builtin_clzll
work well for 32/64 bit numbers and across x86_64 and x86 machines. However, three functions are required. I have found a simple MSB that relies on right-shift that will handle all cases for positive numbers. At least for the use I make of it, it has succeeded where others have failed:By designating input as
unsigned long long
it can handle all number classes fromunsigned char
tounsigned long long
and given the standard definition, it is compatible across x86_64 and x86 builds. The case for0
is defined to return0
, but can be changed as required. A simple test and output are:Output:
NOTE: for speed considerations, using a single function to accomplish the same thing centered around
__builtin_clzll
is still faster by a factor of about 6.x86 有一个 BSR 指令,它返回一个位索引(而不是其上方的前导零的数量)。
但不幸的是,没有可移植的内在函数可以有效为所有编译器公开它。 GNU C 提供了 __builtin_clz,但是
unsigned bitidx = 31 - __builtin_clz(x);
不会在当前的 GCC 和 ICC 中优化回 BSR。 (它与 clang 一起使用,这证明了表达式是等价的,所以它可以)。下面定义了
BSR32()
和BSR64()
宏或函数,它们可以有效地编译为bsr
指令x86。 (如果输入为零,则产生垃圾结果。内部函数无法利用 asm 指令的行为,即在输入 = 0 时不修改目标。)向非 x86 的可移植性需要一些额外的
#ifdef
例如回退到31-__builtin_clz
。大多数非 x86 ISA,如果它们有前导零位扫描,则会计算前导零而不是提供位索引。这就是 GNU C 将 __builtin_clz 定义为可移植内置函数的原因。 (如果目标系统上没有硬件支持,内置函数将编译为软件模拟,通常调用 libgcc 辅助函数。)bsf
可能不需要编译器那么多的帮助,因为内置函数匹配asm 指令返回 LSB 位索引的行为,即尾随零的计数。测试调用者
unsigned test32(unsigned x) { return BSR32(x); }
inlines it to 1 instruction on all the major x86 compilers, 在 Godbolt 编译器浏览器上。 BSR64 以相同的方式内联到 64 位操作数大小的版本。另请参阅是否有 x86/x86_64 指令将最高有效位以下的所有位清零? 例如用例。这样做的目的是避免可移植(到非 MSVC)版本中的代码缓慢:
如果没有
-march=haswell
,我们只能从 clang 获得 BSR,但是:这只是可恶的。 (有趣的是,如果输入为零,ICC 会执行 CMOV 来生成
-1
。BSR 根据其输入设置 ZF,这与大多数根据结果。)使用
-march=haswell
(或以其他方式启用 BMI1 指令),它没有那么糟糕,但仍然不如 BSR 好。模输出依赖项,编译器在 lzcnt 中主要是为了避免这种依赖项,但奇怪的是 BSR 却没有。 (由于 input=0 行为,输出依赖项是 true 依赖项。)为什么打破 LZCNT 的“输出依赖”很重要?x86 has a BSR instruction that returns a bit-index (rather than the count of leading zeros above it).
But unfortunately there's no portable intrinsic that efficiently exposes it for all compilers. GNU C provides
__builtin_clz
, butunsigned bitidx = 31 - __builtin_clz(x);
doesn't optimize back to just BSR with current GCC and ICC. (It does with clang, which proves that the expression is equivalent so it could).The following defines
BSR32()
andBSR64()
macros or functions that compile efficiently to just absr
instruction on x86. (Producing a garbage result if the input was zero. There's no way with intrinsics to take advantage of the asm instruction's behaviour of leaving the destination unmodified for input=0.)Portability to non-x86 would take some additional
#ifdef
e.g. to fall back to31-__builtin_clz
. Most non-x86 ISAs, if they have a leading-zero bitscan at all, count leading zeros instead of giving you the bit-index. That's why GNU C defines__builtin_clz
as the portable builtin. (If there's no HW support on the target system, the builtin will compile to software emulation, usually calling a libgcc helper function.)bsf
probably doesn't need as much help for compilers, because the builtin matches the asm instruction's behaviour of returning the bit-index of the LSB, i.e. the count of trailing zeros.A test caller
unsigned test32(unsigned x) { return BSR32(x); }
inlines it to 1 instruction on all the major x86 compilers, on the Godbolt compiler explorer. BSR64 inlines the same way, to a 64-bit operand-size version. See also Is there an x86/x86_64 instruction which zeros all bits below the Most Significant Bit? for example use-cases.The point of this is to avoid slow code from the portable (to non-MSVC) version:
Without
-march=haswell
we get just BSR from clang, but:That's just nasty. (Interesting to see that ICC is doing a CMOV to produce
-1
if the input is zero. BSR sets ZF according to its input, unlike most instructions which set flags according to the result.)With
-march=haswell
(or otherwise enabling use of BMI1 instructions), it's not as bad, but still not as good as just BSR. Modulo output dependencies, which compilers mostly work to avoid for lzcnt but strangely not for BSR. (Where the output dependency is a true dependency, because of the input=0 behaviour.) Why does breaking the "output dependency" of LZCNT matter?如果您使用的是 x86,您可以使用 SSE2 操作结合查找第一位指令(在 gcc 世界中)发音为“ffs”来击败几乎任何逐字节或逐字解决方案” 代表最低位,“fls” 代表最高位。
请原谅我在回答中格式化“C”代码时遇到麻烦(!@#$%^);查看:
http://mischasan.wordpress。 com/2011/11/03/sse2-bit-trick-ffsfls-for-xmm-registers/
If you're using x86, you can beat practically any byte-by-byte or word-by-word solution using the SSE2 operations, combined with the find-first-bit instructions, which (in the gcc world) are pronounced "ffs" for the lowest bit and "fls" for the highest bit.
Pardon me for having trouble (!@#$%^) formatting "C" code in an answer; check out:
http://mischasan.wordpress.com/2011/11/03/sse2-bit-trick-ffsfls-for-xmm-registers/
我知道在纯 C 中执行此操作的两种最佳方法:
首先线性搜索字节/字数组以查找第一个非零的字节/字,然后对找到的字节/字进行展开的二进制搜索。
3(顺便说一句,即 log2(8))条件跳转以获得答案。在现代 x86 机器上,最后一个将被优化为条件 mov。
或者,使用查找表将字节映射到所设置的第一位的索引。
您可能想要查找的相关主题是整数 log2 函数。如果我记得的话,ffmpeg 有一个很好的实现。
编辑:您实际上可以将上述二分搜索变成无分支二分搜索,但我不确定在这种情况下它是否会更有效......
Two best ways I know to do this in pure C:
First linear-search the byte/word array to find the first byte/word that's nonzero, then do an unrolled binary-search of the byte/word you find.
3 (BTW that's log2(8)) conditional jumps to get the answer. On modern x86 machines the last one will be optimized to a conditional mov.
Alternatively, use a lookup table to map the byte to the index of the first bit that's set.
A related topic you might want to look up is integer log2 functions. If I recall, ffmpeg has a nice implementation.
Edit: You can actually make the above binary search into a branchless binary search, but I'm not sure if it would be more efficient in this case...
不是最快的,但它有效......
Not the fastest, but it works...
这是解释 __builtin_clz() 的代码片段
Here's a code snippet explaining __builtin_clz()
我来补充一张!
当然,这适用于 64 位数字(unsigned long long),而不是数组。另外,很多人指出了我不知道的内置 g++ 函数。多么有趣啊。
无论如何,这会在 6 次迭代中找到最高有效位,并且如果将 0 传递给函数,则会给出断言。如果您可以访问芯片组的指令,则不是最好使用的功能。
我还使用 |= 而不是 +=,因为它们始终是 2 的幂,并且 OR(通常)比加法更快。因为我只是将 2 的独特幂加在一起,所以我从来没有翻转过。
这是二分查找,这意味着它总是在 6 次迭代中找到结果。
再说一遍,这样更好:
I'll add one!
Of course, this is working on a 64 bit number (unsigned long long), and not an array. Also, plenty of people have pointed to inbuilt g++ functions I was not aware of. How interesting.
Anyhow, this finds the most significant bit in 6 iterations and gives an assert if you passed 0 to the function. Not the best function to use if you have access to an instruction of the chipset.
I also am also using |= instead of += because these are always powers of two, and OR is (classically) faster than addition. Since I'm only adding unique powers of 2 together, I never have roll over.
This is a binary search which means it always finds the result in 6 iterations.
Again, this is better:
这是一个针对任意大小的字节数组的简单强力算法:
我将把它作为练习,让读者提出适当的
msb()
函数以及优化处理int
或long long
大小的数据缝隙。Here's a simple, brute force algorithm for an arbitrary-sized array of bytes:
I'll leave it as a an exercise for the reader to come up with an appropriate
msb()
function as well as the optimization to work onint
orlong long
sized chinks of data.嗯,您的标签指示 32 位,但看起来您使用的值是 16 位。如果你的意思是 32 位,那么我认为 0x00a1 的答案应该是 24 而不是 8。
假设你正在从左侧查找 MSB 位索引,并且你知道你只会处理 uint32_t,这里是显而易见、简单的算法:
Um, your tag indicates 32bit but it looks like the values that you're using are 16 bit. If you did mean 32 bit, then I think the answer for 0x00a1 ought to be 24 and not 8.
Assuming that you are looking for the MSB bit index from the left hand side and you know that you will only be dealing with uint32_t's, here's the obvious, simple-minded algorithm:
对于java我使用这个:
并且:
For java I use this:
And: