检查字符数组是否为零的快速方法
我在内存中有一个字节数组。查看数组中的所有字节是否为零的最快方法是什么?
I have an array of bytes, in memory. What's the fastest way to see if all the bytes in the array are zero?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
如今,缺乏使用SIMD扩展(例如SSE(x86 处理器上)),您不妨迭代数组并比较每个值到 0。
在遥远的过去,为数组中的每个元素(除了循环分支本身)执行比较和条件分支被认为是昂贵的,并且取决于频率(或早)您可能期望数组中出现非零元素,您可能选择完全在循环内不使用条件,仅使用按位或来检测任何设置位并推迟实际检查直到循环完成之后:
但是,随着当今的流水线超标量处理器设计完成分支预测,所有非 SSE 方法在循环内几乎无法区分。如果有的话,从长远来看,将每个元素与零进行比较并尽早退出循环(一旦遇到第一个非零元素)可能比 sum |= array[i] 更有效 方法(始终遍历整个数组),除非您希望数组几乎总是完全由零组成(在这种情况下,使
sum |= array[i]
使用 GCC 的-funroll-loops
实现真正无分支的方法可以为您提供更好的数字 - 请参阅下面的 Athlon 处理器的数字,结果可能会因处理器型号和制造商而异 .)Nowadays, short of using SIMD extensions (such as SSE on x86 processors), you might as well iterate over the array and compare each value to 0.
In the distant past, performing a comparison and conditional branch for each element in the array (in addition to the loop branch itself) would have been deemed expensive and, depending on how often (or early) you could expect a non-zero element to appear in the array, you might have elected to completely do without conditionals inside the loop, using solely bitwise-or to detect any set bits and deferring the actual check until after the loop completes:
However, with today's pipelined super-scalar processor designs complete with branch prediction, all non-SSE approaches are virtualy indistinguishable within a loop. If anything, comparing each element to zero and breaking out of the loop early (as soon as the first non-zero element is encountered) could be, in the long run, more efficient than the
sum |= array[i]
approach (which always traverses the entire array) unless, that is, you expect your array to be almost always made up exclusively of zeroes (in which case making thesum |= array[i]
approach truly branchless by using GCC's-funroll-loops
could give you the better numbers -- see the numbers below for an Athlon processor, results may vary with processor model and manufacturer.)如果您可以使用内联汇编,这里有一个简短、快速的解决方案。
如果您不熟悉汇编,我将解释我们在这里所做的事情:我们将字符串的长度存储在寄存器中,并要求处理器扫描字符串是否为零(我们通过设置低 8 位来指定这一点)累加器的值,即
%%al
,为零),在每次迭代时减少所述寄存器的值,直到遇到非零字节。现在,如果字符串全为零,则寄存器也将为零,因为它被减少了length
次数。但是,如果遇到非零值,则检查零的“循环”会提前终止,因此寄存器将不会为零。然后我们获取该寄存器的值,并返回其布尔否定。分析得出以下结果:(
两个测试用例都在大小为 100000 的数组上运行了 100000 次。
or.exe
代码来自 Vlad 的回答。在这两种情况下都消除了函数调用。)Here's a short, quick solution, if you're okay with using inline assembly.
In case you're unfamiliar with assembly, I'll explain what we do here: we store the length of the string in a register, and ask the processor to scan the string for a zero (we specify this by setting the lower 8 bits of the accumulator, namely
%%al
, to zero), reducing the value of said register on each iteration, until a non-zero byte is encountered. Now, if the string was all zeroes, the register, too, will be zero, since it was decrementedlength
number of times. However, if a non-zero value was encountered, the "loop" that checked for zeroes terminated prematurely, and hence the register will not be zero. We then obtain the value of that register, and return its boolean negation.Profiling this yielded the following results:
(Both test cases ran 100000 times on arrays of size 100000. The
or.exe
code comes from Vlad's answer. Function calls were eliminated in both cases.)如果你想在 32 位 C 中执行此操作,可能只需将数组作为 32 位整数数组进行循环并将其与 0 进行比较,然后确保末尾的内容也是 0。
If you want to do this in 32-bit C, probably just loop over the array as a 32-bit integer array and compare it to 0, then make sure the stuff at the end is also 0.
将检查的内存分成两半,并将第一部分与第二部分进行比较。
一个。如有差异,不可能完全相同。
b.如果没有差异,则重复上半部分。
最坏情况 2*N。内存效率高且基于 memcmp。
不确定它是否应该在现实生活中使用,但我喜欢自我比较的想法。
它适用于奇数长度。你明白为什么吗? :-)
Split the checked memory half, and compare the first part to the second.
a. If any difference, it can't be all the same.
b. If no difference repeat for the first half.
Worst case 2*N. Memory efficient and memcmp based.
Not sure if it should be used in real life, but I liked the self-compare idea.
It works for odd length. Do you see why? :-)
如果数组的大小合适,现代 CPU 的限制因素将是对内存的访问。
确保使用诸如 __dcbt 或 prefetchnta 之类的缓存预取(如果您很快要再次使用缓冲区,则为 prefetch0),提前一段适当的距离(即 1-2K)。
您还需要一次对多个字节执行 SIMD 或 SWAR 等操作。即使使用 32 位字,其操作量也会比每个字符版本少 4 倍。我建议展开 or 并将它们放入 or 的“树”中。您可以在我的代码示例中明白我的意思 - 这利用了超标量功能,通过使用没有那么多中间数据依赖性的操作来并行执行两个整数操作(或)。我使用的树大小为 8(4x4,然后 2x2,然后 1x1),但您可以将其扩展到更大的数字,具体取决于 CPU 架构中拥有多少可用寄存器。
以下内部循环的伪代码示例(无序言/结尾)使用 32 位整数,但您可以使用 MMX/SSE 或任何可用的方式执行 64/128 位。如果您已将块预取到缓存中,这将相当快。此外,如果缓冲区不是 4 字节对齐,则可能需要在之前执行未对齐检查;如果缓冲区(对齐后)长度不是 32 字节的倍数,则可能需要在之后执行未对齐检查。
我实际上建议将“行”值的比较封装到单个函数中,然后通过缓存预取将其展开几次。
If the array is of any decent size, your limiting factor on a modern CPU is going to be access to the memory.
Make sure to use cache prefetching for a decent distance ahead (i.e. 1-2K) with something like __dcbt or prefetchnta (or prefetch0 if you are going to use the buffer again soon).
You will also want to do something like SIMD or SWAR to or multiple bytes at a time. Even with 32-bit words, it will be 4X less operations than a per character version. I'd recommend unrolling the or's and making them feed into a "tree" of or's. You can see what I mean in my code example - this takes advantage of superscalar capability to do two integer ops (the or's) in parallel by making use of ops that do not have as many intermediate data dependencies. I use a tree size of 8 (4x4, then 2x2, then 1x1) but you can expand that to a larger number depending on how many free registers you have in your CPU architecture.
The following pseudo-code example for the inner loop (no prolog/epilog) uses 32-bit ints but you could do 64/128-bit with MMX/SSE or whatever is available to you. This will be fairly fast if you have prefetched the block into the cache. Also you will possibly need to do unaligned check before if your buffer is not 4-byte aligned and after if your buffer (after alignment) is not a multiple of 32-bytes in length.
I would actually suggest encapsulating the compare of a "line" of values into a single function and then unrolling that a couple times with the cache prefetching.
在 ARM64 上测量了两种实现,一种使用提前返回 false 的循环,另一种对所有字节进行或运算:
结果:
所有结果(以微秒为单位):
仅 false 结果:
仅 true 结果:
总结: 仅对于错误结果概率非常小的数据集,由于省略了分支,使用 ORing 的第二种算法表现更好。否则,提前返回显然是跑赢大市的策略。
Measured two implementations on ARM64, one using a loop with early return on false, one that ORs all bytes:
Results:
All results, in microseconds:
only false results:
only true results:
Summary: only for datasets where the probability of false results is very small, the second algorithm using ORing performs better, due to the omitted branch. Otherwise, returning early is clearly the outperforming strategy.
Rusty Russel 的
memeqzero
速度非常快。它重用memcmp
来完成繁重的工作:https://github.com/rustyrussell/ccan/ blob/master/ccan/mem/mem.c#L92。
Rusty Russel's
memeqzero
is very fast. It reusesmemcmp
to do the heavy lifting:https://github.com/rustyrussell/ccan/blob/master/ccan/mem/mem.c#L92.