Fast way to check whether a character array is all zeros

Posted 2024-08-28 02:46:18

I have an array of bytes in memory. What is the fastest way to check whether all the bytes in the array are zero?

橘味果▽酱 2024-09-04 02:46:18

Nowadays, short of using SIMD extensions (such as SSE on x86 processors), you might as well iterate over the array and compare each value to 0.

In the distant past, performing a comparison and conditional branch for each element in the array (in addition to the loop branch itself) would have been deemed expensive and, depending on how often (or early) you could expect a non-zero element to appear in the array, you might have elected to completely do without conditionals inside the loop, using solely bitwise-or to detect any set bits and deferring the actual check until after the loop completes:

int sum = 0;
for (i = 0; i < ARRAY_SIZE; ++i) {
  sum |= array[i];
}
if (sum != 0) {
  printf("At least one array element is non-zero\n");
}

However, with today's pipelined superscalar processor designs, complete with branch prediction, all non-SSE approaches are virtually indistinguishable within a loop. If anything, comparing each element to zero and breaking out of the loop early (as soon as the first non-zero element is encountered) could be, in the long run, more efficient than the sum |= array[i] approach (which always traverses the entire array). The exception is when you expect your array to be almost always made up exclusively of zeroes, in which case making the sum |= array[i] approach nearly branchless with GCC's -funroll-loops could give you the better numbers. See the numbers below for an Athlon processor; results may vary with processor model and manufacturer.

#include <stdio.h>

int a[1024*1024];

/* Methods 1 & 2 are equivalent on x86 */  

int main() {
  int i, j, n;

# if defined METHOD3
  int x;
# endif

  for (i = 0; i < 100; ++i) {
#   if defined METHOD3
    x = 0;
#   endif
    for (j = 0, n = 0; j < sizeof(a)/sizeof(a[0]); ++j) {
#     if defined METHOD1
      if (a[j] != 0) { n = 1; }
#     elif defined METHOD2
      n |= (a[j] != 0);
#     elif defined METHOD3
      x |= a[j];
#     endif
    }
#   if defined METHOD3
    n = (x != 0);
#   endif

    printf("%d\n", n);
  }
}

$ uname -mp
i686 athlon
$ gcc -g -O3 -DMETHOD1 test.c
$ time ./a.out
real    0m0.376s
user    0m0.373s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD2 test.c
$ time ./a.out
real    0m0.377s
user    0m0.372s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD3 test.c
$ time ./a.out
real    0m0.376s
user    0m0.373s
sys     0m0.003s

$ gcc -g -O3 -DMETHOD1 -funroll-loops test.c
$ time ./a.out
real    0m0.351s
user    0m0.348s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD2 -funroll-loops test.c
$ time ./a.out
real    0m0.343s
user    0m0.340s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD3 -funroll-loops test.c
$ time ./a.out
real    0m0.209s
user    0m0.206s
sys     0m0.003s
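
For the SIMD route mentioned at the start of this answer, here is a minimal sketch using SSE2 intrinsics. It is not part of the measurements above; the function name all_zero_simd is made up for illustration, and the code assumes a compiler that defines __SSE2__ (GCC and Clang do on x86-64). On other targets it falls back to the plain byte loop.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Returns 1 if all n bytes at buf are zero, 0 otherwise.
   Hypothetical helper, not taken from the benchmark above. */
int all_zero_simd(const void *buf, size_t n)
{
    const unsigned char *p = buf;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    while (n >= 16) {
        /* Unaligned 16-byte load, so buf needs no special alignment. */
        __m128i v = _mm_loadu_si128((const __m128i *)p);
        /* cmpeq yields 0xFF for every byte equal to zero; movemask packs
           the top bit of each byte, so 0xFFFF means all 16 bytes matched. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) != 0xFFFF)
            return 0;
        p += 16;
        n -= 16;
    }
#endif
    /* Remaining bytes (or the whole buffer without SSE2). */
    for (; n > 0; --n, ++p) {
        if (*p != 0)
            return 0;
    }
    return 1;
}
```

This variant exits early per 16-byte chunk, so it combines the early-exit behaviour of the compare loop with wider loads.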

无法言说的痛 2024-09-04 02:46:18

Here's a short, quick solution, if you're okay with using inline assembly.

#include <stdio.h>

int main(void) {
    int checkzero(char *string, int length);
    char str1[] = "wow this is not zero!";
    char str2[] = {0, 0, 0, 0, 0, 0, 0, 0};
    printf("%d\n", checkzero(str1, sizeof(str1)));
    printf("%d\n", checkzero(str2, sizeof(str2)));
}

int checkzero(char *string, int length) {
    size_t count = (size_t)length;   /* full-width count for the (E/R)CX register */
    unsigned char all_zero;
    __asm__ (
        "cld\n"
        "xorb %%al, %%al\n"    /* AL = 0; also sets ZF, which covers length == 0 */
        "repz scasb\n"         /* scan while bytes equal AL and the count is non-zero */
        "setz %%al\n"          /* AL = 1 only if the last comparison matched */
        : "=&a" (all_zero), "+c" (count), "+D" (string)
        :
        : "cc", "memory"
    );
    return all_zero;
}

In case you're unfamiliar with assembly, I'll explain what we do here: we store the length of the string in the count register and ask the processor to scan the string for bytes equal to the lower 8 bits of the accumulator, namely %%al, which we set to zero. repz scasb decrements the count on each iteration and stops at the first non-zero byte or when the count reaches zero. The count alone cannot give us the answer, because it also reaches zero when the only non-zero byte is the very last one, so we read the zero flag instead: it is set only if the last comparison matched (or if nothing was scanned at all). setz copies that flag into %%al, and we return it.

Profiling this yielded the following results:

$ time or.exe

real    0m37.274s
user    0m0.015s
sys     0m0.000s


$ time scasb.exe

real    0m15.951s
user    0m0.000s
sys     0m0.046s

(Both test cases ran 100000 times on arrays of size 100000. The or.exe code comes from Vlad's answer. Function calls were eliminated in both cases.)
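
For readers who don't read assembly, the scan can be modelled in plain C. This is only a sketch of how the instruction behaves, not generated code; the name scan_zero_model and its parameters are invented for illustration.

```c
#include <stddef.h>

/* Rough C model of "xor al,al ; repz scasb".
   *remaining receives the leftover count (what ends up in the count
   register); the return value models the zero flag after the scan.
   Hypothetical helper for illustration only. */
int scan_zero_model(const unsigned char *p, size_t count, size_t *remaining)
{
    int zf = 1;              /* xor al,al leaves the zero flag set */
    while (count != 0) {
        zf = (*p == 0);      /* scasb: compare AL (zero) with the current byte */
        ++p;
        --count;             /* the count drops even when the byte mismatches */
        if (!zf)
            break;           /* repz stops at the first mismatch */
    }
    *remaining = count;
    return zf;
}
```

With a buffer such as {0, 0, 0, 1} and a count of 4, the model finishes with a remaining count of 0 even though the last byte is non-zero, which is why the zero flag is the more reliable signal than the count.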

深海少女心 2024-09-04 02:46:18

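If you want to do this in 32-bit C, you can probably just loop over the array as an array of 32-bit integers, compare each one to 0, and then make sure the leftover bytes at the end are also 0.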
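
A minimal sketch of that idea follows. The function name all_zero_words is made up for illustration, and memcpy is used for the 32-bit loads as an assumption on my part, so that the buffer does not have to be 4-byte aligned; compilers typically turn it into a single load.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if all n bytes at buf are zero, 0 otherwise.
   Hypothetical helper: checks 32 bits at a time, then the tail. */
int all_zero_words(const void *buf, size_t n)
{
    const unsigned char *p = buf;
    uint32_t w;

    while (n >= sizeof w) {
        memcpy(&w, p, sizeof w);   /* alignment-safe 32-bit load */
        if (w != 0)
            return 0;
        p += sizeof w;
        n -= sizeof w;
    }
    /* Up to 3 leftover bytes at the end. */
    for (; n > 0; --n, ++p) {
        if (*p != 0)
            return 0;
    }
    return 1;
}
```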

感性 2024-09-04 02:46:18

Split the checked memory in half, and compare the first part to the second.

a. If there is any difference, the bytes can't all be the same.

b. If there is no difference, repeat for the first half.

Worst case 2*N. Memory efficient and memcmp based.

Not sure if it should be used in real life, but I liked the self-compare idea.

It works for odd lengths. Do you see why? :-)

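#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Returns true if all `size` bytes at p equal chr.
bool memcheck(char* p, char chr, size_t size) {
    // An empty range trivially matches; this also avoids reading *p.
    if (size == 0)
        return true;
    // Check if first char differs from expected.
    if (*p != chr)
        return false;
    size_t near_half, far_half;
    while (size > 1) {
        near_half = size/2;
        far_half = size-near_half;
        // Compare the leading near_half bytes with the trailing near_half bytes.
        if (memcmp(p, p+far_half, near_half))
            return false;
        size = far_half;
    }
    return true;
}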
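
To see why odd lengths work, it can help to print which ranges get compared in each round. The following sketch only traces the index arithmetic of the halving scheme; trace_halving is a made-up helper and does not touch any buffer.

```c
#include <stddef.h>
#include <stdio.h>

/* Prints the byte ranges the halving scheme compares for a buffer of
   `size` bytes and returns the number of memcmp calls it would make.
   Hypothetical helper for illustration only. */
int trace_halving(size_t size)
{
    int calls = 0;
    while (size > 1) {
        size_t near_half = size / 2;
        size_t far_half = size - near_half;
        printf("compare [0,%zu) with [%zu,%zu)\n", near_half, far_half, size);
        size = far_half;   /* the middle byte of an odd range stays in play */
        ++calls;
    }
    return calls;
}
```

For a size of 7 the first round compares [0,3) with [4,7) and skips index 3, but the next round works on the first 4 bytes, so index 3 is covered there.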

面犯桃花 2024-09-04 02:46:18

If the array is of any decent size, your limiting factor on a modern CPU is going to be access to the memory.

Make sure to use cache prefetching a decent distance ahead (e.g. 1-2K) with something like __dcbt or prefetchnta (or prefetcht0 if you are going to use the buffer again soon).

You will also want to do something like SIMD or SWAR to OR multiple bytes at a time. Even with 32-bit words, that is 4x fewer operations than a per-character version. I'd recommend unrolling the ORs and making them feed into a "tree" of ORs. You can see what I mean in my code example: this takes advantage of superscalar capability to do two integer ops (the ORs) in parallel by using ops that do not have as many intermediate data dependencies. I use a tree size of 8 (4x4, then 2x2, then 1x1), but you can expand that to a larger number depending on how many free registers you have in your CPU architecture.

The following pseudo-code example for the inner loop (no prolog/epilog) uses 32-bit ints, but you could do 64/128-bit with MMX/SSE or whatever is available to you. This will be fairly fast if you have prefetched the block into the cache. You may also need an unaligned check before the loop if your buffer is not 4-byte aligned, and another after it if your buffer (after alignment) is not a multiple of 32 bytes in length.

const UINT32 *pmem = ***aligned-buffer-pointer***;

UINT32 a0 = 0, a1, a2, a3;   // a0 = 0 keeps the final test valid if the loop never runs
while(bytesremain >= 32)
{
    // Compare an aligned "line" of 32-bytes
    a0 = pmem[0] | pmem[1];
    a1 = pmem[2] | pmem[3];
    a2 = pmem[4] | pmem[5];
    a3 = pmem[6] | pmem[7];
    a0 |= a1; a2 |= a3;
    pmem += 8;
    a0 |= a2;
    bytesremain -= 32;
    if(a0 != 0) break;
}

if(a0!=0) then ***buffer-is-not-all-zeros***

I would actually suggest encapsulating the compare of a "line" of values into a single function and then unrolling that a couple times with the cache prefetching.
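
One way that suggestion could look in C is sketched below. The names or_line, all_zero_lines and PREFETCH are invented for illustration; the code assumes GCC or Clang for __builtin_prefetch (other compilers get a no-op), a 4-byte-aligned buffer, and a length that is a multiple of 32 bytes. The 1 KB prefetch distance is just the lower end of the range mentioned above, not a tuned value.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* Read prefetch with no temporal locality (similar in spirit to prefetchnta). */
#define PREFETCH(addr) __builtin_prefetch((addr), 0, 0)
#else
#define PREFETCH(addr) ((void)0)
#endif

/* OR together one 32-byte "line" as a small tree of independent ORs. */
static inline uint32_t or_line(const uint32_t *p)
{
    uint32_t a0 = p[0] | p[1];
    uint32_t a1 = p[2] | p[3];
    uint32_t a2 = p[4] | p[5];
    uint32_t a3 = p[6] | p[7];
    return (a0 | a1) | (a2 | a3);
}

/* Returns 1 if the buffer is all zero, 0 otherwise.
   Assumes words is 4-byte aligned and nbytes is a multiple of 32;
   any unaligned head or tail needs a separate byte-wise check. */
int all_zero_lines(const uint32_t *words, size_t nbytes)
{
    const size_t ahead = 1024 / sizeof(uint32_t);   /* 1 KB ahead, in words */

    for (; nbytes >= 32; nbytes -= 32, words += 8) {
        if (nbytes > 1024)            /* only prefetch addresses inside the buffer */
            PREFETCH(words + ahead);
        if (or_line(words) != 0)
            return 0;
    }
    return 1;
}
```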

柒夜笙歌凉 2024-09-04 02:46:18

I measured two implementations on ARM64: one using a loop that returns early on the first non-zero byte, and one that ORs all bytes:

int is_empty1(unsigned char * buf, int size)
{
    int i;
    for(i = 0; i < size; i++) {
        if(buf[i] != 0) return 0;
    }
    return 1;
}

int is_empty2(unsigned char * buf, int size)
{
    int sum = 0;
    for(int i = 0; i < size; i++) {
        sum |= buf[i];
    }
    return sum == 0;
}

Results:

All results, in microseconds:

        is_empty1   is_empty2
MEDIAN  0.350       3.554
AVG     1.636       3.768

only false results:

        is_empty1   is_empty2
MEDIAN  0.003       3.560
AVG     0.382       3.777

only true results:

        is_empty1   is_empty2
MEDIAN  3.649       3.528
AVG     3.857       3.751

Summary: the second algorithm, which ORs all bytes, performs better only for datasets where a false result is very unlikely, because it omits the branch inside the loop. Otherwise, returning early is clearly the better strategy.
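
A possible middle ground between the two, which I did not measure, is to OR a fixed-size block without branching and test the result once per block. The sketch below assumes a block size of 64 bytes, which is an arbitrary choice, and the name is_empty_blocked is made up for illustration.

```c
#include <stddef.h>

#define BLOCK 64   /* assumed block size, not a tuned value */

/* Returns 1 if all size bytes at buf are zero, 0 otherwise.
   Hypothetical hybrid: branch-free OR inside a block, early return between blocks. */
int is_empty_blocked(const unsigned char *buf, size_t size)
{
    size_t i = 0;

    while (size - i >= BLOCK) {
        unsigned acc = 0;
        for (size_t k = 0; k < BLOCK; k++)
            acc |= buf[i + k];       /* no data-dependent branch in here */
        if (acc != 0)
            return 0;                /* early exit, at block granularity */
        i += BLOCK;
    }
    /* Tail shorter than one block: plain early-return loop. */
    for (; i < size; i++) {
        if (buf[i] != 0)
            return 0;
    }
    return 1;
}
```

Whether this beats either of the measured versions depends on the data and the CPU, so it would need the same kind of measurement as above.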
