实施memcmp

发布于 2024-10-17 16:23:30 字数 632 浏览 3 评论 0原文

以下是 memcmp 的 Microsoft CRT 实现：

int memcmp(const void* buf1,
           const void* buf2,
           size_t count)
{
    if(!count)
        return(0);

    while(--count && *(char*)buf1 == *(char*)buf2 ) {
        buf1 = (char*)buf1 + 1;
        buf2 = (char*)buf2 + 1;
    }

    return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}

它基本上执行逐字节比较。

我的问题分为两部分：

是否有任何理由不通过 int 比较将其更改为 int，直到 count count < sizeof(int)，然后逐字节比较剩下的是什么？
如果我要做 1，是否存在任何潜在/明显的问题？

注：我根本不使用CRT，所以无论如何我都必须实现这个功能。我只是在寻找有关如何正确实施它的建议。

原文

The following is the Microsoft CRT implementation of memcmp:

int memcmp(const void* buf1,
           const void* buf2,
           size_t count)
{
    if(!count)
        return(0);

    while(--count && *(char*)buf1 == *(char*)buf2 ) {
        buf1 = (char*)buf1 + 1;
        buf2 = (char*)buf2 + 1;
    }

    return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}

It basically performs a byte by byte comparision.

My question is in two parts:

Is there any reason to not alter this to an int by int comparison until count < sizeof(int), then do a byte by byte comparision for what remains?
If I were to do 1, are there any potential/obvious problems?

Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

沉默的熊 2024-10-24 16:23:30

如果您愿意，您可以将其作为 int-by-int 比较或更广泛的数据类型来进行。

您必须注意（至少）的两件事是开始和结束处的悬垂，以及两个区域之间的对齐方式是否不同。

如果您在不遵循其对齐规则的情况下访问值，某些处理器的运行速度会变慢（如果您尝试这样做，有些处理器甚至会崩溃）。

因此，您的代码可能会先进行 char 比较，直到 int 对齐区域，然后进行 int 比较，然后进行 char 比较再次强调，两个区域的对齐可能很重要。

额外的代码复杂性是否值得您所节省的成本取决于您无法控制的许多因素。一种可能的方法是检测两个区域对齐相同的理想情况并以快速方式执行，否则只需逐个字符地执行。

回复收藏 0 原文

我早已燃尽 2024-10-24 16:23:30

您提出的优化很常见。最大的问题是，如果您尝试在不允许对单个字节以外的任何内容进行未对齐访问的处理器上运行它，或者在该模式下速度较慢； x86系列不存在这个问题。

它也更复杂，因此更有可能包含错误。

回复收藏 0 原文

病毒体 2024-10-24 16:23:30

不要忘记，当您在较大的块中发现不匹配时，您必须识别该块内的第一个不同的char，以便您可以计算正确的返回值（ memcmp() 返回第一个不同字节的差异，被视为 unsigned char 值。

回复收藏 0 原文

怪我鬧 2024-10-24 16:23:30

如果作为 int 进行比较，则需要检查对齐情况并检查 count 是否可被 sizeof(int) 整除（以将最后一个字节作为 char 进行比较）。

回复收藏 0 原文

微凉 2024-10-24 16:23:30

这真的是他们的实现吗？除了不明智地这样做之外，我还有其他问题：

抛弃常量。
该 return 语句有效吗？无符号字符 - 无符号字符 = 有符号整数？

int at a time 仅在指针对齐时才有效，或者如果您可以从每个指针的前面读取几个字节并且它们仍然对齐，因此如果在对齐边界之前两者均为 1，则您可以读取每个字符的一个字符，然后继续int-at-a-time，但如果它们的对齐方式不同，例如一个对齐，一个不对齐，则无法做到这一点。

当它们实际比较（必须走到末尾）并且数据很长时，memcmp 效率最低（即花费最长）。

我不会自己写，但如果你要比较大部分数据，你可以做一些事情，比如确保对齐，甚至填充末端，然后如果你愿意的话，一次逐字进行。

回复收藏 0 原文

萌︼了一个春 2024-10-24 16:23:30

另一个想法是优化处理器缓存和获取。处理器喜欢随机获取大块而不是单个字节。尽管内部运作可能已经考虑到了这一点，但无论如何这将是一个很好的练习。始终进行分析以确定最有效的解决方案。

伪代码：

while bytes remaining > (cache size) / 2 do // Half the cache for source, other for dest.
  fetch source bytes
  fetch destination bytes
  perform comparison using fetched bytes
end-while
perform byte by byte comparison for remainder.

有关更多信息，请在网络上搜索“数据驱动设计”和“面向数据的编程”。

某些处理器（例如 ARM 系列）允许条件执行指令（在 32 位、非 Thumb 模式下）。处理器获取指令，但仅在满足条件时才会执行它们。在这种情况下，请尝试根据布尔赋值重新表述比较。这还可以减少所采用的分支数量，从而提高性能。

另请参阅循环展开。
另请参见汇编语言。

通过针对特定处理器定制算法，您可以获得大量性能，但在可移植性方面却比较松散。

Another idea is to optimize for the processor cache and fetching. Processors like to fetch in large chunks rather than individual bytes at random times. Although the internal workings may already account for this, it would be a good exercise anyway. Always profile to determine the most efficient solution.

Psuedo code:

while bytes remaining > (cache size) / 2 do // Half the cache for source, other for dest.
  fetch source bytes
  fetch destination bytes
  perform comparison using fetched bytes
end-while
perform byte by byte comparison for remainder.

For more information, search the web for "Data Driven Design" and "data oriented programming".

Some processors, such as the ARM family, allow for conditional execution of instructions (in 32-bit, non-thumb) mode. The processor fetches the instructions but will only execute them if the conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.

See also loop unrolling.
See also assembly language.

You can gain a lot of performance by tailoring the algorithm to a specific processor, but loose in the portability area.

回复收藏 0 原文