实施memcmp

发布于 2024-10-17 16:23:30 字数 632 浏览 11 评论 0原文

以下是 memcmp 的 Microsoft CRT 实现:

int memcmp(const void* buf1,
           const void* buf2,
           size_t count)
{
    if(!count)
        return(0);

    while(--count && *(char*)buf1 == *(char*)buf2 ) {
        buf1 = (char*)buf1 + 1;
        buf2 = (char*)buf2 + 1;
    }

    return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}

它基本上执行逐字节比较。

我的问题分为两部分:

  1. 是否有任何理由不通过 int 比较将其更改为 int,直到 count count < sizeof(int),然后逐字节比较剩下的是什么?
  2. 如果我要做 1,是否存在任何潜在/明显的问题?

注:我根本不使用CRT,所以无论如何我都必须实现这个功能。我只是在寻找有关如何正确实施它的建议。

The following is the Microsoft CRT implementation of memcmp:

int memcmp(const void* buf1,
           const void* buf2,
           size_t count)
{
    if(!count)
        return(0);

    while(--count && *(char*)buf1 == *(char*)buf2 ) {
        buf1 = (char*)buf1 + 1;
        buf2 = (char*)buf2 + 1;
    }

    return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}

It basically performs a byte by byte comparision.

My question is in two parts:

  1. Is there any reason to not alter this to an int by int comparison until count < sizeof(int), then do a byte by byte comparision for what remains?
  2. If I were to do 1, are there any potential/obvious problems?

Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

沉默的熊 2024-10-24 16:23:30

如果您愿意,您可以将其作为 int-by-int 比较或更广泛的数据类型来进行。

您必须注意(至少)的两件事是开始和结束处的悬垂,以及两个区域之间的对齐方式是否不同。

如果您在不遵循其对齐规则的情况下访问值,某些处理器的运行速度会变慢(如果您尝试这样做,有些处理器甚至会崩溃)。

因此,您的代码可能会先进行 char 比较,直到 int 对齐区域,然后进行 int 比较,然后进行 char 比较再次强调,两个区域的对齐可能很重要。

额外的代码复杂性是否值得您所节省的成本取决于您无法控制的许多因素。一种可能的方法是检测两个区域对齐相同的理想情况并以快速方式执行,否则只需逐个字符地执行。

You could do it as an int-by-int comparison or an even wider data type if you wish.

The two things you have to watch out for (at a minimum) are an overhang at the start as well as the end, and whether the alignments are different between the two areas.

Some processors run slower if you access values without following their alignment rules (some even crash if you try it).

So your code could probably do char comparisons up to an int alignment area, then int comparisons, then char comparisons again but, again, the alignments of both areas will probably matter.

Whether that extra code complexity is worth whatever savings you will get depends on many factors outside your control. A possible method would be to detect the ideal case where both areas are aligned identically and do it a fast way, otherwise just do it character by character.

我早已燃尽 2024-10-24 16:23:30

您提出的优化很常见。最大的问题是,如果您尝试在不允许对单个字节以外的任何内容进行未对齐访问的处理器上运行它,或者在该模式下速度较慢; x86系列不存在这个问题。

它也更复杂,因此更有可能包含错误。

The optimization you propose is very common. The biggest concern would be if you try to run it on a processor that doesn't allow unaligned accesses for anything other than a single byte, or is slower in that mode; the x86 family doesn't have that problem.

It's also more complicated, and thus more likely to contain a bug.

病毒体 2024-10-24 16:23:30

不要忘记,当您在较大的块中发现不匹配时,您必须识别该块内的第一个不同的char,以便您可以计算正确的返回值( memcmp() 返回第一个不同字节的差异,被视为 unsigned char 值。

Don't forget that when you find a mismatch within a larger chunk, you must then identify the first differing char within that chunk so that you can calculate the correct return value (memcmp() returns the difference of the first differing bytes, treated as unsigned char values).

怪我鬧 2024-10-24 16:23:30

如果作为 int 进行比较,则需要检查对齐情况并检查 count 是否可被 sizeof(int) 整除(以将最后一个字节作为 char 进行比较)。

If you compare as int, you will need to check alignment and check if count is divisible by sizeof(int) (to compare the last bytes as char).

微凉 2024-10-24 16:23:30

这真的是他们的实现吗?除了不明智地这样做之外,我还有其他问题:

  • 抛弃常量。
  • 该 return 语句有效吗?无符号字符 - 无符号字符 = 有符号整数?

int at a time 仅在指针对齐时才有效,或者如果您可以从每个指针的前面读取几个字节并且它们仍然对齐,因此如果在对齐边界之前两者均为 1,则您可以读取每个字符的一个字符,然后继续int-at-a-time,但如果它们的对齐方式不同,例如一个对齐,一个不对齐,则无法做到这一点。

当它们实际比较(必须走到末尾)并且数据很长时,memcmp 效率最低(即花费最长)。

我不会自己写,但如果你要比较大部分数据,你可以做一些事情,比如确保对齐,甚至填充末端,然后如果你愿意的话,一次逐字进行。

Is that really their implementation? I have other issues besides not doing it int-wise:

  • castng away constness.
  • does that return statement work? unsigned char - unsigned char = signed int?

int at a time only works if the pointers are aligned, or if you can read a few bytes from the front of each and they are both still aligned, so if both are 1 before the alignment boundary you can read one char of each then go int-at-a-time, but if they are aligned differently eg one is aligned and one is not, there is no way to do this.

memcmp is at its most inefficient (i.e. it takes the longest) when they do actually compare (it has to go to the end) and the data is long.

I would not write my own but if you are going to be comparing large portions of data you could do things like ensure alignment and even pad the ends, then do word-at-a-time, if you want.

萌︼了一个春 2024-10-24 16:23:30

另一个想法是优化处理器缓存和获取。处理器喜欢随机获取大块而不是单个字节。尽管内部运作可能已经考虑到了这一点,但无论如何这将是一个很好的练习。始终进行分析以确定最有效的解决方案。

伪代码:

while bytes remaining > (cache size) / 2 do // Half the cache for source, other for dest.
  fetch source bytes
  fetch destination bytes
  perform comparison using fetched bytes
end-while
perform byte by byte comparison for remainder.

有关更多信息,请在网络上搜索“数据驱动设计”和“面向数据的编程”。

某些处理器(例如 ARM 系列)允许条件执行指令(在 32 位、非 Thumb 模式下)。处理器获取指令,但仅在满足条件时才会执行它们。在这种情况下,请尝试根据布尔赋值重新表述比较。这还可以减少所采用的分支数量,从而提高性能。

另请参阅循环展开
另请参见汇编语言。

通过针对特定处理器定制算法,您可以获得大量性能,但在可移植性方面却比较松散。

Another idea is to optimize for the processor cache and fetching. Processors like to fetch in large chunks rather than individual bytes at random times. Although the internal workings may already account for this, it would be a good exercise anyway. Always profile to determine the most efficient solution.

Psuedo code:

while bytes remaining > (cache size) / 2 do // Half the cache for source, other for dest.
  fetch source bytes
  fetch destination bytes
  perform comparison using fetched bytes
end-while
perform byte by byte comparison for remainder.

For more information, search the web for "Data Driven Design" and "data oriented programming".

Some processors, such as the ARM family, allow for conditional execution of instructions (in 32-bit, non-thumb) mode. The processor fetches the instructions but will only execute them if the conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.

See also loop unrolling.
See also assembly language.

You can gain a lot of performance by tailoring the algorithm to a specific processor, but loose in the portability area.

耶耶耶 2024-10-24 16:23:30

您找到的代码只是 memcmp 的调试实现,它针对简单性和可读性进行了优化,而不是针对性能进行了优化。

内在编译器实现是特定于平台的,并且足够智能,可以在可能的情况下立即生成比较双字或四字(取决于目标体系结构)的处理器指令。
此外,如果两个缓冲区具有相同的地址(buf1 == buf2),则内部实现可能会立即返回。调试实现中也缺少此检查。

最后,即使您确切知道将在哪个平台上运行,完美的实现仍然是不太通用的实现,因为它取决于一系列特定于程序其余部分的不同因素:

  • 最小保证缓冲区是多少结盟?
  • 您能否读取缓冲区末尾之后的任何填充字节而不触发访问冲突?
  • 缓冲区参数可以相同吗?
  • 缓冲区大小可以为0吗?
  • 您只需要比较缓冲区内容是否相等?或者还需要知道哪一个更大(返回值<0或>0)?
  • ...

如果性能是一个问题,我建议在汇编中编写比较例程。大多数编译器都为您提供了查看它们为源生成的程序集列表的选项。您可以使用该代码并根据您的需要进行调整。

The code you found is just a debug implementation of memcmp, it's optimized for simplicity and readability, not for performance.

The intrinsic compiler implementation is platform specific and smart enough to generate processor instructions that compare dwords or qwords (depending on the target architecture) at once whenever possible.
Also, an intrinsic implementation may return immediately if both buffers have the same address (buf1 == buf2). This check is also missing in the debug implementation.

Finally, even when you know exactly on which platform you'll be running, the perfect implementation is still the less generic one as it depends on a bunch of different factors that are specific to the rest of your program:

  • What is the minumum guaranteed buffer alignment?
  • Can you read any padding bytes past the end of a buffer without triggering an access violation?
  • May the buffer parameters be identical?
  • May the buffer size be 0?
  • Do you only need to compare buffer contents for equality? Or do you also need to know which one is larger (return value < 0 or > 0)?
  • ...

If performace is a concern, I suggest writing the comparison routine in assembly. Most compilers give you an option to see the assembly lising that they generate for a source. You could take that code and adapt it to your needs.

浅笑依然 2024-10-24 16:23:30

许多处理器将其实现为单个指令。如果您可以保证您正在运行的处理器可以使用单行内联汇编器来实现。

Many processors implement this as a single instruction. If you can guarantee the processor you're running on it can be implemented with a single line of inline assembler.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文