What are real significant cases when memcpy() is faster than memmove()?

Posted 2024-09-19 10:10:25 · 515 words · 14 views · 0 comments

The key difference between memcpy() and memmove() is that memmove() works correctly when source and destination overlap. When the buffers are known not to overlap, memcpy() is preferable since it's potentially faster.

What bothers me is this "potentially". Is it a microoptimization, or are there real, significant cases where memcpy() is faster, so that we really need to use memcpy() rather than sticking to memmove() everywhere?

Comments (7)

不必了 2024-09-26 10:10:25

There's at least an implicit branch in memmove() to copy either forwards or backwards if the compiler cannot deduce that an overlap is impossible. This means that, unless the call can be optimized as if it were memcpy(), memmove() is slower by at least one branch, plus whatever additional space the inlined instructions for handling each case occupy (if inlining is possible).

Reading the eglibc-2.11.1 code for both memcpy() and memmove() confirms this as suspected. Furthermore, there's no possibility of page copying during backward copying, a significant speedup only available if there's no chance for overlapping.

In summary, this means: if you can guarantee the regions do not overlap, then selecting memcpy() over memmove() avoids a branch. If the source and destination contain corresponding page-aligned and page-sized regions, and don't overlap, some architectures can employ hardware-accelerated copies for those regions, regardless of whether you called memmove() or memcpy().

Update0

There is actually one more difference beyond the assumptions and observations I've listed above. As of C99, the following prototypes exist for the 2 functions:

void *memcpy(void * restrict s1, const void * restrict s2, size_t n);
void *memmove(void * s1, const void * s2, size_t n);

Because the 2 pointers s1 and s2 can be assumed not to point at overlapping memory, straightforward C implementations of memcpy are able to leverage this to generate more efficient code without resorting to assembler. I'm sure that memmove can do this too; however, additional checks would be required beyond those I saw present in eglibc, meaning the performance cost may be slightly more than a single branch for C implementations of these functions.
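To make the role of restrict concrete, here is a minimal sketch of two otherwise identical byte loops. The names copy_no_overlap and copy_forward_may_alias are hypothetical and do not come from eglibc or any other library. With restrict, the compiler is allowed to assume the buffers never alias and may widen or reorder the loads and stores; without it, the compiler generally has to preserve the byte-by-byte order or emit its own runtime overlap check first. Whether that yields a measurable difference depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name. The restrict qualifiers promise the compiler that
   dst and src never overlap, as in the C99 memcpy prototype. */
void *copy_no_overlap(void *restrict dst, const void *restrict src, size_t n)
{
    unsigned char *restrict d = dst;
    const unsigned char *restrict s = src;

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Hypothetical name. Same loop without restrict: the compiler must allow
   for d and s aliasing. Note that this is NOT a memmove; it only copies
   forwards, so it is wrong when dst overlaps the tail of src. */
void *copy_forward_may_alias(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}
```

For non-overlapping buffers both functions produce the same result; comparing the assembly a given compiler emits for each at -O2 or -O3 is one way to see what the qualifier buys on that platform.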

顾忌 2024-09-26 10:10:25

At best, calling memcpy rather than memmove will save a pointer comparison and a conditional branch. For a large copy, this is completely insignificant. If you are doing many small copies, then it might be worth measuring the difference; that is the only way you can tell whether it's significant or not.
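A rough sketch of such a measurement follows. The helper name time_small_copies and the 256-byte buffer size are assumptions made for illustration. Calling through a function pointer keeps the compiler from inlining either function, so this only measures the out-of-line library routines; timings from clock() are coarse and vary between runs, so treat the numbers as indicative at best.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical helper: returns the CPU seconds spent performing
   `iterations` copies of `size` bytes between two distinct buffers. */
double time_small_copies(copy_fn copy, size_t size, long iterations)
{
    static unsigned char src[256], dst[256];
    volatile unsigned char sink = 0; /* keeps the copies from being elided */

    assert(size > 0 && size <= sizeof src);

    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        src[0] = (unsigned char)i;   /* vary the input a little */
        copy(dst, src, size);
        sink ^= dst[size - 1];
    }
    clock_t end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A caller would compare time_small_copies(memcpy, 16, 100000000L) against time_small_copies(memmove, 16, 100000000L), ideally repeating each run several times and for several sizes.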

It is definitely a microoptimisation, but that doesn't mean you shouldn't use memcpy when you can easily prove that it is safe. Premature pessimisation is the root of much evil.

绝不放开 2024-09-26 10:10:25

Well, memmove has to copy backwards when the source and destination overlap, and the source is before the destination. So, some implementations of memmove simply copy backwards when the source is before the destination, without regard for whether the two regions overlap.

A quality implementation of memmove can detect whether the regions overlap, and do a forward-copy when they don't. In such a case, the only extra overhead compared to memcpy is simply the overlap checks.
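The overlap check described above can be sketched as follows. The name move_sketch is hypothetical, and the comparison assumes a flat address space in which converting pointers to uintptr_t preserves their ordering, which standard C does not guarantee. Real library implementations copy in much larger units than single bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of a memmove-style routine. */
void *move_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    /* Assumption: integer values of the pointers reflect address order. */
    uintptr_t da = (uintptr_t)d;
    uintptr_t sa = (uintptr_t)s;

    if (da > sa && da - sa < n) {
        /* dst starts inside [src, src + n): copy backwards so that source
           bytes are read before they are overwritten. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    } else {
        /* No harmful overlap: the plain forward copy is safe. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```

The only work beyond a plain forward copy is the two integer operations and the branch at the top, which is the "overlap check" overhead the answer refers to.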

望她远 2024-09-26 10:10:25

Simplistically, memmove needs to test for overlap and then do the appropriate thing; with memcpy, one asserts that there is no overlap, so there is no need for additional tests.

Having said that, I have seen platforms that have exactly the same code for memcpy and memmove.

财迷小姐 2024-09-26 10:10:25

It's certainly possible that memcpy is merely a call to memmove, in which case there'd be no benefit to using memcpy. On the other extreme, it's possible that an implementor assumed memmove would rarely be used, and implemented it with the simplest possible byte-at-a-time loops in C, in which case it could be ten times slower than an optimized memcpy. As others have said, the likeliest case is that memmove uses memcpy when it detects that a forward copy is possible, but some implementations may simply compare the source and destination addresses without looking for overlap.

With that said, I would recommend never using memmove unless you're shifting data within a single buffer. It might not be slower, but then again, it might be, so why risk it when you know there's no need for memmove?
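Shifting data within a single buffer is the case where memmove is genuinely required, because source and destination overlap. A minimal sketch, with the hypothetical helper insert_at, which assumes the caller has reserved room for at least count + 1 elements:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: inserts `value` at index `pos` of an array that
   currently holds `count` ints. Assumes capacity for count + 1 elements. */
void insert_at(int *arr, size_t count, size_t pos, int value)
{
    assert(pos <= count);

    /* The tail arr[pos..count) moves up by one slot. The two ranges
       overlap, so memcpy would have undefined behavior here. */
    memmove(&arr[pos + 1], &arr[pos], (count - pos) * sizeof arr[0]);
    arr[pos] = value;
}
```

Copying between two separately allocated buffers, by contrast, can never overlap, and that is where memcpy is the natural choice.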

断舍离 2024-09-26 10:10:25

Just simplify and always use memmove. A function that's right all the time is better than a function that's only right half the time.

各空 2024-09-26 10:10:25

It is entirely possible that in most implementations, the cost of a memmove() function call will not be significantly greater than memcpy() in any scenario in which the behavior of both is defined. There are two points not yet mentioned, though:

  1. In some implementations, the determination of address overlap may be expensive. There is no way in standard C to determine whether the source and destination objects point to the same allocated area of memory, and thus no way that the greater-than or less-than operators can be used upon them without spontaneously causing cats and dogs to get along with each other (or invoking other Undefined Behavior). It is likely that any practical implementation will have some efficient means of determining whether or not the pointers overlap, but the standard doesn't require that such a means exist. A memmove() function written entirely in portable C would on many platforms probably take at least twice as long to execute as would a memcpy() also written entirely in portable C.
  2. Implementations are allowed to expand functions in-line when doing so would not alter their semantics. On an 80x86 compiler, if the ESI and EDI registers don't happen to hold anything important, a memcpy(dest, src, 1234) could generate code:
      mov esi,[src]
      mov edi,[dest]
      mov ecx,1234/4 ; Compiler could notice it's a constant
      cld
      rep movsd      ; Copy 308 dwords (1232 bytes)
      movsw          ; Copy the 2 remaining bytes, since 1234 = 4*308 + 2
    

    This would take the same amount of in-line code, but run much faster than:

      push [src]
      push [dest]
      push dword 1234
      call _memcpy
    
      ...
    
    _memcpy:
      push ebp
      mov  ebp,esp
      mov  ecx,[ebp+numbytes]
      test ecx,3   ; See if it's a multiple of four
      jz   multiple_of_four
      ...          ; (handling of other sizes omitted)
    
    multiple_of_four:
      push esi ; Can't know if caller needs this value preserved
      push edi ; Can't know if caller needs this value preserved
      mov  esi,[ebp+src]
      mov  edi,[ebp+dest]
      shr  ecx,2 ; Convert the byte count to a dword count
      cld
      rep movsd
      pop  edi
      pop  esi
      pop  ebp
      ret
    

Quite a number of compilers will perform such optimizations with memcpy(). I don't know of any that will do it with memmove, although in some cases an optimized version of memcpy may offer the same semantics as memmove. For example, if numbytes was 20:

; Assuming values in eax, ebx, ecx, edx, esi, and edi are not needed
  mov esi,[src]
  mov eax,[esi]
  mov ebx,[esi+4]
  mov ecx,[esi+8]
  mov edx,[esi+12]
  mov edi,[esi+16]
  mov esi,[dest]
  mov [esi],eax
  mov [esi+4],ebx
  mov [esi+8],ecx
  mov [esi+12],edx
  mov [esi+16],edi

This will work correctly even if the address ranges overlap, since it effectively makes a copy (in registers) of the entire region to be moved before any of it is written. In theory, a compiler could process memmove() by seeing if treating it as memcpy() would yield an implementation that would be safe even if the address ranges overlap, and call _memmove in those cases where substituting the memcpy() implementation would not be safe. I don't know of any that do such optimization, though.
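The same "read everything, then write everything" idea can be expressed in portable C without comparing pointers at all, at the price of touching every byte twice, which is in line with the rough factor of two mentioned in point 1. The sketch below is an illustration under stated assumptions: the name small_move and the 64-byte cutoff SMALL_MOVE_MAX are hypothetical, and larger sizes would need a different strategy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SMALL_MOVE_MAX = 64 }; /* hypothetical cutoff for this sketch */

/* Hypothetical helper: overlap-safe move of at most SMALL_MOVE_MAX bytes.
   Each memcpy call is valid because tmp never overlaps src or dst. */
void *small_move(void *dst, const void *src, size_t n)
{
    unsigned char tmp[SMALL_MOVE_MAX];

    assert(n <= sizeof tmp);
    memcpy(tmp, src, n);  /* read the whole region first */
    memcpy(dst, tmp, n);  /* then write it out */
    return dst;
}
```

For a small constant n, an optimizing compiler may keep tmp entirely in registers, which would give code much like the 20-byte listing above; that outcome is not guaranteed.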
