Why are memcpy() and memmove() faster than pointer increments?


I am copying N bytes from pSrc to pDest. This can be done in a single loop:

for (int i = 0; i < N; i++)
    *pDest++ = *pSrc++;

Why is this slower than memcpy or memmove? What tricks do they use to speed it up?


横笛休吹塞上声 2024-12-17 00:30:28

Because memcpy uses word pointers instead of byte pointers. Also, memcpy implementations are often written with SIMD instructions, which make it possible to shuffle 128 bits at a time.

SIMD instructions are assembly instructions that can perform the same operation on each element in a vector up to 16 bytes long. That includes load and store instructions.
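
A minimal sketch of the idea, assuming an x86 target with SSE2 and, for simplicity, 16-byte-aligned buffers whose size is a multiple of 16 (a real memcpy also handles the unaligned head and tail):

#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>

// Copies n bytes 16 at a time. Assumes dst/src are 16-byte aligned
// and n is a multiple of 16; real implementations handle the rest.
static void sse2_copy(void* dst, const void* src, size_t n)
{
  __m128i* d = (__m128i*)dst;
  const __m128i* s = (const __m128i*)src;
  for (size_t i = 0; i < n / 16; ++i)
    _mm_store_si128(d + i, _mm_load_si128(s + i));
}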

知你几分 2024-12-17 00:30:28

Memory copy routines can be far more complicated and faster than a simple memory copy via pointers such as:

void simple_memory_copy(void* dst, void* src, unsigned int bytes)
{
  unsigned char* b_dst = (unsigned char*)dst;
  unsigned char* b_src = (unsigned char*)src;
  for (unsigned int i = 0; i < bytes; ++i)
    *b_dst++ = *b_src++;
}

Improvements

The first improvement one can make is to align one of the pointers on a word boundary (by word I mean native integer size, usually 32 bits/4 bytes, but it can be 64 bits/8 bytes on newer architectures) and use word-sized move/copy instructions. This requires a byte-by-byte copy until a pointer is aligned.

#include <stdint.h>  // for uintptr_t

void aligned_memory_copy(void* dst, void* src, unsigned int bytes)
{
  unsigned char* b_dst = (unsigned char*)dst;
  unsigned char* b_src = (unsigned char*)src;

  // Copy byte-by-byte until the source pointer is word-aligned;
  // the pointer value must be inspected as an integer, hence the cast
  while (bytes > 0 && ((uintptr_t)b_src & 0x3) != 0)
  {
    *b_dst++ = *b_src++;
    bytes--;
  }

  unsigned int* w_dst = (unsigned int*)b_dst;
  unsigned int* w_src = (unsigned int*)b_src;
  while (bytes >= 4)
  {
    *w_dst++ = *w_src++;
    bytes -= 4;
  }

  // Copy trailing bytes
  if (bytes > 0)
  {
    b_dst = (unsigned char*)w_dst;
    b_src = (unsigned char*)w_src;
    while (bytes > 0)
    {
      *b_dst++ = *b_src++;
      bytes--;
    }
  }
}

Different architectures will perform differently depending on whether the source or the destination pointer is appropriately aligned. For instance, on an XScale processor I got better performance by aligning the destination pointer rather than the source pointer.

To further improve performance, some loop unrolling can be done so that more of the processor's registers are loaded with data; that means the load/store instructions can be interleaved and their latency hidden by additional instructions (such as loop counting). The benefit this brings varies quite a bit by processor, since load/store instruction latencies can be quite different.

At this stage the code ends up being written in Assembly rather than C (or C++) since you need to manually place the load and store instructions to get maximum benefit of latency hiding and throughput.

Generally a whole cache line of data should be copied in one iteration of the unrolled loop.
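
As an illustrative sketch (not a tuned implementation), the word loop from aligned_memory_copy above could be unrolled to move eight words per iteration, so the loop branch and counter update run an eighth as often:

// Illustrative sketch: unrolled word copy, 8 words per iteration.
// Assumes both pointers are already word-aligned; the tail is
// handled with a plain word loop.
static void unrolled_word_copy(unsigned int* w_dst,
                               const unsigned int* w_src,
                               unsigned int words)
{
  while (words >= 8)
  {
    w_dst[0] = w_src[0];
    w_dst[1] = w_src[1];
    w_dst[2] = w_src[2];
    w_dst[3] = w_src[3];
    w_dst[4] = w_src[4];
    w_dst[5] = w_src[5];
    w_dst[6] = w_src[6];
    w_dst[7] = w_src[7];
    w_dst += 8;
    w_src += 8;
    words -= 8;
  }
  while (words-- > 0)
    *w_dst++ = *w_src++;
}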

Which brings me to the next improvement: adding pre-fetching. These are special instructions that tell the processor's cache system to load specific parts of memory into its cache. Since there is a delay between issuing the instruction and having the cache line filled, the instructions need to be placed so that the data arrives just as it is about to be copied, neither too early nor too late.

This means putting prefetch instructions at the start of the function as well as inside the main copy loop, with the ones inside the loop fetching data that will be copied several iterations later.

I can't remember, but it may also be beneficial to prefetch the destination addresses as well as the source ones.
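
A rough sketch using GCC/Clang's __builtin_prefetch; the look-ahead distance AHEAD is a made-up tuning constant, and the right value depends on the target's memory latency:

#include <stddef.h>

// Sketch: word copy that prefetches source data ahead of the copy
// position. AHEAD is a hypothetical tuning constant (cache lines of
// look-ahead); 16 words = one 64-byte line on common CPUs.
#define AHEAD 4

static void prefetching_word_copy(unsigned int* w_dst,
                                  const unsigned int* w_src,
                                  size_t words)
{
  for (size_t i = 0; i < words; ++i)
  {
    if ((i & 15) == 0)  // once per 64-byte cache line
      __builtin_prefetch(w_src + i + 16 * AHEAD, 0 /* read */);
    w_dst[i] = w_src[i];
  }
}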

Factors

The main factors that affect how fast memory can be copied are:

  • The latency between the processor, its caches, and main memory.
  • The size and structure of the processor's cache lines.
  • The processor's memory move/copy instructions (latency, throughput, register size, etc).

So if you want to write an efficient and fast memory copy routine you'll need to know quite a lot about the processor and architecture you are writing for. Suffice to say, unless you're writing for some embedded platform it would be much easier to just use the built-in memory copy routines.

娇妻 2024-12-17 00:30:28

memcpy can copy more than one byte at once depending on the computer's architecture. Most modern computers can work with 32 bits or more in a single processor instruction.

From one example implementation:

    /*
     * For speedy copying, optimize the common case where both pointers
     * and the length are word-aligned, and copy word-at-a-time instead
     * of byte-at-a-time. Otherwise, copy by bytes.
     */

饮湿 2024-12-17 00:30:28

You can implement memcpy() using any of the following techniques, some dependent on your architecture for performance gains, and they will all be much faster than your code:

  1. Use larger units, such as 32-bit words instead of bytes. You can also (or may have to) deal with alignment here as well. On some platforms you can't read/write a 32-bit word to an odd memory location, and on other platforms you pay a massive performance penalty. To fix this, the address has to be divisible by 4. You can take this up to 64 bits on 64-bit CPUs, or even higher using SIMD (Single Instruction, Multiple Data) instructions (MMX, SSE, etc.)

  2. You can use special CPU instructions that your compiler may not be able to generate from C. For example, on an 80386 you can use the "rep" prefix with the "movsb" instruction to move N bytes, by placing N in the count register. Good compilers will just do this for you, but you may be on a platform that lacks a good compiler. Note that this example tends to be a poor demonstration of speed on its own, but combined with alignment and larger-unit instructions it can be faster than most alternatives on certain CPUs (see the sketch after this list).

  3. Loop unrolling -- branches can be quite expensive on some CPUs, so unrolling the loops can lower the number of branches. This is also a good technique for combining with SIMD instructions and very large sized units.
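
As a minimal sketch of technique 2, here is a hypothetical helper for x86-64 with GCC/Clang inline assembly; it only illustrates the instruction and none of the alignment tricks:

#include <stddef.h>

// Sketch: copy n bytes with the x86 "rep movsb" string instruction.
// RDI = destination, RSI = source, RCX = count; the "+" constraints
// tell the compiler all three registers are consumed by the copy.
static void rep_movsb_copy(void* dst, const void* src, size_t n)
{
  __asm__ volatile("rep movsb"
                   : "+D"(dst), "+S"(src), "+c"(n)
                   : /* no other inputs */
                   : "memory");
}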

For example, http://www.agner.org/optimize/#asmlib has a memcpy implementation that beats most out there (by a very tiny amount). If you read the source code, it is full of inlined assembly code that pulls off all three of the above techniques, choosing among them based on what CPU you are running on.

Note, there are similar optimizations that can be made for finding bytes in a buffer too. strchr() and friends will often be faster than your hand-rolled equivalent. This is especially true for .NET and Java. For example, in .NET the built-in String.IndexOf() is much faster than even a Boyer–Moore string search, because it uses the above optimization techniques.

迷离° 2024-12-17 00:30:28

I don't know whether it is actually used in any real-world implementations of memcpy, but I think Duff's Device deserves a mention here.

From Wikipedia:

send(to, from, count)
register short *to, *from;
register count;
{
        register n = (count + 7) / 8;
        switch(count % 8) {
        case 0:      do {     *to = *from++;
        case 7:              *to = *from++;
        case 6:              *to = *from++;
        case 5:              *to = *from++;
        case 4:              *to = *from++;
        case 3:              *to = *from++;
        case 2:              *to = *from++;
        case 1:              *to = *from++;
                } while(--n > 0);
        }
}

Note that the above isn't a memcpy since it deliberately doesn't increment the to pointer. It implements a slightly different operation: writing into a memory-mapped register. See the Wikipedia article for details.

泡沫很甜 2024-12-17 00:30:28

Short answer:

  • cache fill
  • wordsize transfers instead of byte ones where possible
  • SIMD magic

浅笑依然 2024-12-17 00:30:28

As others have said, memcpy copies chunks larger than one byte. Copying in word-sized chunks is much faster. However, most implementations take it a step further and run several MOV (word) instructions before looping. The advantage of copying in, say, 8-word blocks per loop is that the loop itself is costly. This technique reduces the number of conditional branches by a factor of 8, optimizing the copy for giant blocks.

吃不饱 2024-12-17 00:30:28

The answers are great, but if you still want to implement a fast memcpy yourself, there is an interesting blog post about it: Fast memcpy in C.

void *memcpy(void* dest, const void* src, size_t count)
{
    char* dst8 = (char*)dest;
    const char* src8 = (const char*)src;  // don't cast away const

    // Copy the odd byte first, leaving an even count.
    if (count & 1) {
        dst8[0] = src8[0];
        dst8 += 1;
        src8 += 1;
    }

    // Copy the remaining bytes two at a time.
    count /= 2;
    while (count--) {
        dst8[0] = src8[0];
        dst8[1] = src8[1];

        dst8 += 2;
        src8 += 2;
    }
    return dest;
}

It can be made even better by optimizing memory accesses.

一场春暖 2024-12-17 00:30:28

Because like many library routines it has been optimized for the architecture you are running on. Others have posted various techniques which can be used.

Given the choice, use library routines rather than rolling your own. This is a variation on DRY that I call DRO (Don't Repeat Others). Also, library routines are less likely to be wrong than your own implementation.

I have seen memory access checkers complain about out-of-bounds reads on memory or string buffers that were not a multiple of the word size. This is a result of the optimization being used.

待"谢繁草 2024-12-17 00:30:28

You can look at the MacOS implementation of memset, memcpy and memmove.

At boot time, the OS determines which processor it's running on. It has built-in, specifically optimised code for each supported processor, and at boot time stores a jmp instruction to the right code in a fixed read-only location.

The C memset, memcpy and memmove implementations are just a jump to that fixed location.
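
A simplified sketch of that dispatch pattern in C; cpu_has_wide_moves() is a hypothetical detection hook, where a real system would query CPUID or an OS-provided capability word at boot:

#include <stddef.h>

// Sketch of boot-time dispatch: resolve once to the best routine for
// the running CPU, then every later call goes through the pointer.
typedef void (*copy_fn)(void* dst, const void* src, size_t n);

static void copy_byte_at_a_time(void* d, const void* s, size_t n)
{
  unsigned char* dp = d;
  const unsigned char* sp = s;
  while (n--)
    *dp++ = *sp++;
}

static void copy_word_at_a_time(void* d, const void* s, size_t n)
{
  copy_byte_at_a_time(d, s, n);  // stand-in for a tuned SIMD/word routine
}

static int cpu_has_wide_moves(void) { return 0; }  // hypothetical hook

static copy_fn fast_copy;  // set once at startup

void copy_init(void)
{
  fast_copy = cpu_has_wide_moves() ? copy_word_at_a_time
                                   : copy_byte_at_a_time;
}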

The implementations use different code depending on alignment of source and destination for memcpy and memmove. They obviously use all available vector capabilities. They also use non-caching variants when you copy large amounts of data, and have instructions to minimise waits for page tables. It's not just assembler code, it's assembler code written by someone with extremely good knowledge of each processor architecture.

Intel also added assembler instructions that can make string operations faster. For example with an instruction to support strstr which does 256 byte compares in one cycle.
