Faster alternative to memcpy?

Posted 2024-09-03 19:53:25

I have a function that is doing memcpy, but it's taking up an enormous amount of cycles. Is there a faster alternative/approach than using memcpy to move a piece of memory?

Comments (16)

み青杉依旧 2024-09-10 19:53:25

memcpy is likely to be the fastest way to copy bytes around in memory. If you need something faster, try figuring out a way of not copying things around at all: for example, swap pointers only, not the data itself.
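
For example, a classic way to apply this is double buffering. A minimal sketch follows; the names process, front and back are illustrative, not from the question:

#include <utility>
#include <vector>

// Hypothetical processing step: reads `in`, writes the next state into `out`.
void process(const std::vector<char> &in, std::vector<char> &out);

void step(std::vector<char> &front, std::vector<char> &back) {
  process(front, back);    // produce the result into the spare buffer
  std::swap(front, back);  // O(1): swaps the vectors' internal pointers, no byte copy
}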

小ぇ时光︴ 2024-09-10 19:53:25

This is an answer for x86_64 with the AVX2 instruction set present. Though something similar may apply to ARM/AArch64 with SIMD.

On a Ryzen 1800X with a single memory channel filled completely (2 slots, 16 GB DDR4 in each), the following code is 1.56 times faster than memcpy() with the MSVC++2017 compiler. If you fill both memory channels with 2 DDR4 modules, i.e. have all 4 DDR4 slots busy, you may get a further 2x speedup in memory copying. For triple-(quad-)channel memory systems, you can get a further 1.5x (2.0x) speedup if the code is extended to analogous AVX512 code. AVX2-only triple/quad-channel systems with all slots busy are not expected to be faster, because to load them fully you would need to load/store more than 32 bytes at once (48 bytes for triple- and 64 bytes for quad-channel systems), while AVX2 can load/store no more than 32 bytes at once. Though multithreading on some systems can alleviate this even without AVX512 or AVX2.

So here is the copy code that assumes you are copying a large block of memory whose size is a multiple of 32 and the block is 32-byte aligned.

For sizes that are not a multiple of 32 and for unaligned blocks, prologue/epilogue code can be written that reduces the width to 16 (SSE4.1), 8, 4, 2 and finally 1 byte at a time for the block head and tail. Also, in the middle, a local array of 2-3 __m256i values can be used as a proxy between aligned reads from the source and aligned writes to the destination (a simplified sketch follows the code below).

#include <immintrin.h>
#include <cstdint>
#include <cassert>
/* ... */
void fastMemcpy(void *pvDest, void *pvSrc, size_t nBytes) {
  assert(nBytes % 32 == 0);
  assert((intptr_t(pvDest) & 31) == 0);
  assert((intptr_t(pvSrc) & 31) == 0);
  const __m256i *pSrc = reinterpret_cast<const __m256i*>(pvSrc);
  __m256i *pDest = reinterpret_cast<__m256i*>(pvDest);
  int64_t nVects = nBytes / sizeof(*pSrc);
  for (; nVects > 0; nVects--, pSrc++, pDest++) {
    // Non-temporal load and store: the data bypasses the CPU cache.
    const __m256i loaded = _mm256_stream_load_si256(pSrc);
    _mm256_stream_si256(pDest, loaded);
  }
  // Make the non-temporal stores globally visible before returning.
  _mm_sfence();
}
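
For illustration, here is one hedged way to wrap fastMemcpy() for arbitrary sizes and alignments, along the lines the answer sketches. This wrapper is my illustration, not the answer's code; it falls back to plain memcpy for the unaligned head and tail (and for the whole copy when source and destination have different misalignments):

#include <cstring>
#include <cstdint>

void fastMemcpyAny(void *pvDest, void *pvSrc, size_t nBytes) {
  char *pcDest = static_cast<char*>(pvDest);
  char *pcSrc  = static_cast<char*>(pvSrc);
  // Head: plain copy until the source reaches 32-byte alignment.
  size_t nHead = (32 - (intptr_t(pcSrc) & 31)) & 31;
  if (nHead > nBytes) nHead = nBytes;
  memcpy(pcDest, pcSrc, nHead);
  pcDest += nHead; pcSrc += nHead; nBytes -= nHead;
  // Aligned middle: the largest multiple of 32 bytes, streamed via fastMemcpy,
  // but only if the destination ended up 32-byte aligned too.
  const size_t nMid = nBytes & ~size_t(31);
  if (nMid != 0 && (intptr_t(pcDest) & 31) == 0) {
    fastMemcpy(pcDest, pcSrc, nMid);
    pcDest += nMid; pcSrc += nMid; nBytes -= nMid;
  }
  // Tail: whatever remains (0-31 bytes, or everything if dest was misaligned).
  memcpy(pcDest, pcSrc, nBytes);
}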

A key feature of this code is that it skips the CPU cache when copying: when the CPU cache is involved (i.e. AVX instructions without _stream_ are used), the copy speed drops several-fold on my system.

My DDR4 memory is 2.6 GHz CL13. So when copying 8 GB of data from one array to another, I got the following speeds:

memcpy(): 17,208,004,271 bytes/sec.
Stream copy: 26,842,874,528 bytes/sec.

Note that in these measurements the total size of both input and output buffers is divided by the number of seconds elapsed, because for each byte of the array there are 2 memory accesses: one to read the byte from the input array, another to write the byte to the output array. In other words, when copying 8 GB from one array to another, you do 16 GB worth of memory access operations.

Moderate multithreading can further improve performance by about 1.44 times, so the total gain over memcpy() reaches 2.55 times on my machine.
Here's how stream copy performance depends on the number of threads used on my machine:

Stream copy 1 threads: 27114820909.821 bytes/sec
Stream copy 2 threads: 37093291383.193 bytes/sec
Stream copy 3 threads: 39133652655.437 bytes/sec
Stream copy 4 threads: 39087442742.603 bytes/sec
Stream copy 5 threads: 39184708231.360 bytes/sec
Stream copy 6 threads: 38294071248.022 bytes/sec
Stream copy 7 threads: 38015877356.925 bytes/sec
Stream copy 8 threads: 38049387471.070 bytes/sec
Stream copy 9 threads: 38044753158.979 bytes/sec
Stream copy 10 threads: 37261031309.915 bytes/sec
Stream copy 11 threads: 35868511432.914 bytes/sec
Stream copy 12 threads: 36124795895.452 bytes/sec
Stream copy 13 threads: 36321153287.851 bytes/sec
Stream copy 14 threads: 36211294266.431 bytes/sec
Stream copy 15 threads: 35032645421.251 bytes/sec
Stream copy 16 threads: 33590712593.876 bytes/sec

The code is:

#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

void AsyncStreamCopy(__m256i *pDest, const __m256i *pSrc, int64_t nVects) {
  for (; nVects > 0; nVects--, pSrc++, pDest++) {
    const __m256i loaded = _mm256_stream_load_si256(pSrc);
    _mm256_stream_si256(pDest, loaded);
  }
}

void BenchmarkMultithreadStreamCopy(double *gpdOutput, const double *gpdInput, const int64_t cnDoubles) {
  assert((cnDoubles * sizeof(double)) % sizeof(__m256i) == 0);
  const uint32_t maxThreads = std::thread::hardware_concurrency();
  std::vector<std::thread> thrs;
  thrs.reserve(maxThreads + 1);

  const __m256i *pSrc = reinterpret_cast<const __m256i*>(gpdInput);
  __m256i *pDest = reinterpret_cast<__m256i*>(gpdOutput);
  const int64_t nVects = cnDoubles * sizeof(*gpdInput) / sizeof(*pSrc);

  for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
    auto start = std::chrono::high_resolution_clock::now();
    // Split the vectors evenly across workers; the first `rem` workers get one extra.
    lldiv_t perWorker = lldiv((long long)nVects, (long long)nThreads);
    int64_t nextStart = 0;
    for (uint32_t i = 0; i < nThreads; i++) {
      const int64_t curStart = nextStart;
      nextStart += perWorker.quot;
      if ((long long)i < perWorker.rem) {
        nextStart++;
      }
      thrs.emplace_back(AsyncStreamCopy, pDest + curStart, pSrc + curStart, nextStart - curStart);
    }
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs[i].join();
    }
    _mm_sfence();
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    // Report read+write traffic: each copied byte is one read plus one write.
    printf("Stream copy %d threads: %.3lf bytes/sec\n", (int)nThreads, cnDoubles * 2 * sizeof(double) / nSec);

    thrs.clear();
  }
}

UPDATE 2023-01-18:
I don't have that system anymore, but the 2666 MHz DDR4 was marked PC4-21300U, which means 22334668800 bytes/second from one RAM slot. As I had 2 RAM slots, the max bandwidth was 44669337600 bytes/second. The approach with SIMD and multithreading thus achieved 87.72% of the theoretical bandwidth when using 5 threads.

甜心 2024-09-10 19:53:25

Please offer us more details. On the i386 architecture it is very possible that memcpy is the fastest way of copying. But on a different architecture for which the compiler doesn't have an optimized version, it is best to rewrite your memcpy function. I did this on a custom ARM architecture using assembly language. If you transfer BIG chunks of memory, then DMA is probably the answer you are looking for.

Please offer more details - architecture, operating system (if relevant).

梦开始←不甜 2024-09-10 19:53:25

Usually the standard library shipped with the compiler will implement memcpy() the fastest way possible for the target platform already.

稚气少女 2024-09-10 19:53:25

Actually, memcpy is NOT the fastest way, especially if you call it many times. I also had some code that I really needed to speed up, and memcpy is slow because it has too many unnecessary checks. For example, it checks whether the destination and source memory blocks overlap and whether it should start copying from the back of the block rather than the front. If you do not care about such considerations, you can certainly do significantly better. I have some code, but here is perhaps an even better version:

Very fast memcpy for image processing?

If you search, you can find other implementations as well. But for true speed you need an assembly version.

苍白女子 2024-09-10 19:53:25

Sometimes functions like memcpy, memset, ... are implemented in two different ways:

  • once as a real function
  • once as some assembly that's immediately inlined

Not all compilers take the inlined-assembly version by default; your compiler may use the function variant by default, causing some overhead because of the function call.
Check your compiler to see how to get the intrinsic variant of the function (command-line option, pragmas, ...).

Edit: See http://msdn.microsoft.com/en-us/library/tzkfha43%28VS.80%29.aspx for an explanation of intrinsics on the Microsoft C compiler.
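
For instance, on the Microsoft compiler the intrinsic variant can be requested per function with #pragma intrinsic (the /Oi switch enables intrinsics globally). A minimal illustration, not from the answer:

#include <string.h>
#pragma intrinsic(memcpy)  // ask MSVC to expand memcpy inline, not call the library

void copyBlock4k(char *dst, const char *src) {
  memcpy(dst, src, 4096);  // eligible for inline expansion as an intrinsic
}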

淡淡的优雅 2024-09-10 19:53:25

Here is an alternative C version of memcpy that is inlineable, and I find it outperforms GCC's memcpy for Arm64 by about 50% in the application I used it for. It is independent of the particular 64-bit platform. The tail processing can be removed for a bit more speed if the usage instance does not need it. It copies uint32_t arrays; smaller datatypes are untested but might work, and it might be adaptable to other datatypes. The copy is 64-bit (two 32-bit elements are copied simultaneously); 32-bit copies should also work, but slower. Credits to the Neoscrypt project.

    #include <stdint.h>

    static inline void newmemcpy(void *__restrict__ dstp,
                                 void *__restrict__ srcp, uint32_t len)
    {
        uint64_t *dst = (uint64_t *) dstp;
        uint64_t *src = (uint64_t *) srcp;
        uint32_t i, tail;

        /* Copy 8 bytes per iteration. */
        for(i = 0; i < (len / sizeof(uint64_t)); i++)
            *dst++ = *src++;
        /*
          Remove below if your application does not need it.
          If console application, you can uncomment the printf to test
          whether tail processing is being used.
        */
        tail = len & (sizeof(uint64_t) - 1);
        if(tail) {
            /* printf("tail used\n"); */
            uint8_t *dstb = (uint8_t *) dstp;
            uint8_t *srcb = (uint8_t *) srcp;

            /* Copy the remaining 1-7 bytes one at a time. */
            for(i = len - tail; i < len; i++)
                dstb[i] = srcb[i];
        }
    }

桃气十足 2024-09-10 19:53:25

This question is 12 years old as I write yet another answer. But it still comes up in searches, and the answers are always evolving.

Surprised no one has mentioned Agner Fog's asmlib yet.
A drop-in replacement for memcpy() plus many other SIMD-optimized C library replacements like memmove(), memset(), strlen(), etc.
It will automatically use the best instruction set your CPU supports, up to AVX-512. Comes with prebuilt libs for several x86/AMD64 platforms.
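
A hedged usage sketch: asmlib exposes its routines under A_-prefixed names (it can also override the standard names at link time); header and function names follow the asmlib distribution.

#include <cstddef>
#include "asmlib.h"  // from Agner Fog's asmlib distribution

void copyBlock(void *dst, const void *src, size_t n) {
  A_memcpy(dst, src, n);  // dispatches to the best variant for the running CPU
}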

薄荷港 2024-09-10 19:53:25

You should check the assembly code generated for your code. What you don't want is to have the memcpy call generate a call to the memcpy function in the standard library - what you want is a repeated use of the best ASM instruction for copying the largest amount of data - something like rep movsq.

How can you achieve this? Well, the compiler optimizes calls to memcpy by replacing them with simple movs as long as it knows how much data it should copy. You can see this if you write a memcpy with a well-determined (constexpr) value. If the compiler doesn't know the value, it will have to fall back to the byte-level implementation of memcpy - the issue being that memcpy has to respect one-byte granularity. It will still move 128 bits at a time, but after each 128b it has to check whether it has enough data left to copy as 128b, or fall back to 64 bits, then to 32 and 8 (I think that 16 might be suboptimal anyway, but I don't know for sure).

So what you want is to be able to tell memcpy the size of your data with const expressions that the compiler can optimize. This way no call to memcpy is performed. What you don't want is to pass memcpy a variable that will only be known at run-time. That translates into a function call and tons of tests to check the best copy instruction. Sometimes, a simple for loop is better than memcpy for this reason (eliminating one function call). And what you really, really don't want is to pass memcpy an odd number of bytes to copy.
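
To make the distinction concrete, here is an illustrative snippet (not from the answer); whether the call is actually inlined depends on the compiler and flags, so inspect the generated assembly:

#include <cstring>

struct Packet { char payload[256]; };

void copy_fixed(Packet &dst, const Packet &src) {
  std::memcpy(&dst, &src, sizeof(Packet));  // size known at compile time:
                                            // usually inlined as a few wide movs
}

void copy_runtime(char *dst, const char *src, std::size_t n) {
  std::memcpy(dst, src, n);  // runtime size: typically stays a library call
}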

琉璃繁缕 2024-09-10 19:53:25

Check your compiler/platform manual. For some microprocessors and DSP kits, using memcpy is much slower than intrinsic functions or DMA operations.

生寂 2024-09-10 19:53:25

If your platform supports it, look into whether you can use the mmap() system call to leave your data in the file... generally the OS can manage that better. And, as everyone has been saying, avoid copying if at all possible; pointers are your friend in cases like this.
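
A minimal POSIX sketch of the idea: map the file instead of reading and copying it into a buffer; the kernel pages data in on demand. Error handling is trimmed for brevity.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const char *mapFile(const char *path, size_t *outLen) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
  *outLen = (size_t)st.st_size;
  // Read-only private mapping: no explicit copy into user buffers.
  void *p = mmap(nullptr, *outLen, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping stays valid after close
  return p == MAP_FAILED ? nullptr : (const char *)p;
}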

宛菡 2024-09-10 19:53:25

Here are some benchmarks for Visual C++ on a Ryzen 1700.

The benchmark copies 16 KiB (non-overlapping) chunks of data from a 128 MiB ring buffer 8*8192 times (in total, 1 GiB of data is copied).

I then normalize the result; here we present wall-clock time in milliseconds and a throughput value for 60 Hz (i.e. how much data this function can process over 16.667 milliseconds).

memcpy                           2.761 milliseconds ( 772.555 MiB/frame)

As you can see the builtin memcpy is fast, but how fast?

64-wide load/store              39.889 milliseconds (  427.853 MiB/frame)
32-wide load/store              33.765 milliseconds (  505.450 MiB/frame)
16-wide load/store              24.033 milliseconds (  710.129 MiB/frame)
 8-wide load/store              23.962 milliseconds (  712.245 MiB/frame)
 4-wide load/store              22.965 milliseconds (  743.176 MiB/frame)
 2-wide load/store              22.573 milliseconds (  756.072 MiB/frame)
 1-wide load/store              35.032 milliseconds (  487.169 MiB/frame)

The above is just the code below with variations of n.

// n is the "wideness" from the benchmark (a compile-time constant)

auto src = (__m128i*)get_src_chunk();
auto dst = (__m128i*)get_dst_chunk();

for (int32_t i = 0; i < (16 * 1024) / (16 * n); i++) {
  __m128i temp[n];

  // Load n 16-byte vectors from the source...
  for (int32_t j = 0; j < n; j++) {
    temp[j] = _mm_loadu_si128(src++);
  }

  // ...then store them to the destination.
  for (int32_t j = 0; j < n; j++) {
    _mm_store_si128(dst++, temp[j]);
  }
}

These are my best guesses for the results that I have. Based on what I know about the Zen microarchitecture, it can only fetch 32 bytes per cycle. That's why we max out at 2x 16-byte loads/stores:

  • The 1x variant loads the bytes into xmm0, 128 bits at a time
  • The 2x variant loads the bytes into ymm0, 256 bits at a time

And that's why the latter is about twice as fast, and internally exactly what memcpy does (or what it should be doing if you enable the right optimizations for your platform).

It is also impossible to make this faster, since we are now limited by the cache bandwidth, which doesn't get any faster. I think this is quite an important fact to point out, because if you are memory bound and looking for a faster solution, you will be looking for a very long time.

滥情空心 2024-09-10 19:53:25

I assume you must have huge areas of memory that you want to copy around, if the performance of memcpy has become an issue for you?

In this case, I'd agree with nos's suggestion to figure out some way NOT to copy stuff.

Instead of having one huge blob of memory to be copied around whenever you need to change it, you should probably try some alternative data structures instead.

Without really knowing anything about your problem area, I would suggest taking a good look at persistent data structures and either implementing one of your own or reusing an existing implementation.

[旋木] 2024-09-10 19:53:25

This function could cause a data abort exception if one of the pointers (input arguments) is not 32-byte aligned.

極樂鬼 2024-09-10 19:53:25

You may want to have a look at this:

http://www.danielvik.com/2010/02/fast-memcpy-in-c.html

Another idea I would try is to use COW techniques to duplicate the memory block and let the OS handle the copying on demand as soon as the page is written to. There are some hints here using mmap(): Can I do a copy-on-write memcpy in Linux?

小镇女孩 2024-09-10 19:53:25

Memory-to-memory moves are usually supported in the CPU's instruction set, and memcpy will usually use them. This is usually the fastest way.

You should check what exactly your CPU is doing. On Linux, watch swap-in/out and virtual memory effectiveness with sar -B 1 or vmstat 1, or by looking in /proc/meminfo. You may see that your copy has to push out a lot of pages to free space, or read them in, etc.

That would mean your problem isn't in what you use for the copy, but in how your system uses memory. You may need to decrease the file cache, start writing out earlier, lock the pages in memory, etc.
