更快的 memcpy 替代品？

み青杉依旧 2024-09-10 19:53:25

memcpy 可能是在内存中复制字节的最快方法。如果您需要更快的速度 - 尝试找出一种不复制内容的方法，例如仅交换指针，而不是数据本身。

小ぇ时光︴ 2024-09-10 19:53:25

这是针对具有 AVX2 指令集的 x86_64 的答案。尽管类似的情况可能适用于带有 SIMD 的 ARM/AArch64。

在单内存通道完全填满（2 个插槽，每个插槽 16 GB DDR4）的 Ryzen 1800X 上，以下代码比 MSVC++2017 编译器上的 memcpy() 快 1.56 倍。如果您用 2 个 DDR4 模块填充两个内存通道，即所有 4 个 DDR4 插槽均处于繁忙状态，则内存复制速度可能会进一步提高 2 倍。对于三（四）通道内存系统，如果将代码扩展为类似的 AVX512 代码，则内存复制速度可以进一步提高 1.5（2.0）倍。对于所有插槽都繁忙的仅 AVX2 三通道/四通道系统，预计不会更快，因为要完全加载它们，您需要一次加载/存储超过 32 个字节（三通道为 48 字节，四通道为 64 字节）系统），而 AVX2 一次可以加载/存储不超过 32 个字节。尽管某些系统上的多线程可以在没有 AVX512 甚至 AVX2 的情况下缓解这一问题。

这里的复制代码假设您正在复制一个大的内存块，其大小是 32 的倍数，并且该块是 32 字节对齐的。

对于非多种大小和非对齐块，可以编写序言/结尾代码，将块头和尾部的宽度减少到 16 (SSE4.1)、8、4、2，最后一次减少到 1 个字节。另外，在中间，包含 2-3 个 __m256i 值的本地数组可以用作来自源的对齐读取和对目标的对齐写入之间的代理。

#include <immintrin.h>
#include <cstdint>
/* ... */
void fastMemcpy(void *pvDest, void *pvSrc, size_t nBytes) {
  assert(nBytes % 32 == 0);
  assert((intptr_t(pvDest) & 31) == 0);
  assert((intptr_t(pvSrc) & 31) == 0);
  const __m256i *pSrc = reinterpret_cast<const __m256i*>(pvSrc);
  __m256i *pDest = reinterpret_cast<__m256i*>(pvDest);
  int64_t nVects = nBytes / sizeof(*pSrc);
  for (; nVects > 0; nVects--, pSrc++, pDest++) {
    const __m256i loaded = _mm256_stream_load_si256(pSrc);
    _mm256_stream_si256(pDest, loaded);
  }
  _mm_sfence();
}

该代码的一个关键特性是它在复制时跳过CPU缓存：当涉及CPU缓存时（即使用不带_stream_的AVX指令），复制速度在我的系统上下降数倍。

我的DDR4内存是2.6GHz CL13。因此，当将 8GB 数据从一个数组复制到另一个数组时，我得到了以下速度：

memcpy(): 17,208,004,271 bytes/sec.
Stream copy: 26,842,874,528 bytes/sec.

请注意，在这些测量中，输入和输出缓冲区的总大小除以经过的秒数。因为对于数组的每个字节都有 2 次内存访问：一次从输入数组读取字节，另一次将字节写入输出数组。换句话说，当将 8GB 从一个阵列复制到另一个阵列时，您会执行相当于 16GB 的内存访问操作。

适度的多线程可以进一步提高性能约 1.44 倍，因此在我的机器上相对于 memcpy() 的总提升达到 2.55 倍。
以下是流复制性能如何取决于我的机器上使用的线程数：

Stream copy 1 threads: 27114820909.821 bytes/sec
Stream copy 2 threads: 37093291383.193 bytes/sec
Stream copy 3 threads: 39133652655.437 bytes/sec
Stream copy 4 threads: 39087442742.603 bytes/sec
Stream copy 5 threads: 39184708231.360 bytes/sec
Stream copy 6 threads: 38294071248.022 bytes/sec
Stream copy 7 threads: 38015877356.925 bytes/sec
Stream copy 8 threads: 38049387471.070 bytes/sec
Stream copy 9 threads: 38044753158.979 bytes/sec
Stream copy 10 threads: 37261031309.915 bytes/sec
Stream copy 11 threads: 35868511432.914 bytes/sec
Stream copy 12 threads: 36124795895.452 bytes/sec
Stream copy 13 threads: 36321153287.851 bytes/sec
Stream copy 14 threads: 36211294266.431 bytes/sec
Stream copy 15 threads: 35032645421.251 bytes/sec
Stream copy 16 threads: 33590712593.876 bytes/sec

代码是：

void AsyncStreamCopy(__m256i *pDest, const __m256i *pSrc, int64_t nVects) {
  for (; nVects > 0; nVects--, pSrc++, pDest++) {
    const __m256i loaded = _mm256_stream_load_si256(pSrc);
    _mm256_stream_si256(pDest, loaded);
  }
}

void BenchmarkMultithreadStreamCopy(double *gpdOutput, const double *gpdInput, const int64_t cnDoubles) {
  assert((cnDoubles * sizeof(double)) % sizeof(__m256i) == 0);
  const uint32_t maxThreads = std::thread::hardware_concurrency();
  std::vector<std::thread> thrs;
  thrs.reserve(maxThreads + 1);

  const __m256i *pSrc = reinterpret_cast<const __m256i*>(gpdInput);
  __m256i *pDest = reinterpret_cast<__m256i*>(gpdOutput);
  const int64_t nVects = cnDoubles * sizeof(*gpdInput) / sizeof(*pSrc);

  for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
    auto start = std::chrono::high_resolution_clock::now();
    lldiv_t perWorker = div((long long)nVects, (long long)nThreads);
    int64_t nextStart = 0;
    for (uint32_t i = 0; i < nThreads; i++) {
      const int64_t curStart = nextStart;
      nextStart += perWorker.quot;
      if ((long long)i < perWorker.rem) {
        nextStart++;
      }
      thrs.emplace_back(AsyncStreamCopy, pDest + curStart, pSrc+curStart, nextStart-curStart);
    }
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs[i].join();
    }
    _mm_sfence();
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("Stream copy %d threads: %.3lf bytes/sec\n", (int)nThreads, cnDoubles * 2 * sizeof(double) / nSec);

    thrs.clear();
  }
}

更新 2023-01-18：
我不再拥有该系统，但 2666MHz DDR4 标记为 PC4-21300U，这意味着一个 RAM 插槽的 22334668800 字节/秒。由于我有 2 个 RAM 插槽，因此最大带宽为 44669337600 字节/秒。使用 SIMD 和多线程的方法在使用 5 个线程时实现了理论带宽的 87.72%。

This is an answer for x86_64 with AVX2 instruction set present. Though something similar may apply for ARM/AArch64 with SIMD.

On Ryzen 1800X with single memory channel filled completely (2 slots, 16 GB DDR4 in each), the following code is 1.56 times faster than memcpy() on MSVC++2017 compiler. If you fill both memory channels with 2 DDR4 modules, i.e. you have all 4 DDR4 slots busy, you may get further 2 times faster memory copying. For triple-(quad-)channel memory systems, you can get further 1.5(2.0) times faster memory copying if the code is extended to analogous AVX512 code. With AVX2-only triple/quad channel systems with all slots busy are not expected to be faster because to load them fully you need to load/store more than 32 bytes at once (48 bytes for triple- and 64-bytes for quad-channel systems), while AVX2 can load/store no more than 32 bytes at once. Though multithreading on some systems can alleviate this without AVX512 or even AVX2.

So here is the copy code that assumes you are copying a large block of memory whose size is a multiple of 32 and the block is 32-byte aligned.

For non-multiple size and non-aligned blocks, prologue/epilogue code can be written reducing the width to 16 (SSE4.1), 8, 4, 2 and finally 1 byte at once for the block head and tail. Also in the middle a local array of 2-3 __m256i values can be used as a proxy between aligned reads from the source and aligned writes to the destination.

#include <immintrin.h>
#include <cstdint>
/* ... */
void fastMemcpy(void *pvDest, void *pvSrc, size_t nBytes) {
  assert(nBytes % 32 == 0);
  assert((intptr_t(pvDest) & 31) == 0);
  assert((intptr_t(pvSrc) & 31) == 0);
  const __m256i *pSrc = reinterpret_cast<const __m256i*>(pvSrc);
  __m256i *pDest = reinterpret_cast<__m256i*>(pvDest);
  int64_t nVects = nBytes / sizeof(*pSrc);
  for (; nVects > 0; nVects--, pSrc++, pDest++) {
    const __m256i loaded = _mm256_stream_load_si256(pSrc);
    _mm256_stream_si256(pDest, loaded);
  }
  _mm_sfence();
}

A key feature of this code is that it skips CPU cache when copying: when CPU cache is involved (i.e. AVX instructions without _stream_ are used), the copy speed drops several times on my system.

My DDR4 memory is 2.6GHz CL13 . So when copying 8GB of data from one array to another I got the following speeds:

memcpy(): 17,208,004,271 bytes/sec.
Stream copy: 26,842,874,528 bytes/sec.

Note that in these measurements the total size of both input and output buffers is divided by the number of seconds elapsed. Because for each byte of the array there are 2 memory accesses: one to read the byte from the input array, another to write the byte to the output array. In the other words, when copying 8GB from one array to another, you do 16GB worth of memory access operations.

Moderate multithreading can further improve performance about 1.44 times, so total increase over memcpy() reaches 2.55 times on my machine.
Here's how stream copy performance depends on the number of threads used on my machine:

Stream copy 1 threads: 27114820909.821 bytes/sec
Stream copy 2 threads: 37093291383.193 bytes/sec
Stream copy 3 threads: 39133652655.437 bytes/sec
Stream copy 4 threads: 39087442742.603 bytes/sec
Stream copy 5 threads: 39184708231.360 bytes/sec
Stream copy 6 threads: 38294071248.022 bytes/sec
Stream copy 7 threads: 38015877356.925 bytes/sec
Stream copy 8 threads: 38049387471.070 bytes/sec
Stream copy 9 threads: 38044753158.979 bytes/sec
Stream copy 10 threads: 37261031309.915 bytes/sec
Stream copy 11 threads: 35868511432.914 bytes/sec
Stream copy 12 threads: 36124795895.452 bytes/sec
Stream copy 13 threads: 36321153287.851 bytes/sec
Stream copy 14 threads: 36211294266.431 bytes/sec
Stream copy 15 threads: 35032645421.251 bytes/sec
Stream copy 16 threads: 33590712593.876 bytes/sec

The code is:

void AsyncStreamCopy(__m256i *pDest, const __m256i *pSrc, int64_t nVects) {
  for (; nVects > 0; nVects--, pSrc++, pDest++) {
    const __m256i loaded = _mm256_stream_load_si256(pSrc);
    _mm256_stream_si256(pDest, loaded);
  }
}

void BenchmarkMultithreadStreamCopy(double *gpdOutput, const double *gpdInput, const int64_t cnDoubles) {
  assert((cnDoubles * sizeof(double)) % sizeof(__m256i) == 0);
  const uint32_t maxThreads = std::thread::hardware_concurrency();
  std::vector<std::thread> thrs;
  thrs.reserve(maxThreads + 1);

  const __m256i *pSrc = reinterpret_cast<const __m256i*>(gpdInput);
  __m256i *pDest = reinterpret_cast<__m256i*>(gpdOutput);
  const int64_t nVects = cnDoubles * sizeof(*gpdInput) / sizeof(*pSrc);

  for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
    auto start = std::chrono::high_resolution_clock::now();
    lldiv_t perWorker = div((long long)nVects, (long long)nThreads);
    int64_t nextStart = 0;
    for (uint32_t i = 0; i < nThreads; i++) {
      const int64_t curStart = nextStart;
      nextStart += perWorker.quot;
      if ((long long)i < perWorker.rem) {
        nextStart++;
      }
      thrs.emplace_back(AsyncStreamCopy, pDest + curStart, pSrc+curStart, nextStart-curStart);
    }
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs[i].join();
    }
    _mm_sfence();
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("Stream copy %d threads: %.3lf bytes/sec\n", (int)nThreads, cnDoubles * 2 * sizeof(double) / nSec);

    thrs.clear();
  }
}

UPDATE 2023-01-18:
I don't have that system anymore, but the 2666MHz DDR4 is marked PC4-21300U, meaning 22334668800 bytes/second from one RAM slot. As I had 2 RAM slots, the max bandwidth was 44669337600 bytes/second. And the approach with SIMD and multithreading achieved 87.72% of the theoretical bandwidth when using 5 threads.

更快的 memcpy 替代品？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（16）

关于作者

相关话题

热门标签

推荐作者

佚名

羁客

天天爱笑的徐老师

星

夏日落

隐诗

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。