How to increase performance of memcpy

Posted on 2024-10-03 20:57:41

Summary:

memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?

Full details:

As part of a data capture application (using some specialized hardware), I need to copy about 3 GB/sec from temporary buffers into main memory. To acquire data, I provide the hardware driver with a series of buffers (2MB each). The hardware DMAs data to each buffer, and then notifies my program when each buffer is full. My program empties the buffer (memcpy to another, larger block of RAM), and reposts the processed buffer to the card to be filled again. I am having issues with memcpy moving the data fast enough. It seems that the memory-to-memory copy should be fast enough to support 3GB/sec on the hardware that I am running on. Lavalys EVEREST gives me a 9337MB/sec memory copy benchmark result, but I can't get anywhere near those speeds with memcpy, even in a simple test program.

I have isolated the performance issue by adding/removing the memcpy call inside the buffer processing code. Without the memcpy, I can run at the full data rate, about 3GB/sec. With the memcpy enabled, I am limited to about 550MB/sec (using my current compiler).

In order to benchmark memcpy on my system, I've written a separate test program that just calls memcpy on some blocks of data. (I've posted the code below) I've run this both in the compiler/IDE that I'm using (National Instruments CVI) as well as Visual Studio 2010. While I'm not currently using Visual Studio, I am willing to make the switch if it will yield the necessary performance. However, before blindly moving over, I wanted to make sure that it would solve my memcpy performance problems.

Visual C++ 2010: 1900 MB/sec

NI CVI 2009: 550 MB/sec

While I am not surprised that CVI is significantly slower than Visual Studio, I am surprised that the memcpy performance is this low. While I'm not sure if this is directly comparable, this is much lower than the EVEREST benchmark bandwidth. While I don't need quite that level of performance, a minimum of 3GB/sec is necessary. Surely the standard library implementation can't be this much worse than whatever EVEREST is using!

What, if anything, can I do to make memcpy faster in this situation?

Hardware details:
AMD Magny Cours, 4x octal core
128 GB DDR3
Windows Server 2003 Enterprise X64

Test program:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>  // malloc, free, rand
#include <string.h>  // memcpy

const size_t NUM_ELEMENTS = 2*1024 * 1024;
const size_t ITERATIONS = 10000;

int main (int argc, char *argv[])
{
    LARGE_INTEGER start, stop, frequency;

    QueryPerformanceFrequency(&frequency);

    unsigned short * src = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);
    unsigned short * dest = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);

    // Fill the source with data; size_t matches the type of NUM_ELEMENTS
    for(size_t ctr = 0; ctr < NUM_ELEMENTS; ctr++)
    {
        src[ctr] = (unsigned short) rand();
    }

    QueryPerformanceCounter(&start);

    for(int iter = 0; iter < ITERATIONS; iter++)
        memcpy(dest, src, NUM_ELEMENTS * sizeof(unsigned short));

    QueryPerformanceCounter(&stop);

    __int64 duration = stop.QuadPart - start.QuadPart;

    double duration_d = (double)duration / (double) frequency.QuadPart;

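    // Each iteration copies 4 MB, so this value is in MB/sec (not bytes/sec)
    double mb_sec = (ITERATIONS * (NUM_ELEMENTS/1024/1024) * sizeof(unsigned short)) / duration_d;

    // ITERATIONS is a size_t, so cast it to match the %d conversion
    printf("Duration: %.5fs for %d iterations, %.3fMB/sec\n", duration_d, (int) ITERATIONS, mb_sec);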

    free(src);
    free(dest);

    getchar();

    return 0;
}

EDIT: If you have an extra five minutes and want to contribute, can you run the above code on your machine and post your time as a comment?
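
A note on the throughput arithmetic in the test program: it divides `NUM_ELEMENTS` by 1024 twice using integer division, which is exact for the 4 MB block used here but would truncate for block sizes that are not a whole number of megabytes. A minimal sketch of the same calculation in floating point follows; the helper name `throughput_mib_per_sec` is hypothetical and not part of the original program.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the original test program): converts one
   benchmark run into MiB per second using floating-point arithmetic, so
   block sizes that are not a multiple of 1 MiB are not truncated. */
double throughput_mib_per_sec(size_t bytes_per_copy,
                              size_t iterations,
                              double seconds)
{
    double total_bytes;

    assert(seconds > 0.0);
    total_bytes = (double)bytes_per_copy * (double)iterations;
    return total_bytes / (1024.0 * 1024.0) / seconds;
}
```

In the test program this would be called with `NUM_ELEMENTS * sizeof(unsigned short)`, `ITERATIONS`, and `duration_d`.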

一生独一 2024-10-10 20:57:41

I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a set block size, using the same timing code as found above. I had no idea that the performance, especially for this small size of block, would scale to this many threads. I suspect that this has something to do with the large number of memory controllers (16) on this machine.

Performance (10000x 4MB block memcpy):

 1 thread :  1826 MB/sec
 2 threads:  3118 MB/sec
 3 threads:  4121 MB/sec
 4 threads: 10020 MB/sec
 5 threads: 12848 MB/sec
 6 threads: 14340 MB/sec
 8 threads: 17892 MB/sec
10 threads: 21781 MB/sec
12 threads: 25721 MB/sec
14 threads: 25318 MB/sec
16 threads: 19965 MB/sec
24 threads: 13158 MB/sec
32 threads: 12497 MB/sec

I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?

I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code; it may need to be added for your application.

#include <windows.h>
#include <string.h>  //memcpy

#define NUM_CPY_THREADS 4

HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};
typedef struct
{
    int ct;
    void * src, * dest;
    size_t size;
} mt_cpy_t;

mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};

DWORD WINAPI thread_copy_proc(LPVOID param)
{
    mt_cpy_t * p = (mt_cpy_t * ) param;

    while(1)
    {
        WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
        memcpy(p->dest, p->src, p->size);
        ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
    }

    return 0;
}

int startCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        mtParamters[ctr].ct = ctr;
        hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL); 
    }

    return 0;
}

void * mt_memcpy(void * dest, void * src, size_t bytes)
{
    //set up parameters
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
    }

    //release semaphores to start computation
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

    //wait for all threads to finish
    WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);

    return dest;
}

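int stopCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        //TerminateThread stops the worker abruptly; only call this while the
        //workers are idle in WaitForSingleObject, never during a copy
        TerminateThread(hCopyThreads[ctr], 0);
        CloseHandle(hCopyThreads[ctr]); //release the thread handle as well
        CloseHandle(hCopyStartSemaphores[ctr]);
        CloseHandle(hCopyStopSemaphores[ctr]);
    }
    return 0;
}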
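
The offset and size arithmetic in mt_memcpy can be checked on its own, without threads. The sketch below is a single-threaded stand-in under stated assumptions: `chunk_t`, `chunk_for`, and `chunked_memcpy` are hypothetical names, and the per-chunk copies run one after another rather than in parallel. It only shows that the split covers every byte exactly once, including when the size does not divide evenly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte range handled by one worker. */
typedef struct {
    size_t offset;
    size_t size;
} chunk_t;

/* Range for worker `index` when `bytes` are split across `workers` copies.
   Uses the same begin/end arithmetic as mt_memcpy. Note that
   index * bytes can overflow for extremely large sizes. */
chunk_t chunk_for(size_t index, size_t workers, size_t bytes)
{
    chunk_t c;
    size_t begin, end;

    assert(workers > 0 && index < workers);
    begin = index * bytes / workers;
    end = (index + 1) * bytes / workers;
    c.offset = begin;
    c.size = end - begin;
    return c;
}

/* Sequential stand-in for the threaded copy: issues the same per-chunk
   memcpy calls in order, which is enough to validate the split. */
void *chunked_memcpy(void *dest, const void *src, size_t bytes, size_t workers)
{
    size_t i;

    for (i = 0; i < workers; i++) {
        chunk_t c = chunk_for(i, workers, bytes);
        memcpy((char *)dest + c.offset, (const char *)src + c.offset, c.size);
    }
    return dest;
}
```

For example, splitting 10 bytes across 4 workers gives chunks of 2, 3, 2, and 3 bytes, so no byte is skipped or copied twice.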

祁梦 2024-10-10 20:57:41

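I'm not sure whether it's done at run time or has to be done at compile time, but you should have SSE or similar extensions enabled, since the vector unit can often write 128 bits to memory at a time, compared to 64 bits for the CPU.

~~Try this implementation.~~

Yes, and make sure that both the source and the destination are aligned to 128 bits. If your source and destination are not aligned relative to each other, your memcpy() will have to do some serious magic. :)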
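
As a rough illustration of the alignment advice above, the following sketch checks whether a pointer sits on a 128-bit (16-byte) boundary and whether two pointers are aligned relative to each other. The function names are hypothetical, and converting a pointer to `uintptr_t` to inspect its low bits is common practice rather than something the C standard guarantees in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `p` is on a 16-byte (128-bit) boundary, else 0. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Returns 1 if `a` and `b` have the same offset within a 16-byte block,
   i.e. they are aligned relative to each other even if neither one is
   aligned on its own. */
int same_relative_alignment_16(const void *a, const void *b)
{
    return ((uintptr_t)a % 16u) == ((uintptr_t)b % 16u);
}

/* Rounds `p` up to the next 16-byte boundary. The caller must make sure
   the underlying buffer has at least 15 spare bytes. */
void *align_up_16(void *p)
{
    uintptr_t a = ((uintptr_t)p + 15u) & ~(uintptr_t)15u;

    assert(a % 16u == 0);
    return (void *)a;
}
```

When the two buffers share the same relative alignment, a copy routine can handle a few leading bytes and then use aligned 128-bit loads and stores for the rest; when they do not, one side of every transfer is unaligned.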

童话 2024-10-10 20:57:41

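One thing to be aware of is that your process (and hence the performance of memcpy()) is affected by the OS scheduling of tasks - it's hard to say how much of a factor this is in your timings, but it is difficult to control. The device DMA operation isn't subject to this, since it isn't running on the CPU once it's kicked off. Since your application is an actual real-time application, though, you might want to experiment with Windows' process/thread priority settings if you haven't already. Just keep in mind that you have to be careful with this, because it can have a really negative impact on other processes (and on the user experience on the machine).

Another thing to keep in mind is that OS memory virtualization might have an impact here - if the memory pages you're copying to aren't actually backed by physical RAM pages, the memcpy() operation will fault to the OS to get that physical backing in place. Your DMA pages are likely to be locked into physical memory (since they have to be for the DMA operation), so the source memory for memcpy() is probably not an issue in this regard. You might consider using the Win32 VirtualAlloc() API to ensure that the destination memory for the memcpy() is committed (I think VirtualAlloc() is the right API for this, but there might be a better one that I'm forgetting - it's been a while since I've needed to do anything like this).

Finally, see if you can use the technique explained by Skizz to avoid the memcpy() altogether - that's your best bet if resources permit.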
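
One portable way to act on the page-fault point above is to touch every page of the destination buffer once before the timed copy, so the faults happen during setup rather than inside memcpy(). The sketch below is an illustration under assumptions: `prefault` is a hypothetical helper, the page size is passed in by the caller (4096 bytes is common but not guaranteed), and the helper overwrites one byte per page, so it is only suitable for a destination buffer whose contents do not matter yet.

```c
#include <assert.h>
#include <stddef.h>

/* Writes one byte at the start of each page so the OS backs every page of
   `buf` with physical memory before the timed copy begins. Returns the
   number of pages touched. `page_size` is an assumption supplied by the
   caller. The volatile qualifier keeps the compiler from dropping the
   stores. */
size_t prefault(volatile unsigned char *buf, size_t bytes, size_t page_size)
{
    size_t off;
    size_t pages = 0;

    assert(page_size > 0);
    for (off = 0; off < bytes; off += page_size) {
        buf[off] = 0;
        pages++;
    }
    return pages;
}
```

In the benchmark from the question, this would be called on `dest` right after allocation and before `QueryPerformanceCounter(&start)`.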

晌融 2024-10-10 20:57:41

You have a few barriers to obtaining the required memory performance:

Bandwidth - there is a limit to how quickly data can move from memory to the CPU and back again. According to this Wikipedia article, 266MHz DDR3 RAM has an upper limit of around 17GB/s. With a memcpy you need to halve this to get your maximum transfer rate, since the data is read and then written. From your benchmark results, it looks like you're not running the fastest possible RAM in your system. If you can afford it, upgrade the motherboard / RAM (and it won't be cheap; Overclockers in the UK currently have 3x4GB PC16000 at £400).
The OS - Windows is a preemptive multitasking OS, so every so often your process will be suspended to let other processes run. This will clobber your caches and stall your transfer. In the worst case your entire process could be paged out to disk!
The CPU - the data being moved has a long way to go: RAM -> L2 Cache -> L1 Cache -> CPU -> L1 -> L2 -> RAM. There may even be an L3 cache. If you want to involve the CPU, you really want to be loading L2 whilst copying L1. Unfortunately, modern CPUs can run through an L1 cache block quicker than the time taken to load the L1. The CPU has a memory controller that helps a lot in these cases, where you're streaming data into the CPU sequentially, but you're still going to have problems.
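
As a worked version of the bandwidth point, taking the 17GB/s figure quoted above at face value (it is the answer's number, not a measurement of this system): every copied byte crosses the memory bus twice, once as a read and once as a write, so the copy rate is bounded by half the peak bandwidth.

```latex
B_{\text{copy}} \;\le\; \frac{B_{\text{peak}}}{2}
             \;=\; \frac{17\ \text{GB/s}}{2}
             \;=\; 8.5\ \text{GB/s}
```

That bound is well above the 3GB/sec the question needs, and it also sits well above the roughly 1.9GB/sec measured with a single-threaded memcpy.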

Of course, the fastest way to do something is to not do it. Can the captured data be written anywhere in RAM, or is the buffer at a fixed location? If you can write it anywhere, then you don't need the memcpy at all. If it's fixed, could you process the data in place and use a double-buffer type system? That is, start capturing data and, when the buffer is half full, start processing the first half of the data. When the buffer is full, start writing captured data to the start again and process the second half. This requires that the algorithm can process the data faster than the capture card produces it. It also assumes that the data is discarded after processing. Effectively, this is a memcpy with a transformation as part of the copy process, so you've got:

load -> transform -> save
\--/                 \--/
 capture card        RAM
   buffer

instead of:

load -> save -> load -> transform -> save
\-----------/
memcpy from
capture card
buffer to RAM

Or get faster RAM!
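
The load -> transform -> save pass from the diagrams above can be sketched as a single loop that reads each sample from the capture buffer, transforms it, and writes the result to its final location. Everything named here is an assumption for illustration: `transform_sample` is a placeholder (a byte swap of each 16-bit sample, assuming `unsigned short` is 16 bits wide), and `copy_then_transform` exists only to show that the two-pass version gives the same result while touching the destination twice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder transform (an assumption): swap the two bytes of a sample. */
unsigned short transform_sample(unsigned short s)
{
    return (unsigned short)((s << 8) | (s >> 8));
}

/* One pass: load from the capture buffer, transform, save to RAM. */
void copy_transform(unsigned short *dest, const unsigned short *src,
                    size_t count)
{
    size_t i;

    assert(dest != NULL && src != NULL);
    for (i = 0; i < count; i++)
        dest[i] = transform_sample(src[i]);
}

/* Two passes, for comparison: memcpy first, then transform in place. */
void copy_then_transform(unsigned short *dest, const unsigned short *src,
                         size_t count)
{
    size_t i;

    assert(dest != NULL && src != NULL);
    memcpy(dest, src, count * sizeof *src);
    for (i = 0; i < count; i++)
        dest[i] = transform_sample(dest[i]);
}
```

Both functions produce identical output; the fused version simply avoids the extra load and save of every sample.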

EDIT: Another option is to process the data between the data source and the PC - could you put a DSP / FPGA in there at all? Dedicated hardware is usually faster than a general-purpose CPU at a fixed task like this.

Another thought: It's been a while since I've done any high performance graphics stuff, but could you DMA the data into the graphics card and then DMA it out again? You could even take advantage of CUDA to do some of the processing. This would take the CPU out of the memory transfer loop altogether.
