How to improve memcpy performance

Posted 2024-10-03 20:57:41

Summary:

memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?

Full details:

As part of a data capture application (using some specialized hardware), I need to copy about 3 GB/sec from temporary buffers into main memory. To acquire data, I provide the hardware driver with a series of buffers (2MB each). The hardware DMAs data to each buffer, and then notifies my program when each buffer is full. My program empties the buffer (memcpy to another, larger block of RAM), and reposts the processed buffer to the card to be filled again. I am having issues with memcpy moving the data fast enough. It seems that the memory-to-memory copy should be fast enough to support 3GB/sec on the hardware that I am running on. Lavalys EVEREST gives me a 9337MB/sec memory copy benchmark result, but I can't get anywhere near those speeds with memcpy, even in a simple test program.

I have isolated the performance issue by adding/removing the memcpy call inside the buffer processing code. Without the memcpy, I can run at the full data rate, about 3GB/sec. With the memcpy enabled, I am limited to about 550MB/sec (using the current compiler).

In order to benchmark memcpy on my system, I've written a separate test program that just calls memcpy on some blocks of data. (I've posted the code below) I've run this both in the compiler/IDE that I'm using (National Instruments CVI) as well as Visual Studio 2010. While I'm not currently using Visual Studio, I am willing to make the switch if it will yield the necessary performance. However, before blindly moving over, I wanted to make sure that it would solve my memcpy performance problems.

Visual C++ 2010: 1900 MB/sec

NI CVI 2009: 550 MB/sec

While I am not surprised that CVI is significantly slower than Visual Studio, I am surprised that the memcpy performance is this low. While I'm not sure if this is directly comparable, this is much lower than the EVEREST benchmark bandwidth. While I don't need quite that level of performance, a minimum of 3GB/sec is necessary. Surely the standard library implementation can't be this much worse than whatever EVEREST is using!

What, if anything, can I do to make memcpy faster in this situation?


Hardware details:
AMD Magny Cours - 4x octal core
128 GB DDR3
Windows Server 2003 Enterprise X64

Test program:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>   // malloc, free, rand
#include <string.h>   // memcpy

const size_t NUM_ELEMENTS = 2*1024 * 1024;
const size_t ITERATIONS = 10000;

int main (int argc, char *argv[])
{
    LARGE_INTEGER start, stop, frequency;

    QueryPerformanceFrequency(&frequency);

    unsigned short * src = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);
    unsigned short * dest = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);

    for(int ctr = 0; ctr < NUM_ELEMENTS; ctr++)
    {
        src[ctr] = rand();
    }

    QueryPerformanceCounter(&start);

    for(int iter = 0; iter < ITERATIONS; iter++)
        memcpy(dest, src, NUM_ELEMENTS * sizeof(unsigned short));

    QueryPerformanceCounter(&stop);

    __int64 duration = stop.QuadPart - start.QuadPart;

    double duration_d = (double)duration / (double) frequency.QuadPart;

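    // Throughput in MB/sec, computed in floating point so it stays correct
    // for block sizes that are not a whole number of megabytes
    double mb_sec = ((double)ITERATIONS * NUM_ELEMENTS * sizeof(unsigned short)) / (1024.0 * 1024.0) / duration_d;

    printf("Duration: %.5lfs for %u iterations, %.3lfMB/sec\n", duration_d, (unsigned)ITERATIONS, mb_sec);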

    free(src);
    free(dest);

    getchar();

    return 0;
}

EDIT: If you have an extra five minutes and want to contribute, can you run the above code on your machine and post your time as a comment?

Comments (8)

一生独一 2024-10-10 20:57:41

I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a set block size, using the same timing code as found above. I had no idea that the performance, especially for this small size of block, would scale to this many threads. I suspect that this has something to do with the large number of memory controllers (16) on this machine.

Performance (10000x 4MB block memcpy):

 1 thread :  1826 MB/sec
 2 threads:  3118 MB/sec
 3 threads:  4121 MB/sec
 4 threads: 10020 MB/sec
 5 threads: 12848 MB/sec
 6 threads: 14340 MB/sec
 8 threads: 17892 MB/sec
10 threads: 21781 MB/sec
12 threads: 25721 MB/sec
14 threads: 25318 MB/sec
16 threads: 19965 MB/sec
24 threads: 13158 MB/sec
32 threads: 12497 MB/sec

I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?

I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code; you may need to add it for your application.

#define NUM_CPY_THREADS 4

HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};
typedef struct
{
    int ct;
    void * src, * dest;
    size_t size;
} mt_cpy_t;

mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};

DWORD WINAPI thread_copy_proc(LPVOID param)
{
    mt_cpy_t * p = (mt_cpy_t * ) param;

    while(1)
    {
        WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
        memcpy(p->dest, p->src, p->size);
        ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
    }

    return 0;
}

int startCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        mtParamters[ctr].ct = ctr;
        hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL); 
    }

    return 0;
}

void * mt_memcpy(void * dest, void * src, size_t bytes)
{
    //set up parameters
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
    }

    //release semaphores to start computation
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

    //wait for all threads to finish
    WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);

    return dest;
}

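int stopCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        // TerminateThread stops the worker abruptly; this assumes no copy is
        // in flight, i.e. the worker is blocked on its start semaphore
        TerminateThread(hCopyThreads[ctr], 0);
        CloseHandle(hCopyThreads[ctr]);  // release the thread handle as well
        CloseHandle(hCopyStartSemaphores[ctr]);
        CloseHandle(hCopyStopSemaphores[ctr]);
    }
    return 0;
}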
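
To make the partitioning in mt_memcpy easier to check, here is a minimal sketch of just the slice arithmetic, pulled out into two helpers. The names slice_begin and slice_size are hypothetical and not part of the code above; the formulas are the same ones used to fill in dest, src and size for each thread.

```c
#include <stddef.h>

/* Byte offset at which slice `index` begins when `bytes` bytes are divided
   among `parts` workers. Hypothetical helper mirroring the expression
   ctr * bytes / NUM_CPY_THREADS used in mt_memcpy. Note that index * bytes
   can overflow size_t for extremely large copies. */
size_t slice_begin(size_t bytes, size_t parts, size_t index)
{
    return index * bytes / parts;
}

/* Length of slice `index`: the distance to the start of the next slice.
   Computing it this way means the slices always add up to `bytes`,
   even when `bytes` is not a multiple of `parts`. */
size_t slice_size(size_t bytes, size_t parts, size_t index)
{
    return slice_begin(bytes, parts, index + 1) - slice_begin(bytes, parts, index);
}
```

For example, splitting 10 bytes among 4 workers gives slices of 2, 3, 2 and 3 bytes, so no byte is dropped or copied twice.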
祁梦 2024-10-10 20:57:41

I'm not sure if it's done at run time or if you have to do it at compile time, but you should have SSE or similar extensions enabled, as the vector unit can often write 128 bits to memory at a time, compared to 64 bits for the CPU.

Try this implementation.

Yeah, and make sure that both the source and destination are aligned to 128 bits. If your source and destination are not aligned with respect to each other, your memcpy() will have to do some serious magic. :)

童话 2024-10-10 20:57:41

One thing to be aware of is that your process (and hence the performance of memcpy()) is impacted by the OS scheduling of tasks - it's hard to say how much of a factor this is in your timings, but it is difficult to control. The device DMA operation isn't subject to this, since it isn't running on the CPU once it's kicked off. Since your application is an actual real-time application though, you might want to experiment with Windows' process/thread priority settings if you haven't already. Just keep in mind that you have to be careful about this because it can have a really negative impact on other processes (and the user experience on the machine).

Another thing to keep in mind is that the OS memory virtualization might have an impact here - if the memory pages you're copying to aren't actually backed by physical RAM pages, the memcpy() operation will fault to the OS to get that physical backing in place. Your DMA pages are likely to be locked into physical memory (since they have to be for the DMA operation), so the source memory to memcpy() is likely not an issue in this regard. You might consider using the Win32 VirtualAlloc() API to ensure that your destination memory for the memcpy() is committed (I think VirtualAlloc() is the right API for this, but there might be a better one that I'm forgetting - it's been a while since I've had a need to do anything like this).
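
One common way to take page faults out of the copy path is to touch every page of the destination once, before the data starts arriving. The sketch below is only an illustration of that idea in portable C: the function name prefault and the page_size parameter are assumptions of mine (4096 bytes is typical on x86, but the real value should be queried from the OS), and it is not a substitute for whatever locking or commit API the platform provides.

```c
#include <stddef.h>

/* Write one byte into every page of `buf` so the OS has to back each page
   with physical memory now rather than in the middle of a timed copy.
   This overwrites the touched bytes with 0, so it is meant for a
   destination buffer that has not been filled yet.
   Returns the number of pages touched. */
size_t prefault(void *buf, size_t bytes, size_t page_size)
{
    volatile unsigned char *p = (volatile unsigned char *)buf;
    size_t touched = 0;
    size_t off;

    if (page_size == 0)
        return 0;  /* avoid an endless loop on a bad argument */

    for (off = 0; off < bytes; off += page_size) {
        p[off] = 0;  /* volatile keeps the compiler from dropping the write */
        touched++;
    }
    return touched;
}
```

Calling something like prefault(dest, dest_bytes, 4096) once at startup moves the first-touch cost out of the capture loop; whether it makes a measurable difference has to be checked on the actual machine.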

Finally, see if you can use the technique explained by Skizz to avoid the memcpy() altogether - that's your best bet if resources permit.

晌融 2024-10-10 20:57:41

You have a few barriers to obtaining the required memory performance:

  1. Bandwidth - there is a limit to how quickly data can move from memory to the CPU and back again. According to this Wikipedia article, 266MHz DDR3 RAM has an upper limit of around 17GB/s. Now, with a memcpy you need to halve this to get your maximum transfer rate, since the data is read and then written. From your benchmark results, it looks like you're not running the fastest possible RAM in your system. If you can afford it, upgrade the motherboard / RAM (and it won't be cheap; Overclockers in the UK currently have 3x4GB PC16000 at £400).

  2. The OS - Windows is a preemptive multitasking OS, so every so often your process will be suspended to allow other processes to have a look-in and do stuff. This will clobber your caches and stall your transfer. In the worst case your entire process could be paged out to disk!

  3. The CPU - the data being moved has a long way to go: RAM -> L2 Cache -> L1 Cache -> CPU -> L1 -> L2 -> RAM. There may even be an L3 cache. If you want to involve the CPU you really want to be loading L2 whilst copying L1. Unfortunately, modern CPUs can run through an L1 cache block quicker than the time taken to load the L1. The CPU has a memory controller that helps a lot in these cases where you're streaming data into the CPU sequentially, but you're still going to have problems.

Of course, the fastest way to do something is to not do it. Can the captured data be written anywhere in RAM, or is the buffer at a fixed location? If you can write it anywhere, then you don't need the memcpy at all. If it's fixed, could you process the data in place and use a double buffer type system? That is, start capturing data and when the buffer is half full, start processing the first half of the data. When the buffer's full, start writing captured data to the start and process the second half. This requires that the algorithm can process the data faster than the capture card produces it. It also assumes that the data is discarded after processing. Effectively, this is a memcpy with a transformation as part of the copy process, so you've got:

load -> transform -> save
\--/                 \--/
 capture card        RAM
   buffer

instead of:

load -> save -> load -> transform -> save
\-----------/
memcpy from
capture card
buffer to RAM
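
As a rough sketch of the "load -> transform -> save" variant, the loop below reads each sample from the capture buffer once, transforms it, and writes the result straight to its final location. Everything named here is hypothetical: transform_sample is only a placeholder (it flips the top bit of each 16-bit sample) standing in for whatever processing the application really does.

```c
#include <stddef.h>

/* Placeholder for the real per-sample processing (an assumption, not the
   asker's algorithm): flips the most significant bit of a 16-bit sample. */
static unsigned short transform_sample(unsigned short x)
{
    return (unsigned short)(x ^ 0x8000u);
}

/* Fused copy: each sample is loaded from the capture buffer, transformed,
   and stored to its destination in a single pass, so the data crosses the
   memory bus once in each direction instead of twice. */
void copy_with_transform(unsigned short *dest, const unsigned short *src, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        dest[i] = transform_sample(src[i]);
}
```

This only pays off if the processing can keep up with the capture rate, as noted above.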

Or get faster RAM!

EDIT: Another option is to process the data between the data source and the PC - could you put a DSP / FPGA in there at all? Custom hardware will always be faster than a general purpose CPU.

Another thought: It's been a while since I've done any high performance graphics stuff, but could you DMA the data into the graphics card and then DMA it out again? You could even take advantage of CUDA to do some of the processing. This would take the CPU out of the memory transfer loop altogether.

你如我软肋 2024-10-10 20:57:41

First of all, you need to check that memory is aligned on a 16-byte boundary, otherwise you get penalties. This is the most important thing.

If you don't need a standard-compliant solution, you could check if things improve by using some compiler-specific extension such as memcpy64 (check your compiler docs to see if there's something available). The fact is that memcpy must be able to deal with single-byte copies, but moving 4 or 8 bytes at a time is much faster if you don't have this restriction.

Again, is it an option for you to write inline assembly code?
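
For the alignment check mentioned at the start of this answer, a minimal sketch in plain C could look like the following. The helper names is_aligned and align_up are my own, and align_up assumes the alignment is a power of two (16 in this case).

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if ptr is a multiple of `alignment` bytes. */
int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}

/* Returns the first address at or after ptr that is a multiple of
   `alignment`. Assumes `alignment` is a power of two. */
void *align_up(void *ptr, size_t alignment)
{
    uintptr_t addr = (uintptr_t)ptr;
    uintptr_t mask = (uintptr_t)alignment - 1;
    return (void *)((addr + mask) & ~mask);
}
```

A typical use would be to allocate alignment - 1 extra bytes, pass the raw pointer through align_up, and keep the raw pointer around for free(). Checking both source and destination with is_aligned(p, 16) before the copy shows quickly whether alignment is a factor at all.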

橪书 2024-10-10 20:57:41

Perhaps you can explain some more about how you're processing the larger memory area?

Would it be possible within your application to simply pass ownership of the buffer, rather than copy it? This would eliminate the problem altogether.

Or are you using memcpy for more than just copying? Perhaps you're using the larger area of memory to build a sequential stream of data from what you've captured? Especially if you're processing one character at a time, you may be able to meet halfway. For example, it may be possible to adapt your processing code to accommodate a stream represented as 'an array of buffers', rather than 'a continuous memory area'.
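
To illustrate what 'an array of buffers' might look like to the processing code, here is a small sketch. The buffer_stream type and its functions are hypothetical names, and the sketch assumes every buffer has the same size (as with the 2MB capture buffers in the question).

```c
#include <stddef.h>

/* A logical byte stream made of several equally sized buffers that are
   not contiguous in memory. Nothing is copied; only pointers are kept. */
typedef struct {
    unsigned char **buffers;  /* buffers[i] points at one capture buffer */
    size_t buffer_size;       /* size of each buffer in bytes */
    size_t buffer_count;      /* number of buffers in the stream */
} buffer_stream;

/* Total number of bytes in the stream. */
size_t stream_length(const buffer_stream *s)
{
    return s->buffer_size * s->buffer_count;
}

/* Byte at logical position `pos`, as if the buffers were laid end to end.
   The caller must keep pos below stream_length(s). */
unsigned char stream_byte(const buffer_stream *s, size_t pos)
{
    return s->buffers[pos / s->buffer_size][pos % s->buffer_size];
}
```

Processing code written against stream_byte (or, more efficiently, against one whole buffer at a time) never needs the data to be gathered into a single block, which is what the memcpy was doing.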

恋竹姑娘 2024-10-10 20:57:41

You can write a better implementation of memcpy using SSE2 registers. The version in VC2010 does this already. So the question is more whether you are handing it aligned memory.

Maybe you can do better than the VC 2010 version, but it does take some understanding of how to do it.

PS: You can pass the buffer to the user mode program in an inverted call, to prevent the copy altogether.

呆头 2024-10-10 20:57:41

One source I would recommend you read is MPlayer's fast_memcpy function. Also consider the expected usage patterns, and note that modern CPUs have special store instructions which let you inform the CPU whether or not you will need to read back the data you're writing. Using the instructions that indicate you won't be reading back the data (and thus it doesn't need to be cached) can be a huge win for large memcpy operations.
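
As an illustration of those store instructions, here is a minimal sketch using the SSE2 non-temporal store intrinsic _mm_stream_si128. This is not MPlayer's fast_memcpy; the function name stream_copy is hypothetical, and on targets without SSE2 the sketch simply falls back to memcpy. Whether it beats the library memcpy depends on the CPU and buffer sizes, so it needs to be measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#endif

/* Copy `bytes` bytes from src to dest, writing the bulk of the data with
   non-temporal stores that bypass the cache. Returns dest, like memcpy. */
void *stream_copy(void *dest, const void *src, size_t bytes)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;

#ifdef HAVE_SSE2_STREAM
    /* _mm_stream_si128 needs a 16-byte aligned destination, so copy the
       leading bytes normally until d reaches a 16-byte boundary. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > bytes)
        head = bytes;
    memcpy(d, s, head);
    d += head;
    s += head;
    bytes -= head;

    /* Main loop: unaligned 16-byte load, non-temporal 16-byte store. */
    while (bytes >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        bytes -= 16;
    }

    /* Make the streamed stores visible before returning. */
    _mm_sfence();
#endif

    /* Remaining tail bytes, or the whole copy when SSE2 is unavailable. */
    memcpy(d, s, bytes);
    return dest;
}
```

Non-temporal stores tend to help when the destination is much larger than the cache and is not read again soon, which matches the capture scenario in the question; for small or soon-reused buffers they can be slower than a plain memcpy.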
