Very fast memcpy for image processing?

Posted on 2024-08-10 09:30:27

I am doing image processing in C that requires copying large chunks of data around memory - the source and destination never overlap.

What is the absolute fastest way to do this on the x86 platform using GCC (where SSE, SSE2 but NOT SSE3 are available)?

I expect the solution will either be in assembly or using GCC intrinsics?

I found the following link but have no idea whether it's the best way to go about it (the author also says it has a few bugs): http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2006-02/msg00123.html

EDIT: note that a copy is necessary, I cannot get around having to copy the data (I could explain why but I'll spare you the explanation :))


Comments (8)

眼眸里的快感 2024-08-17 09:30:27

Courtesy of William Chan and Google. 30-70% faster than memcpy in Microsoft Visual Studio 2005.

void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
{

  __asm
  {
    mov esi, src;    //src pointer
    mov edi, dest;   //dest pointer

    mov ebx, size;   //ebx is our counter 
    shr ebx, 7;      //divide by 128 (8 * 128-bit registers per iteration)
    jz  loop_copy_end; //guard: a size below 128 would otherwise wrap the counter


    loop_copy:
      prefetchnta 128[ESI]; //SSE2 prefetch
      prefetchnta 160[ESI];
      prefetchnta 192[ESI];
      prefetchnta 224[ESI];

      movdqa xmm0, 0[ESI]; //move data from src to registers
      movdqa xmm1, 16[ESI];
      movdqa xmm2, 32[ESI];
      movdqa xmm3, 48[ESI];
      movdqa xmm4, 64[ESI];
      movdqa xmm5, 80[ESI];
      movdqa xmm6, 96[ESI];
      movdqa xmm7, 112[ESI];

      movntdq 0[EDI], xmm0; //move data from registers to dest
      movntdq 16[EDI], xmm1;
      movntdq 32[EDI], xmm2;
      movntdq 48[EDI], xmm3;
      movntdq 64[EDI], xmm4;
      movntdq 80[EDI], xmm5;
      movntdq 96[EDI], xmm6;
      movntdq 112[EDI], xmm7;

      add esi, 128;
      add edi, 128;
      dec ebx;

      jnz loop_copy; //loop please

      sfence;        //flush the write-combining buffers after the movntdq stores
    loop_copy_end:
  }
}

You may be able to optimize it further depending on your exact situation and any assumptions you are able to make.

You may also want to check out the memcpy source (memcpy.asm) and strip out its special case handling. It may be possible to optimise further!
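
Since the question asks for GCC, here is a minimal sketch of the same idea using SSE2 intrinsics instead of MSVC inline assembly. Like the code above, it assumes both pointers are 16-byte aligned and that size is a multiple of 128 bytes; the function name is mine, not part of the original.

#include <emmintrin.h>  // SSE2 intrinsics (the SSE prefetch/fence ones come along)
#include <stddef.h>

void x_memcpy_sse2_intrinsics(void* dest, const void* src, size_t size)
{
  const __m128i* s = (const __m128i*)src;
  __m128i* d = (__m128i*)dest;

  for (size_t n = size / 128; n != 0; --n)
  {
    _mm_prefetch((const char*)s + 128, _MM_HINT_NTA);  // prefetch the next block
    _mm_prefetch((const char*)s + 160, _MM_HINT_NTA);
    _mm_prefetch((const char*)s + 192, _MM_HINT_NTA);
    _mm_prefetch((const char*)s + 224, _MM_HINT_NTA);

    __m128i x0 = _mm_load_si128(s + 0);  // aligned 128-bit loads (movdqa)
    __m128i x1 = _mm_load_si128(s + 1);
    __m128i x2 = _mm_load_si128(s + 2);
    __m128i x3 = _mm_load_si128(s + 3);
    __m128i x4 = _mm_load_si128(s + 4);
    __m128i x5 = _mm_load_si128(s + 5);
    __m128i x6 = _mm_load_si128(s + 6);
    __m128i x7 = _mm_load_si128(s + 7);

    _mm_stream_si128(d + 0, x0);         // non-temporal stores (movntdq)
    _mm_stream_si128(d + 1, x1);
    _mm_stream_si128(d + 2, x2);
    _mm_stream_si128(d + 3, x3);
    _mm_stream_si128(d + 4, x4);
    _mm_stream_si128(d + 5, x5);
    _mm_stream_si128(d + 6, x6);
    _mm_stream_si128(d + 7, x7);

    s += 8;
    d += 8;
  }
  _mm_sfence();  // make the non-temporal stores globally visible
}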

瑶笙 2024-08-17 09:30:27

The SSE code posted by hapalibashi is the way to go.

If you need even more performance and don't shy away from the long and winding road of writing a device driver: all important platforms nowadays have a DMA controller that can carry out a copy job faster than CPU code could, and in parallel with it.

That involves writing a driver, though. No big OS that I'm aware of exposes this functionality to user-side code, because of the security risks.

However, it may be worth it (if you need the performance), since no code on earth can outperform a piece of hardware designed to do exactly this kind of job.

空心空情空意 2024-08-17 09:30:27

This question is four years old now and I'm a little surprised nobody has mentioned memory bandwidth yet. CPU-Z reports that my machine has PC3-10700 RAM, which means the RAM has a peak bandwidth (aka transfer rate, throughput, etc.) of 10700 MBytes/sec. The CPU in my machine is an i5-2430M, with a peak turbo frequency of 3 GHz.

Theoretically, with an infinitely fast CPU and my RAM, memcpy could go at 5300 MBytes/sec, i.e. half of 10700, because memcpy has to read from and then write to RAM. (Edit: as v.oddou pointed out, this is a simplistic approximation.)

On the other hand, imagine we had infinitely fast RAM and a realistic CPU, what could we achieve? Let's use my 3 GHz CPU as an example. If it could do a 32-bit read and a 32-bit write each cycle, then it could transfer 3e9 * 4 = 12000 MBytes/sec. This seems easily within reach for a modern CPU. Already, we can see that the code running on the CPU isn't really the bottleneck. This is one of the reasons that modern machines have data caches.

We can measure what the CPU can really do by benchmarking memcpy when we know the data is cached. Doing this accurately is fiddly. I made a simple app that wrote random numbers into an array, memcpy'd them to another array, then checksummed the copied data. I stepped through the code in the debugger to make sure that the clever compiler had not removed the copy. Altering the size of the array alters the cache performance - small arrays fit in the cache, big ones less so. I got the following results:

  • 40 KByte arrays: 16000 MBytes/sec
  • 400 KByte arrays: 11000 MBytes/sec
  • 4000 KByte arrays: 3100 MBytes/sec

Obviously, my CPU can read and write more than 32 bits per cycle, since 16000 is more than the 12000 I calculated theoretically above. This means the CPU is even less of a bottleneck than I already thought. I used Visual Studio 2005, and stepping into the standard memcpy implementation, I can see that it uses the movdqa instruction on my machine, which moves 128 bits per load or store.

The nice code hapalibashi posted achieves 4200 MBytes/sec on my machine - about 40% faster than the VS 2005 implementation. I guess it is faster because it uses the prefetch instruction to improve cache performance.

In summary, the code running on the CPU isn't the bottleneck and tuning that code will only make small improvements.
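
For reference, here is a stripped-down sketch of that kind of benchmark (the helper is my own reconstruction, not the exact app I used; a serious measurement would use a high-resolution timer and repeated runs):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// Fill a buffer with random bytes, time repeated memcpys of it, then
// checksum the copy so the compiler cannot optimise the copying away.
static uint32_t bench_copy(size_t bytes, int iters)
{
  uint8_t* src = malloc(bytes);
  uint8_t* dst = malloc(bytes);
  for (size_t i = 0; i < bytes; i++)
    src[i] = (uint8_t)rand();

  clock_t t0 = clock();
  for (int it = 0; it < iters; it++)
    memcpy(dst, src, bytes);
  double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

  uint32_t sum = 0;
  for (size_t i = 0; i < bytes; i++)  // checksum after timing; keeps the copy live
    sum += dst[i];

  printf("%zu KByte: %.0f MBytes/sec (checksum %u)\n", bytes / 1024,
         ((double)bytes * iters) / (secs * 1e6), sum);
  free(src);
  free(dst);
  return sum;
}

int main(void)
{
  bench_copy(40 * 1024, 100000);
  bench_copy(400 * 1024, 10000);
  bench_copy(4000 * 1024, 1000);
  return 0;
}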

玉环 2024-08-17 09:30:27

At any optimisation level of -O1 or above, GCC will use builtin definitions for functions like memcpy - with the right -march parameter (-march=pentium4 for the set of features you mention) it should generate pretty optimal architecture-specific inline code.

I'd benchmark it and see what comes out.
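
For instance (the file and function names here are only illustrative), a plain memcpy call compiled with the right flags is all it takes:

// copy.c - build with:  gcc -O2 -march=pentium4 -S copy.c
// then inspect copy.s to see what GCC actually emits for the call.
#include <string.h>

void copy_scanline(unsigned char* dst, const unsigned char* src, unsigned n)
{
  memcpy(dst, src, n);  // at -O1 and above GCC may expand this via its builtin
}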

恍梦境° 2024-08-17 09:30:27

If you're targeting Intel processors specifically, you might benefit from IPP. If you know the code will run on an Nvidia GPU, perhaps you could use CUDA - in both cases it may be better to look wider than optimising memcpy(): they provide opportunities for improving your algorithm at a higher level. Both, however, are reliant on specific hardware.

放赐 2024-08-17 09:30:27

If you're on Windows, use the DirectX APIs, which have specific GPU-optimized routines for graphics handling. (How fast could it be? Your CPU isn't loaded; let it do something else while the GPU munches the data.)

If you want to be OS agnostic, try OpenGL.

Do not fiddle with assembler: it is all too likely that you'll fail miserably to outperform library-making software engineers with ten-plus years of proficiency.

始终不够爱げ你 2024-08-17 09:30:27

Old question but two things nobody has pointed out so far:

  1. Most compilers have their own version of memcpy; since memcpy is well defined and part of the C standard, compilers don't have to use the implementation that ships with the system libraries; they are free to use their own. As the question mentions "intrinsics": most of the time you write memcpy in your code, you are in fact already using a compiler intrinsic, because that's what the compiler uses internally instead of making a real call to memcpy; it can even inline the copy and thus eliminate any function-call overhead (see the sketch at the end of this answer).

  2. Most memcpy implementations I know of already use things like SSE2 internally when available, at least the good ones do. Visual Studio 2005's may not have, but GCC has been doing so for ages. Of course, what they use depends on the build settings: they will only use instructions available on all CPUs the code is meant to run on, so be sure to set the architecture correctly (e.g. -march and -mtune), as well as other flags (e.g. enabling support for optional instruction sets). All of that influences what code the compiler generates for memcpy in the final binary.

So as always, don't assume you can outsmart the compiler or the system (which may also have different memcpy implementations available for different CPUs); benchmark to prove it! Unless a benchmark shows that your handwritten code really is faster in real life, leave it to the compiler and the system: they will adapt to new CPUs, and the system may get updates that automatically make your code run faster in the future, whereas you would have to re-optimize handwritten code all by yourself, and it will never get any faster unless you ship an update yourself.
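
As a small illustration of point 1 (the Tile type here is hypothetical): when the size is a compile-time constant, GCC typically expands the copy inline, with no call into libc at all:

#include <string.h>

typedef struct { unsigned char px[64]; } Tile;

void copy_tile(Tile* dst, const Tile* src)
{
  memcpy(dst, src, sizeof *dst);  // constant size: expanded inline at -O1 and above
}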

只是一片海 2024-08-17 09:30:27

If you have access to a DMA engine, nothing will be faster.
