Make the compiler copy characters using movsd
I would like to copy a relatively short sequence of memory (less than 1 KB, typically 2-200 bytes) in a time-critical function. The best code for this on the CPU side seems to be rep movsd. However, I somehow cannot make my compiler generate this code. I hoped (and I vaguely remember seeing so) that using memcpy would do this via compiler built-in intrinsics, but based on disassembly and debugging it seems the compiler is calling the memcpy/memmove library implementation instead. I also hoped the compiler might be smart enough to recognize the following loop and use rep movsd on its own, but it seems it does not.
char *dst;
const char *src;
// ...
for (int r=size; --r>=0; ) *dst++ = *src++;
Is there some way to make the Visual Studio compiler generate a rep movsd sequence other than using inline assembly?
Comments (6)
Several questions come to mind.
First, how do you know movsd would be faster? Have you looked up its latency/throughput? The x86 architecture is full of crufty old instructions that should not be used because they're just not very efficient on modern CPUs.
Second, what happens if you use std::copy instead of memcpy? std::copy is potentially faster, as it can be specialized at compile time for the specific data type.
And third, have you enabled intrinsic functions under project properties -> C/C++ -> Optimization?
Of course I assume other optimizations are enabled as well.
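To make the second question concrete, here is a minimal sketch of the two variants side by side. The helper names copy_with_memcpy and copy_with_std_copy are made up for illustration, and whether either one compiles to inline code or to a library call depends on the compiler and its settings, so the disassembly still needs checking.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical helper names. Both copy `size` bytes from src to dst;
// the two ranges must not overlap.
inline void copy_with_memcpy(char *dst, const char *src, std::size_t size) {
    std::memcpy(dst, src, size);
}

inline void copy_with_std_copy(char *dst, const char *src, std::size_t size) {
    // For trivially copyable element types, standard library implementations
    // commonly forward std::copy to memmove. That is an implementation
    // detail, not something the C++ standard promises.
    std::copy(src, src + size, dst);
}
```

Both functions produce the same bytes in dst; any difference is only in the code the compiler emits.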
Are you running an optimised build? It won't use an intrinsic unless optimisation is on. It's also worth noting that it will probably use a better copy loop than rep movsd. It should try to use MMX, at the least, to copy 64 bits at a time. In fact, 6 or 7 years back I wrote an MMX-optimised copy loop for this sort of thing. Unfortunately, the compiler's intrinsic memcpy outperformed my MMX copy by about 1%. That really taught me not to make assumptions about what the compiler is doing.
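One way to avoid guessing is to time the candidates directly. The following is only a rough sketch: time_copy_ns, byte_loop_copy and library_copy are hypothetical names, and a serious benchmark would also need warm-up runs, repeated measurements and a check of the generated assembly.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical harness: runs copy_fn `iterations` times on the same block
// and returns the elapsed wall-clock time in nanoseconds.
template <typename CopyFn>
long long time_copy_ns(CopyFn copy_fn, char *dst, const char *src,
                       std::size_t size, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        copy_fn(dst, src, size);
        // Read one byte through a volatile pointer so the optimizer is
        // less likely to discard the copy altogether.
        volatile char sink = *static_cast<volatile char *>(dst);
        (void)sink;
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}

// The byte loop from the question, written with an index.
inline void byte_loop_copy(char *dst, const char *src, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) dst[i] = src[i];
}

// The library (or intrinsic) memcpy, for comparison.
inline void library_copy(char *dst, const char *src, std::size_t size) {
    std::memcpy(dst, src, size);
}
```

Calling time_copy_ns(byte_loop_copy, dst, src, 200, 1000000) and the same with library_copy gives two numbers to compare for a 200-byte block; the absolute values depend entirely on the machine and build flags.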
Using memcpy with a constant size
What I have found meanwhile:
The compiler will use the intrinsic when the copied block size is known at compile time. When it is not, it calls the library implementation. When the size is known, the generated code is very nice, selected based on the size. It may be a single mov, or movsd, or movsd followed by movsb, as needed.
It seems that if I really want to always use movsb or movsd, even with a "dynamic" size, I will have to use inline assembly or a special intrinsic (see below). I know the size is "quite short", but the compiler does not know that and I cannot communicate it - I have even tried __assume(size<16), but that is not enough.
Demo code, compile with "-Ob1" (expansion for inline only):
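The demo listing itself did not survive on this page. As a stand-in, here is a minimal sketch of the two cases described above; copy_fixed and copy_dynamic are hypothetical names, not the original code. With the size as a template parameter, memcpy sees a compile-time constant; with a run-time argument it does not.

```cpp
#include <cstddef>
#include <cstring>

// Size is a template parameter, so it is a compile-time constant at the
// memcpy call. This is the case where inline code is reported above.
template <std::size_t Size>
inline void copy_fixed(char *dst, const char *src) {
    std::memcpy(dst, src, Size);
}

// Size is only known at run time. This is the case where a call to the
// library memcpy is reported above.
inline void copy_dynamic(char *dst, const char *src, std::size_t size) {
    std::memcpy(dst, src, size);
}
```

Comparing the disassembly of copy_fixed<7>(dst, src) with copy_dynamic(dst, src, n) should show whether a given compiler version behaves as described.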
Specialized intrinsics
I have recently found that there is a very simple and natural way to make the Visual Studio compiler copy characters using movsd: intrinsics. The following intrinsics may come in handy:
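The list of intrinsics is missing from this page. MSVC's <intrin.h> does declare __movsb and __movsd, which are documented to generate rep movsb and rep movsd, so a sketch along those lines might look like this. The helper name copy_small is hypothetical, and the memcpy fallback for other compilers is an assumption added only so the code builds everywhere.

```cpp
#include <cstddef>
#include <cstring>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

// Hypothetical helper: copies `size` bytes, whole dwords first, then the
// remaining 0-3 bytes. The ranges must not overlap.
inline void copy_small(char *dst, const char *src, std::size_t size) {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    std::size_t dwords = size / 4;
    // rep movsd for the 4-byte part. The const_cast keeps the call valid
    // whether or not the header declares the source pointer as const.
    __movsd(reinterpret_cast<unsigned long *>(dst),
            reinterpret_cast<unsigned long *>(const_cast<char *>(src)),
            dwords);
    // rep movsb for the tail.
    __movsb(reinterpret_cast<unsigned char *>(dst + dwords * 4),
            reinterpret_cast<const unsigned char *>(src + dwords * 4),
            size % 4);
#else
    // Assumption: outside MSVC on x86/x64, fall back to memcpy.
    std::memcpy(dst, src, size);
#endif
}
```

Whether this beats the library memcpy for 2-200 byte blocks is something only a measurement on the target CPU can answer.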
Have you timed memcpy? On recent versions of Visual Studio, the memcpy implementation uses SSE2... which should be faster than rep movsd. If the block you're copying is 1 KB, then it's not really a problem that the compiler isn't using an intrinsic, since the time for the function call will be negligible compared to the time for the copy.
Note that in order to use movsd, src must point to memory aligned to a 32-bit boundary and its length must be a multiple of 4 bytes.
If it is, why does your code use char * instead of int * or something? If it's not, your question is moot.
If you change char * to int *, you might get a better result from std::copy.
Edit: have you measured that the copying is the bottleneck?
Use memcpy. This problem has already been solved.
FYI, rep movsd is not always the best; rep movsb can be faster in some circumstances, and with SSE and the like the best is movntdq [edi], xmm0. You can even optimize further for large amounts of memory by exploiting page locality: move the data to a buffer first and then move it to your destination.