What C++ code compiles to the x86 REP instruction?
I'm copying elements from one array to another in C++. I found the rep movs instruction in x86, which seems to copy ECX elements from the array at ESI to the array at EDI. However, neither the for nor the while loops I tried compiled to a rep movs instruction in VS 2008 (on an Intel Xeon x64 processor). How can I write code that will get compiled to this instruction?
Comments (6)
Honestly, you shouldn't. REP is something of an obsolete holdover in the instruction set, and it is actually pretty slow, since it has to call a microcoded subroutine inside the CPU, which has a ROM lookup latency and is non-pipelined as well.
In almost every implementation, you will find that the memcpy() compiler intrinsic is both easier to use and faster.
Under MSVC there are the __movsxxx and __stosxxx intrinsics, which will generate a REP-prefixed instruction. There is also a 'hack' to force an intrinsic memset (aka REP STOS) under VC9+, as the intrinsic no longer exists due to the SSE2 branching in the CRT. This is better than __stosxxx because the compiler can optimize it for constants and order it correctly. Of course, REP isn't always the best thing to use. IMO you're way better off using memcpy, which will branch to either SSE2 or REP MOVS based on your system (under MSVC), unless you feel like writing custom assembly for 'hot' areas...
If you need exactly that instruction, use the built-in assembler and write the instruction manually. You can't rely on the compiler to produce any specific machine code: even if it emits the instruction in one compilation, it can decide to emit some other equivalent during the next compilation.
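For illustration, here is a minimal sketch of writing the instruction yourself. It assumes an x86 or x86-64 target, and copy_bytes_rep is a hypothetical helper name rather than anything from the answers above. On MSVC it uses the __movsb intrinsic from <intrin.h> (64-bit MSVC has no inline assembler), on GCC or Clang it uses extended inline assembly, and on any other target it falls back to std::memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

// Hypothetical helper: copies `count` bytes from `src` to `dst` using
// REP MOVSB where that is available. The two ranges must not overlap.
inline void copy_bytes_rep(void* dst, const void* src, std::size_t count) {
    if (count == 0) {
        return;  // nothing to copy
    }
    assert(dst != nullptr && src != nullptr);
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    // MSVC intrinsic that emits REP MOVSB.
    __movsb(static_cast<unsigned char*>(dst),
            static_cast<const unsigned char*>(src), count);
#elif (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // "+D", "+S" and "+c" bind dst, src and count to (E/R)DI, (E/R)SI and
    // (E/R)CX. The instruction modifies all three, so they are declared as
    // read-write operands. The calling convention keeps the direction flag
    // clear, so the copy runs forward.
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(count)
                     :
                     : "memory");
#else
    // Assumption: not an x86 target, so use the library routine instead.
    std::memcpy(dst, src, count);
#endif
}
```

Calling copy_bytes_rep(dst, src, n) on two non-overlapping buffers behaves like memcpy(dst, src, n). Whether it is faster than memcpy depends on the CPU and the copy size, so measure before relying on it.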
REP and friends were nice once upon a time, when the x86 CPU was a single-pipeline industrial CISC processor.
But that has changed. Nowadays, when the processor encounters any instruction, the first thing it does is translate it into a simpler format (VLIW-like micro-ops) and schedule it for future execution (this is part of out-of-order execution and of scheduling between different logical CPU cores, and it can be used to simplify write-after-write sequences into single writes, etc.). This machinery works well for instructions that translate into a few VLIW-like micro-ops, but not for machine code that translates into loops. Loop-translated machine code will probably cause the execution pipeline to stall.
Rather than spending hundreds of thousands of transistors on CPU circuitry for handling looping portions of the micro-ops in the execution pipeline, they just handle it in some sort of crappy legacy mode that stutteringly stalls the pipeline, and ask modern programmers to write their own damn loops!
Therefore it is seldom used when machines write code. If you encounter REP in a binary executable, it was probably written by a human assembly-muppet who didn't know better, or by a cracker who really needed the few bytes it saved over an actual loop.
(However, take everything I just wrote with a grain of salt. Maybe this is not true anymore. I am not 100% up to date with the internals of x86 CPUs anymore; I got into other hobbies...)
I use the rep* prefix variants with the cmps*, movs*, scas* and stos* instructions to generate inline code, which minimizes code size and avoids unnecessary calls and jumps, and thereby keeps down the work done by the caches. The alternative is to set up parameters and call memset or memcpy somewhere else, which may be faster overall if I want to copy a hundred bytes or more; but if it's just a matter of 10-20 bytes, using rep is faster (or at least it was the last time I measured).
Since my compiler allows the specification and use of inline assembly functions and includes their register usage and modification in its optimization passes, I can use them when the circumstances are right.
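The size trade-off described above could be sketched roughly as follows for the stos case. This is only an illustration under assumptions: fill_bytes and kInlineFillLimit are hypothetical names, and the cutoff of 32 bytes is a placeholder picked between the "10-20 bytes" and "a hundred bytes" figures mentioned, not a measured value. Small fills use an inline REP STOSB (the __stosb intrinsic on MSVC, extended inline assembly on GCC or Clang); larger ones, and non-x86 targets, call std::memset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

// Assumed cutoff, chosen only for illustration; measure on your own hardware.
constexpr std::size_t kInlineFillLimit = 32;

// Hypothetical helper: sets `count` bytes starting at `dst` to `value`.
inline void fill_bytes(void* dst, unsigned char value, std::size_t count) {
    if (count == 0) {
        return;  // nothing to fill
    }
    assert(dst != nullptr);
    if (count > kInlineFillLimit) {
        // Larger blocks: let the library routine pick its own strategy.
        std::memset(dst, value, count);
        return;
    }
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    // MSVC intrinsic that emits REP STOSB.
    __stosb(static_cast<unsigned char*>(dst), value, count);
#elif (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // "+D" and "+c" bind dst and count to (E/R)DI and (E/R)CX, which the
    // instruction modifies; "a" places the fill byte in AL.
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(count)
                     : "a"(value)
                     : "memory");
#else
    // Assumption: not an x86 target, so use the library routine instead.
    std::memset(dst, value, count);
#endif
}
```

For example, fill_bytes(buf, 0, 16) takes the inline path, while fill_bytes(buf, 0, 4096) goes through memset.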
On a historical note, and without any insight into the manufacturers' strategies: there was a time when the rep movs* (etc.) instructions were very slow. I think it was around the time of the Pentium/Pentium MMX. A colleague of mine (who had more insight than I) said that the manufacturers had decreased the chip area allocated to rep handling (fewer transistors, more microcode) and used it to make other, more frequently used instructions faster.
In the fifteen years or so since then, rep has become relatively faster again, which would suggest more transistors and less microcode.