I am trying to understand the principles of machine code alignment. I have an assembler implementation that generates machine code at run time. I use 16-byte alignment on every branch destination, but it looks like that is not the optimal choice: I've noticed that if I remove the alignment, the same code sometimes runs faster. I think this has something to do with cache line width, so that some instructions are split across a cache line boundary and the CPU stalls because of that. So if some alignment bytes are inserted in one place, they push instructions somewhere further along past a cache line boundary...
I was hoping to implement an automatic alignment procedure that processes the code as a whole and inserts alignment according to the specification of the CPU (cache line width, 32/64 bits and so on)...
Can someone give some hints about this procedure? As an example, the target could be an Intel Core i7 CPU on a 64-bit platform.
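To make that suspicion concrete, here is a minimal sketch of the two calculations involved: how many filler bytes an alignment request costs, and whether a given instruction ends up spanning a boundary. The function names and the 64-byte line size are assumptions made for illustration; they are not taken from any particular assembler.

```python
CACHE_LINE = 64  # assumed cache line size in bytes; check the target CPU


def padding_for(offset: int, alignment: int = 16) -> int:
    """Filler bytes needed so that `offset` becomes a multiple of `alignment`."""
    return -offset % alignment


def crosses_boundary(offset: int, length: int, boundary: int = CACHE_LINE) -> bool:
    """True if an instruction of `length` bytes starting at `offset`
    occupies bytes in two different `boundary`-sized blocks."""
    first_block = offset // boundary
    last_block = (offset + length - 1) // boundary
    return first_block != last_block
```

For example, a 5-byte instruction at offset 62 spans two 64-byte lines, while the same instruction at offset 59 does not. Padding inserted earlier in the code shifts every later offset, which is how an alignment in one place can push a different instruction across a boundary.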
Thank you.
Comments (4)
I'm not qualified to answer your question because this is such a vast and complicated topic. There are probably many more mechanisms in play here, other than cache line size.
However, I would like to point you to Agner Fog's site and the optimization manuals for compiler makers that you can find there. They contain a plethora of information on these kinds of subjects - cache lines, branch prediction and data/code alignment.
Paragraph (16-byte) alignment is usually the best. However, it can force some "local" JMP instructions to no longer be local (due to code size bloat). It may also result in less code fitting in the cache. I would only align major segments of code; I would not align every tiny subroutine/JMP section.
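The bloat effect can be sketched as follows. This is only an illustration under stated assumptions: `layout` and `fits_rel8` are hypothetical helper names, instructions are modelled as a plain list of byte lengths, and "local" is taken to mean an x86 short jump, whose signed 8-bit displacement is measured from the end of the jump instruction.

```python
from typing import List, Set, Tuple


def layout(lengths: List[int], aligned: Set[int],
           alignment: int = 16) -> Tuple[List[int], int]:
    """Assign an offset to each instruction.

    lengths[i] is the size in bytes of instruction i. Instructions whose
    index is in `aligned` are padded up to a multiple of `alignment`.
    Returns (offsets, total_padding_bytes).
    """
    offsets = []
    pos = 0
    total_padding = 0
    for i, size in enumerate(lengths):
        if i in aligned:
            pad = -pos % alignment
            pos += pad
            total_padding += pad
        offsets.append(pos)
        pos += size
    return offsets, total_padding


def fits_rel8(jump: int, target: int,
              offsets: List[int], lengths: List[int]) -> bool:
    """True if the jump at index `jump` can reach `target` with a
    signed 8-bit displacement (-128..127)."""
    disp = offsets[target] - (offsets[jump] + lengths[jump])
    return -128 <= disp <= 127
```

Take a 2-byte jump at index 0 that targets index 18, with seventeen 7-byte instructions in between. With no alignment the target sits at offset 121 and the displacement is 119, so the short form fits. Aligning indices 6, 12 and 18 to 16 bytes adds 23 bytes of padding, moves the target to offset 144, and the displacement of 142 no longer fits. In a real assembler the jump would then grow to its longer encoding, which shifts the layout again, so a full procedure would have to repeat this until the sizes stop changing.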
Not an expert, however... Branches to places that are not going to be in the instruction cache should benefit from alignment the most, because you'll read a whole cache line of instructions to fill the pipeline. Given that, forward branches will benefit on the first run of a function. Backward branches ("for" and "while" loops, for example) will probably not benefit, because the branch target and the following instructions have already been read into the cache. Do follow the links in Martin's answer.
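If one wanted to try that heuristic, picking the candidates is simple. The sketch below assumes branches are available as (source offset, target offset) pairs; `forward_targets` is a hypothetical name, and whether aligning only these targets actually helps would have to be measured on the target CPU.

```python
from typing import Iterable, List, Tuple


def forward_targets(branches: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the sorted, de-duplicated targets of forward branches,
    i.e. branches whose target lies at a higher offset than the source."""
    return sorted({target for source, target in branches if target > source})
```

Backward branches such as loop back-edges are left out here purely to mirror the reasoning above.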
As mentioned previously, this is a very complex area, and Agner Fog's site seems like a good place to visit. As for the complexities, I ran across Torbjörn Granlund's article "Improved Division by Invariant Integers". In the code he uses to illustrate his new algorithm, the first instruction at what I guess is the main label is a nop (no operation). According to the commentary, it improves performance significantly. Go figure.