Are word-aligned loads faster than unaligned loads on x64 processors?
Are loads of variables that are aligned on word boundaries faster than unaligned load operations on x86/64 (Intel/AMD 64 bit) processors?
A colleague of mine argues that unaligned loads are slow and should be avoided. He cites the padding of items to word boundaries in structs as proof that unaligned loads are slow. Example:
struct A {
    char a;
    uint64_t b;
};
The struct A usually has a size of 16 bytes.
On the other hand, the documentation of the Snappy compressor states that Snappy assumes that "unaligned 32- and 64-bit loads and stores are cheap". According to the source code this is true of Intel 32 and 64-bit processors.
So: What is the truth here? If and by how much are unaligned loads slower? Under which circumstances?
5 Answers
A Random Guy On The Internet I've found says that for the 486, an aligned 32-bit access takes one cycle. An unaligned 32-bit access that spans quads but is within the same cache line takes four cycles. An unaligned access that spans multiple cache lines can take an extra six to twelve cycles.
Given that an unaligned access requires accessing multiple quads of memory, pretty much by definition, I'm not at all surprised by this. I'd imagine that better caching performance on modern processors makes the cost a little less bad, but it's still something to be avoided.
(Incidentally, if your code has any pretensions to portability... ia32 and descendants are pretty much the only modern architectures that support unaligned accesses at all. ARM, for example, can vary between throwing an exception, emulating the access in software, or just loading the wrong value, depending on OS!)
Update: Here's someone who actually went and measured it. On his hardware he reckons unaligned access to be half as fast as aligned. Go try it for yourself...
Aligned loads and stores are faster; two excerpts from the Intel Optimization Manual clearly point this out:
AND
Following that part in 3.6.4, there is a nice rule for compiler developers:
followed by a listing of alignment rules and another gem in 3.6.6
Both rules are marked as high impact, meaning they can greatly change performance. Together with the excerpts, the rest of Section 3.6 is filled with other reasons to naturally align your data. It's well worth any developer's time to read these manuals, if only to understand the hardware they are working on.
Unaligned loads/stores should never be used, but the reason is not performance. The reason is that the C language forbids them (both via the alignment rules and the aliasing rules), and they don't work on many systems without extremely slow emulation code - code which may also break the C11 memory model needed for proper behavior of multi-threaded code, unless it's done on a purely byte-by-byte level.
As for x86 and x86_64, for most operations (except some SSE instructions), misaligned loads and stores are allowed, but that doesn't mean they're as fast as correct accesses. It just means the CPU does the emulation for you, and does it somewhat more efficiently than you could do yourself. As an example, a memcpy-type loop that's doing misaligned word-size reads and writes will be moderately slower than the same memcpy doing aligned access, but it will also be faster than writing your own byte-by-byte copy loop.
To fix up a misaligned read, the processor needs to do two aligned reads and fix up the result. This is slower than having to do one read and no fix-ups.
The Snappy code has special reasons for exploiting unaligned access. It will work on x86_64; it won't work on architectures where unaligned access is not an option, and it will work slowly where fixing up unaligned access is a system call or a similarly expensive operation. (On DEC Alpha, there was a mechanism approximately equivalent to a system call for fixing up unaligned access, and you had to turn it on for your program.)
Using unaligned access is an informed decision that the authors of Snappy made. It does not make it sensible for everyone to emulate it. Compiler writers would be excoriated for the poor performance of their code if they used it by default, for example.
Unaligned 32 and 64 bit access is NOT cheap.
I ran tests to verify this. My results on a Core i5 M460 (64-bit) were as follows: the fastest integer type was 32 bits wide. 64-bit alignment was slightly slower, but almost the same. 16-bit and 8-bit alignment were both noticeably slower than 32- and 64-bit alignment, with 16-bit slower than 8-bit. By far the slowest form of access was unaligned 32-bit access, which was 3.5 times slower than aligned 32-bit access (the fastest of them all); unaligned 32-bit access was even 40% slower than unaligned 64-bit access.
Results: https://github.com/mkschreder/align-test/blob/master/results-i5-64bit.jpg?raw=true
Source code: https://github.com/mkschreder/align-test