Memory alignment on 32-bit Intel processors

Posted 2024-07-25 20:09:01

Intel's 32-bit processors such as the Pentium have a 64-bit-wide data bus and therefore fetch 8 bytes per access. Based on this, I'm assuming that the physical addresses that these processors emit on the address bus are always multiples of 8.

Firstly, is this conclusion correct?

Secondly, if it is correct, then one should align data structure members on an 8 byte boundary. But I've seen people using a 4-byte alignment instead on these processors.

How can they be justified in doing so?

Comments (5)

仅此而已 2024-08-01 20:09:01

The usual rule of thumb (straight from Intel's and AMD's optimization manuals) is that every data type should be aligned to its own size. An int32 should be aligned on a 32-bit boundary, an int64 on a 64-bit boundary, and so on. A char will fit just fine anywhere.

Another rule of thumb is, of course "the compiler has been told about alignment requirements". You don't need to worry about it because the compiler knows to add the right padding and offsets to allow efficient access to data.
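
As a rough illustration of the padding the compiler adds, the following C sketch checks where a 32-bit member lands inside a struct. The `record` struct and both helper functions are hypothetical names made up for this example; they are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record, used only to illustrate compiler-inserted padding. */
struct record {
    char    tag;    /* 1 byte, may sit at any address */
    int32_t count;  /* the compiler pads before this member as needed */
    char    flag;   /* trailing padding follows so arrays stay aligned */
};

/* Returns 1 if offset is a multiple of align (align must be a power of two). */
int is_aligned(size_t offset, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset & (align - 1)) == 0;
}

/* Returns 1 if `count` was placed on the alignment its type asks for. */
int count_is_naturally_aligned(void) {
    return is_aligned(offsetof(struct record, count), _Alignof(int32_t));
}
```

The exact offsets and the total size depend on the platform ABI, so the sketch only checks the alignment property rather than asserting specific numbers.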

The only exception is when working with SIMD instructions, where you have to manually ensure alignment on most compilers.
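
A minimal sketch of ensuring alignment by hand, assuming a C11 library that provides `aligned_alloc` and a target of 16 bytes (the alignment that aligned SSE loads and stores such as MOVAPS require). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for n floats on a 16-byte boundary.
 * The caller releases the buffer with free().
 * Overflow of n * sizeof(float) is not handled in this sketch. */
float *alloc_simd_floats(size_t n) {
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    /* C11 aligned_alloc expects the size to be a multiple of the
     * alignment, so round the byte count up to the next multiple of 16. */
    bytes = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);
}

/* Returns 1 if the pointer sits on a 16-byte boundary. */
int is_16_byte_aligned(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}
```

For objects with static or automatic storage, `_Alignas(16)` on the declaration is the usual alternative to a dedicated allocation.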

"Secondly, if it is correct, then one should align data structure members on an 8 byte boundary. But I've seen people using a 4-byte alignment instead on these processors."

I don't see how that makes a difference. The CPU can simply issue a read for the 64-bit block that contains those 4 bytes. That means it either gets 4 extra bytes before the requested data, or after it. But in both cases, it only takes a single read. 32-bit alignment of 32-bit-wide data ensures that it won't cross a 64-bit boundary.
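
The argument can be checked with a little arithmetic. The sketch below models the bus as a sequence of 8-byte words and counts how many of them an access touches; `bus_words_touched` is a hypothetical helper, and the model ignores caches entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8-byte bus words touched by an access of `size` bytes
 * starting at `addr`. Assumes size >= 1 and that addr + size does
 * not wrap around the 32-bit address space. */
unsigned bus_words_touched(uint32_t addr, uint32_t size) {
    assert(size >= 1);
    uint32_t first = addr >> 3;              /* index of first 8-byte word */
    uint32_t last  = (addr + size - 1) >> 3; /* index of last 8-byte word  */
    return last - first + 1;
}
```

In this model a 4-byte value at address 0, 4, or 12 touches one word, while the same value at address 6 touches two, which is the case that natural alignment rules out.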

寄意 2024-08-01 20:09:01

The physical bus is 64 bits wide, so the addresses on it are multiples of 8: yes.

However, there are two more factors to consider:

  1. Some x86 instructions address individual bytes, and others work on 32-bit-aligned data (which is where the 4-byte alignment comes from). No core instruction requires 64-bit alignment, and the CPU can handle misaligned data accesses.
  2. If you care about performance, you should think about the cache line, not main memory. Cache lines are much wider.

-小熊_ 2024-08-01 20:09:01

They are justified in doing so because changing to 8-byte alignment would constitute an ABI change, and the marginal performance improvement is not worth the trouble.

As someone else already said, cache lines matter. All accesses on the actual memory bus are in units of cache lines (64 bytes on current x86 CPUs; the original Pentium used 32-byte lines). See the "What Every Programmer Should Know About Memory" paper that was mentioned already. So the actual memory traffic is aligned to the cache line size.
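
To make the cache-line view concrete, here is a small C11 sketch that places a struct at the start of a cache line and computes the line an address belongs to. The 64-byte `CACHE_LINE` value is an assumption about the target CPU, and the `counters` struct and `cache_line_base` function are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size; not true of every x86 CPU */

static_assert((CACHE_LINE & (CACHE_LINE - 1)) == 0,
              "cache line size must be a power of two");

/* Hypothetical struct forced to start on a cache-line boundary. */
struct counters {
    _Alignas(CACHE_LINE) uint64_t hits;
    uint64_t misses;
};

/* Address of the first byte of the cache line containing addr. */
uintptr_t cache_line_base(uintptr_t addr) {
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}
```

Because the size of a struct is a multiple of its alignment, each `struct counters` occupies a whole line here, which trades memory for the guarantee that both fields are fetched together.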

深爱成瘾 2024-08-01 20:09:01

For random access and as long as the data is not misaligned (e.g. crossing a boundary), I don't think that it matters much; the correct address and offset in the data can be found with a simple AND construct in hardware. It gets slow when one read access is not sufficient to get one value. That's also why compilers usually put small values (bytes etc.) together because they don't have to be at a specific offset; shorts should be on even addresses, 32-bit on 4-byte addresses and 64-bit on 8-byte addresses.
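
The "simple AND construct" can be sketched in C as follows. This is a simplified model of a 64-bit bus, assuming the access fits inside one 8-byte word; the function names are hypothetical, and the byte-enable mask only loosely mirrors how such a bus selects bytes within a word.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the 8-byte word containing addr: clear the low 3 bits. */
uint32_t word_address(uint32_t addr) {
    return addr & ~UINT32_C(7);
}

/* Position of addr inside its 8-byte word: keep the low 3 bits. */
uint32_t byte_offset(uint32_t addr) {
    return addr & UINT32_C(7);
}

/* Bit mask with one bit per byte of the word that the access uses.
 * Only valid when the access does not cross into the next word. */
uint8_t byte_enables(uint32_t offset, uint32_t size) {
    assert(size >= 1 && offset + size <= 8);
    return (uint8_t)(((1u << size) - 1u) << offset);
}
```

For example, a 4-byte read at address 0x1004 maps to word 0x1000 with offset 4 and mask 0xF0, so one bus access is enough.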

Note that if you have caching involved and linear data access, things will be different.

不弃不离 2024-08-01 20:09:01

The 64-bit bus you refer to feeds the caches. The CPU always reads and writes entire cache lines. The size of a cache line is always a multiple of 8, and its physical address is indeed aligned on an 8-byte boundary.

Cache-to-register transfers do not use the external databus, so the width of that bus is irrelevant.
