Purpose of memory alignment

Published 2024-07-10 09:48:05


Admittedly I don't get it. Say you have a memory with a memory word of length of 1 byte. Why can't you access a 4 byte long variable in a single memory access on an unaligned address(i.e. not divisible by 4), as it's the case with aligned addresses?


Comments (8)

夢归不見 2024-07-17 09:48:05


The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.

Speed

Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.

The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.

Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:

struct mystruct {
    char c;  // one byte
    int i;   // four bytes
    short s; // two bytes
};

On a 32-bit processor it would most likely be aligned like shown here:

(Figure: aligned struct layout: c at 0x0000, 3 padding bytes, i at 0x0004, s at 0x0008)

The processor can read each of these members in one transaction.

Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:

(Figure: packed struct layout: c at 0x0000, i at 0x0001, s at 0x0005, no padding)

Reading the first byte is going to be the same.

When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.

When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.

Range

For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.

This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.

Atomicity

The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.

Conclusion

The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).

There are many more benefits to adhering to memory alignment that you can read at this IBM article.

A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data in, out, and between more and faster execution units, in a highly reliable way.

Bonus: Caches

Another performance-oriented alignment I alluded to previously is alignment on cache lines, which are (for example, on some CPUs) 64 bytes wide.

For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects. From this question on cache-line sizes:

Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice as slow.

单身情人 2024-07-17 09:48:05


It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.

There is much more information in this link that the OP discovered.

苯莒 2024-07-17 09:48:05


You can with some processors (Nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line. Because the bus is 64 bits wide, you had to fetch 64 bits at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.

So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2-byte variable that was split across 2 chunks, that required double the memory accesses.

So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.

挥剑断情 2024-07-17 09:48:05


Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.

So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.

So:

262,144 bits - size of L1 cache (32 KB)
    128 bits - size of bus (one cache line)

Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.

Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.

There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.

In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.

时光病人 2024-07-17 09:48:05


@joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects which were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effects of different word alignments look like.
In addition, here's a link to a GitHub gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which @joshperry referenced. The tests were run on a MacBook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.


捂风挽笑 2024-07-17 09:48:05


If you have a 32-bit data bus, the address-bus lines connected to the memory will start from A2, so only 32-bit-aligned addresses can be accessed in a single bus cycle.

So if a word spans an address alignment boundary - i.e. A0 for 16/32-bit data or A1 for 32-bit data is not zero - two bus cycles are required to obtain the data.

Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler-generated unaligned-access code requires not just additional bus cycles, but additional instructions, making it even less efficient.

凡间太子 2024-07-17 09:48:05


If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.

风流物 2024-07-17 09:48:05


On PowerPC you can load an integer from an odd address with no problems.

SPARC and x86 and (I think) Itanium raise hardware exceptions when you try this.

One 32-bit load vs four 8-bit loads isn't going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.
