Purpose of memory alignment

Published 2024-07-10 09:48:05


Admittedly I don't get it. Say you have a memory with a memory word of length of 1 byte. Why can't you access a 4 byte long variable in a single memory access on an unaligned address(i.e. not divisible by 4), as it's the case with aligned addresses?


Comments (8)

夢归不見 2024-07-17 09:48:05


The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.

Speed

Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.

The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.

Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:

struct mystruct {
    char c;  // one byte
    int i;   // four bytes
    short s; // two bytes
};

On a 32-bit processor it would most likely be aligned like shown here:

(Figure: aligned struct layout: c at 0x0000, 3 padding bytes, i at 0x0004, s at 0x0008)

The processor can read each of these members in one transaction.

Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:

(Figure: packed struct layout: c at 0x0000, i at 0x0001, s at 0x0005, no padding)

Reading the first byte is going to be the same.

When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.

When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.

Range

For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.

This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.

Atomicity

The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.

Conclusion

The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).

There are many more benefits to adhering to memory alignment that you can read at this IBM article.

A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data in, out, and between more and faster execution units, in a highly reliable way.

Bonus: Caches

Another performance-oriented alignment I alluded to previously is alignment on cache lines, which are (for example, on some CPUs) 64 bytes wide.

For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects. From this question on cache-line sizes:

Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice as slow.

单身情人 2024-07-17 09:48:05


It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.

There is much more information in this link that the OP discovered.

苯莒 2024-07-17 09:48:05


You can with some processors (Nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line. Because the bus is 64 bits wide, you had to fetch 64 bits at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.

So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2-byte variable that was split across 2 chunks, that required double the memory accesses.

So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.

挥剑断情 2024-07-17 09:48:05


Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.

So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.

So:

262,144 bits - size of L1 cache (32 KB)
    128 bits - size of bus (one cache line)

Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.

Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.

There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.

In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.

时光病人 2024-07-17 09:48:05


@joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects which were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effects of different word alignments look like.
In addition, here's a link to a GitHub gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which @joshperry referenced. The tests were run on a MacBook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.


捂风挽笑 2024-07-17 09:48:05


If you have a 32-bit data bus, the address-bus lines connected to the memory will start from A2, so only 32-bit-aligned addresses can be accessed in a single bus cycle.

So if a word spans an address alignment boundary - i.e. A0 for 16/32-bit data or A1 for 32-bit data is not zero - two bus cycles are required to obtain the data.

Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler-generated unaligned-access code requires not just additional bus cycles, but additional instructions, making it even less efficient.

凡间太子 2024-07-17 09:48:05


If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.

风流物 2024-07-17 09:48:05


On PowerPC you can load an integer from an odd address with no problems.

SPARC and x86 and (I think) Itanium raise hardware exceptions when you try this.

One 32-bit load vs four 8-bit loads isn't going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.
