Why does an unaligned address access require 2 or more accesses?

Posted on 2024-09-27 01:30:45

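The usual answers to why data should be aligned are that alignment makes access more efficient and simplifies the design of the CPU.

There is a related question with answers, and another source on the topic, but neither of them resolves my question.

Suppose a CPU has an access granularity of 4 bytes, meaning it reads 4 bytes at a time. The material I listed above says that if I access misaligned data, say at address 0x1, the CPU has to do 2 accesses (one for addresses 0x0, 0x1, 0x2 and 0x3, one for addresses 0x4, 0x5, 0x6 and 0x7) and combine the results. I can't see why. Why can't the CPU just read from 0x1, 0x2, 0x3 and 0x4 when I issue an access to address 0x1? It will not degrade the performance or incur much complexity in the circuitry.

Thanks in advance!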

来日方长 2024-10-04 01:30:45

"It will not degrade the performance or incur much complexity in the circuitry."

It's the false assumptions we take as fact that really throw off further understanding.

Your comment in the other question used much more appropriate wording ("I don't think it would degrade"...)

Did you consider that the memory architecture uses many memory chips in parallel in order to maximize the bandwidth? And that a particular data item is in only one chip? You can't just read whatever chip happens to be most convenient and expect it to have the data you want.

Right now, the CPU and memory can be wired together such that bits 0-7 are wired only to chip 0, 8-15 to chip 1, 16-23 to chip 2, 24-31 to chip 3. And for all integers N, memory location 4N is stored in chip 0, 4N+1 in chip 1, etc. And it is the Nth byte in each of those chips.

Let's look at the memory addresses stored at each offset of each memory chip

memory chip       0       1       2       3
offset

    0             0       1       2       3
    1             4       5       6       7
    2             8       9      10      11
    N            4N    4N+1    4N+2    4N+3

So if you load from memory bytes 0-3, N=0, each chip reports its internal byte 0, the bits all end up in the right places, and everything is great.
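To make the table concrete, here is a minimal sketch of the address-to-chip mapping it describes. The names `locate` and `offsets_needed` are made up for illustration; the only thing taken from the answer is the rule that address 4N+k lives in chip k at offset N.

```python
NUM_CHIPS = 4  # one byte-wide chip per byte lane, as in the table above


def locate(address):
    """Return (chip, offset) for a byte address under the 4-chip interleaving."""
    return address % NUM_CHIPS, address // NUM_CHIPS


def offsets_needed(start, size=4):
    """Distinct per-chip offsets touched by a `size`-byte access at `start`."""
    return {locate(a)[1] for a in range(start, start + size)}


print(locate(5))          # (1, 1): address 5 is byte 1 of chip 1
print(offsets_needed(0))  # {0}: an aligned word needs a single offset
print(offsets_needed(1))  # {0, 1}: an unaligned word needs two offsets
```

An aligned word always maps to exactly one offset, which is why one shared address bus is enough for it.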

Now, if you try to load a word starting at memory location 1, what happens?

First, let's look at the way it is actually done. In the first transaction, memory bytes 1-3, which are stored in memory chips 1-3 at offset 0, end up in bits 8-31, because that's where those memory chips are attached, even though you wanted them in bits 0-23. This isn't a big deal because the CPU can swizzle them internally, using the same circuitry it uses for logical shifts. Then, in the next transaction, memory byte 4, which is stored in memory chip 0 at offset 1, gets read into bits 0-7 and swizzled into bits 24-31 where you wanted it to be.
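The two-transaction load can be simulated roughly as below. This is only a sketch: it assumes little-endian byte order, models each chip as a Python list, and the helpers `make_chips`, `bus_read` and `load_unaligned` are hypothetical names, not anything a real memory controller exposes.

```python
NUM_CHIPS = 4


def make_chips(memory):
    """Split flat memory across 4 chips: address a goes to chip a % 4, offset a // 4."""
    return [list(memory[c::NUM_CHIPS]) for c in range(NUM_CHIPS)]


def bus_read(chips, offset):
    """One transaction: all chips see the same offset; chip c drives bits 8c..8c+7."""
    word = 0
    for c, chip in enumerate(chips):
        word |= chip[offset] << (8 * c)
    return word


def load_unaligned(chips, address):
    """Load a 32-bit word, using a second transaction only when it is needed."""
    offset, lane = divmod(address, NUM_CHIPS)
    low = bus_read(chips, offset)
    if lane == 0:
        return low  # aligned: a single transaction is enough
    high = bus_read(chips, offset + 1)
    shift = 8 * lane
    # Swizzle: drop the unwanted low bytes, then append the bytes from the next word.
    return ((low >> shift) | (high << (32 - shift))) & 0xFFFFFFFF


memory = bytes(range(0x10, 0x18))  # the byte at address a holds 0x10 + a
chips = make_chips(memory)
print(hex(load_unaligned(chips, 0)))  # 0x13121110
print(hex(load_unaligned(chips, 1)))  # 0x14131211
```

Note that `bus_read` takes one offset for all four chips, which is exactly the constraint that forces the second transaction.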

Notice something here. The word you asked for is split across offsets: the first memory transaction read from offset 0 of three chips, and the second memory transaction read from offset 1 of the other chip. Here's where the problem lies. You have to tell the memory chips the offset so they can send you the right data back, and the offset is ~40 bits wide and the signals are VERY high speed. Right now there is only one set of offset signals (called the address bus, BTW) that connects to all the memory chips. To do a single transaction for an unaligned memory access, you would need an independent address bus running to each memory chip. For a 64-bit processor, you'd change from one address bus to eight, an increase of almost 300 pins. In a world where CPUs use between 700 and 1300 pins, this can hardly be called "not much increase in circuitry". Not to mention the huge increase in noise and crosstalk from that many extra high-speed signals.
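The "almost 300 pins" figure follows from simple arithmetic, sketched here with the approximate 40-bit offset width quoted above (the exact width depends on the platform, so treat the constants as assumptions).

```python
ADDRESS_BITS = 40  # approximate offset width quoted above
CHIPS = 8          # byte lanes on a 64-bit data bus

shared_bus_pins = ADDRESS_BITS               # one address bus shared by every chip
independent_bus_pins = ADDRESS_BITS * CHIPS  # one address bus per chip
extra_pins = independent_bus_pins - shared_bus_pins

print(extra_pins)  # 280
```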

Ok, it isn't quite that bad, because there can only be a maximum of two different offsets out on the address bus at once, and one is always the other plus one. So you could get away with one extra wire to each memory chip, saying in effect either (read the offset listed on the address bus) or (read the offset following) which is two states. But now there's an extra adder in each memory chip, which means it has to calculate the offset before actually doing the memory access, which slows down the maximum clock rate for memory. Which means that aligned access gets slower if you want unaligned access to be faster. Since 99.99% of access can be made aligned, this is a net loss.
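The one-extra-wire scheme could look something like the sketch below. It is a hypothetical model, not a description of real DRAM: each chip is a Python list, `plus_one` stands in for the extra wire, and the `offset + 1` inside the loop stands in for the per-chip adder that would sit on the critical path of every access.

```python
NUM_CHIPS = 4


def make_chips(memory):
    """Split flat memory across 4 chips: address a goes to chip a % 4, offset a // 4."""
    return [list(memory[c::NUM_CHIPS]) for c in range(NUM_CHIPS)]


def load_with_select(chips, address):
    """Single transaction: chips in lanes below the start lane read offset + 1."""
    offset, lane = divmod(address, NUM_CHIPS)
    value = 0
    for c, chip in enumerate(chips):
        plus_one = c < lane  # the hypothetical extra wire to chip c
        byte = chip[offset + 1] if plus_one else chip[offset]  # per-chip adder
        position = (c - lane) % NUM_CHIPS  # swizzle the byte into its final lane
        value |= byte << (8 * position)
    return value


memory = bytes(range(0x10, 0x18))  # the byte at address a holds 0x10 + a
chips = make_chips(memory)
print(hex(load_with_select(chips, 0)))  # 0x13121110
print(hex(load_with_select(chips, 1)))  # 0x14131211
```

The result matches the two-transaction approach, but every access, aligned or not, now pays for the select-and-add step inside each chip.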

So that's why unaligned access gets split into two steps. Because the address bus is shared by all the bytes involved. And this is actually a simplification, because when you have different offsets, you also have different cache lines involved, so all the cache coherency logic would have to double to handle twice the communication between CPU cores.

娇俏 2024-10-04 01:30:45

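In my opinion, that's a very simplistic assumption. The circuitry could involve many layers of pipelining and caching optimization to ensure that certain bits of memory are read. Also, memory reads are delegated to the memory subsystems, which may be built from components that differ by orders of magnitude in performance and design complexity, so they may not be able to read in the way that you think.

However, I do add the caveat that I'm not a CPU or memory designer, so I could be talking nonsense.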

ゝ偶尔ゞ 2024-10-04 01:30:45

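The answer to your question is in the question itself.

The CPU has an access granularity of 4 bytes, so it can only fetch data in chunks of 4 bytes.

If you had accessed address 0x0, the CPU would give you the 4 bytes from 0x0 to 0x3.

When you issue an instruction to access data at address 0x1, the CPU takes that as a request for 4 bytes of data starting at 0x1 (i.e. 0x1 to 0x4). Because of the granularity of the CPU, this essentially can't be interpreted in any other way. Hence the CPU fetches the data from 0x0 to 0x3 and from 0x4 to 0x7 (ergo, 2 accesses), then puts the data from 0x1 to 0x4 together as the final result.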
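As a rough illustration of that reasoning, the sketch below lists which aligned 4-byte chunks an access has to fetch. The function name `aligned_accesses` is made up, and the 4-byte granularity is the one assumed in the question.

```python
GRANULARITY = 4  # access granularity assumed in the question


def aligned_accesses(address, size=GRANULARITY):
    """Return the (first, last) byte range of every aligned chunk the access touches."""
    first = address // GRANULARITY * GRANULARITY
    last = (address + size - 1) // GRANULARITY * GRANULARITY
    return [(base, base + GRANULARITY - 1)
            for base in range(first, last + 1, GRANULARITY)]


print(aligned_accesses(0x0))  # [(0, 3)]
print(aligned_accesses(0x1))  # [(0, 3), (4, 7)]
```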

油饼 2024-10-04 01:30:45

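Addressing 4 bytes with the first byte misaligned at 0x1 rather than 0x0 means the access does not start on a word boundary and spills over into the next adjacent word. The first access grabs the 3 bytes up to the word boundary (assuming a 32-bit word), and the second access grabs byte 0x4 to complete the 4-byte, 32-bit word that the memory addressing implementation works in. The object code or assembler effectively does the second access and the concatenation transparently for the programmer. It's best to keep to word boundaries when possible, usually in units of 4 bytes.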
