Why does the CPU access memory on word boundaries?


I have heard many times that data should be properly aligned in memory for better access efficiency, and that the CPU accesses memory on a word boundary.

So in the following scenario, the CPU has to make 2 memory accesses to get a single word.

Supposing: 1 word = 4 bytes

("|" stands for word boundary. "o" stands for byte boundary)


|----o----o----o----|----o----o----o----|   (The word boundary in CPU's eye)
           ----o----o----o----              (What I want to read from memory)

Why does this happen? What is the root cause of the CPU only being able to read at word boundaries?

If the CPU can only access memory at 4-byte word boundaries, the address bus should only need to be 30 bits wide, not 32 bits, because the last 2 bits are always 0 from the CPU's point of view.
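
As a minimal sketch of that arithmetic (my addition, not part of the original question): the low 2 bits of a byte address select a byte within a word, the remaining bits select the word, and a 4-byte read at a non-zero offset necessarily touches two words.

#include <stdio.h>

int main(void) {
    unsigned byte_addr = 0x1006u;             /* arbitrary example byte address */

    unsigned word_index  = byte_addr >> 2;    /* the "30-bit" part mentioned above */
    unsigned byte_offset = byte_addr & 0x3u;  /* the 2 low bits that are 0 when aligned */

    printf("byte address 0x%X -> word index 0x%X, byte offset %u\n",
           byte_addr, word_index, byte_offset);

    /* A 4-byte read starting here is aligned only if the offset is 0;
       otherwise it spans word_index and word_index + 1: two memory accesses. */
    if (byte_offset == 0)
        printf("aligned: one word access\n");
    else
        printf("misaligned: spans words 0x%X and 0x%X\n", word_index, word_index + 1);
    return 0;
}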

ADD 1

What's more, if we accept that the CPU must read at word boundaries, why can't a boundary start at the place I want to read from? The boundaries seem to be fixed from the CPU's point of view.

ADD 2

According to AnT's answer, it seems that the boundaries are hardwired, and that they are hardwired by the memory access hardware. The CPU itself is innocent as far as this is concerned.


Comments (4)

深空失忆 2024-09-25 04:21:19


The meaning of "can" (in "...CPU can access...") in this case depends on the hardware platform.

On the x86 platform, CPU instructions can access data aligned on absolutely any boundary, not only on a "word boundary". Misaligned access might be less efficient than aligned access, but the reasons for that have absolutely nothing to do with the CPU; they have everything to do with how the underlying low-level memory access hardware works. It is quite possible that in this case the memory-related hardware will have to make two accesses to the actual memory, but that is something CPU instructions don't know about and don't need to know about. As far as the CPU is concerned, it can access any data on any boundary. The rest is implemented transparently to CPU instructions.

On hardware platforms like Sun SPARC, CPU cannot access misaligned data (in simple words, your program will crash if you attempt to), which means that if for some reason you need to perform this kind of misaligned access, you'll have to implement it manually and explicitly: split it into two (or more) CPU instructions and thus explicitly perform two (or more) memory accesses.
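
As a rough sketch of what that manual split can look like in C (my illustration, assuming a little-endian layout and a made-up buffer and offset): two aligned word loads, then shifts and an OR to rebuild the misaligned 32-bit value.

#include <stdio.h>
#include <stdint.h>

static uint32_t load_u32_misaligned(const uint32_t *aligned_base, unsigned byte_offset) {
    unsigned word  = byte_offset >> 2;       /* which aligned word holds the first byte */
    unsigned shift = (byte_offset & 3) * 8;  /* how far into that word the value starts */

    uint32_t lo = aligned_base[word];        /* first aligned access */
    if (shift == 0)
        return lo;                           /* already aligned: one access is enough */

    uint32_t hi = aligned_base[word + 1];    /* second aligned access */
    return (lo >> shift) | (hi << (32 - shift));
}

int main(void) {
    /* Two aligned words; the value 0x44332211 sits at byte offset 2. */
    uint32_t mem[2] = { 0x2211ABCDu, 0xEF664433u };
    printf("0x%08X\n", (unsigned)load_u32_misaligned(mem, 2));
    return 0;
}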

As for why it is so... well, that's just how modern computer memory hardware works. The data has to be aligned. If it is not aligned, the access either is less efficient or does not work at all.

A very simplified model of modern memory would be a grid of cells (rows and columns), each cell storing a word of data. A programmable robotic arm can put a word into a specific cell and retrieve a word from a specific cell. One at a time. If your data is spread across several cells, you have no other choice but to make several consecutive trips with that robotic arm. On some hardware platforms the task of organizing these consecutive trips is hidden from CPU (meaning that the arm itself knows what to do to assemble the necessary data from several pieces), on other platforms it is visible to the CPU (meaning that it is the CPU who's responsible for organizing these consecutive trips of the arm).

赴月观长安 2024-09-25 04:21:19


It saves silicon in the addressing logic if you can make certain assumptions about the address (like "bottom n bits are zero"). Some CPUs (x86 and their work-alikes) will put logic in place to turn misaligned data into multiple fetches, concealing some nasty performance hits from the programmer. Most CPUs outside of that world will instead raise a hardware error explaining in no uncertain terms that they don't like this.

All the arguments you're going to hear about "efficiency" are bollocks or, more precisely, are begging the question. The real reason is simply that it saves silicon in the processor core if the number of address bits can be reduced for operations. Any inefficiency that arises from misaligned access (like in the x86 world) is a result of hardware design decisions, not intrinsic to addressing in general.

Now that being said, for most use cases the hardware design decision makes sense. If you're accessing data in two-byte words, most common use cases have you access offset, then offset+2, then offset+4 and so on. Being able to increment the address byte-wise while accessing two-byte words is typically (as in 99.44% certainly) not what you want to be doing. As such it doesn't hurt to require address offsets to align on word boundaries (it's a mild, one-time inconvenience when you design your data structures) but it sure does save on your silicon.
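
This is also what compilers do on your behalf when you lay out those data structures: padding is inserted so that each field lands on its natural boundary. A small sketch of my own (the exact offsets and padding depend on the ABI):

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct Record {
    uint8_t  tag;    /* offset 0 */
                     /* 3 padding bytes are typically inserted here */
    uint32_t value;  /* usually offset 4, so it stays word-aligned */
    uint16_t count;  /* offset 8, followed by trailing padding so that
                        arrays of Record keep 'value' aligned as well */
};

int main(void) {
    printf("offsetof(tag)   = %zu\n", offsetof(struct Record, tag));
    printf("offsetof(value) = %zu\n", offsetof(struct Record, value));
    printf("offsetof(count) = %zu\n", offsetof(struct Record, count));
    printf("sizeof(Record)  = %zu\n", sizeof(struct Record));
    return 0;
}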

As a historical aside, I worked once on an Interdata Model 70 -- a 16-bit minicomputer. It required all memory access to be 16-bit aligned. It also had, even by the standards of the time, a very small amount of memory by the time I was working on it. (It was a relic even back then.) The word alignment was used to double the memory capacity, since the wire-wrapped CPU could easily be hacked: new address decode logic was added that took a 1 in the low bit of the address (previously an alignment error in the making) and used it to switch to a second bank of memory. Try that without alignment logic! :)

我是有多爱你 2024-09-25 04:21:19


Because it is more efficient.

In your example, the CPU would have to do two reads: it has to read in the first half, then read in the second half separately, and then reassemble them to do the computation. That is much more complicated and slower than doing the read in one go, which is what happens when the data is properly aligned.

Some processors, like x86, can tolerate misaligned data access (so you would still need all 32 bits) - others like Itanium absolutely cannot handle misaligned data accesses and will complain quite spectacularly.
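
A portable way to express the read from the question, as a sketch under the usual C rules (the buffer and offset are made up), is to memcpy out of the misaligned location. On x86 this typically compiles down to a single unaligned load, while strict-alignment targets get the smaller loads plus reassembly, so neither family of processors faults.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

static uint32_t read_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);  /* defined behaviour even if p is misaligned */
    return v;
}

int main(void) {
    uint8_t buf[8] = { 0xCD, 0xAB, 0x11, 0x22, 0x33, 0x44, 0x66, 0xEF };
    /* buf + 2 is not 4-byte aligned; dereferencing it directly as a uint32_t*
       would be undefined behaviour and may trap on strict-alignment CPUs. */
    printf("0x%08X\n", (unsigned)read_u32(buf + 2));
    return 0;
}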

岁月无声 2024-09-25 04:21:19


Word alignment is not a feature of CPUs alone.

At the hardware level, most RAM modules have a given word size, in the sense of the number of bits that can be accessed per read/write cycle.

On a module I had to interface with on an embedded device, addressing was implemented through three parameters: the module was organized into four banks, which could be selected prior to the read/write operation. Each of these banks was essentially a large table of 32-bit words which could be addressed through a row and column index.

In this design, access was only possible per cell, so every read operation returned 4 bytes, and every write operation expected 4 bytes.

A memory controller hooked up to this RAM chip could be designed in two ways: either allowing unrestricted access to the memory chip, using several cycles to split/merge unaligned data to/from several cells (with additional logic), or imposing some restrictions on how memory can be accessed, with the gain of reduced complexity.

As complexity can impede maintainability and performance, most designers chose the latter [citation needed]
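
As a toy model of my own of those two controller designs (word-addressed cells, little-endian interpretation, values chosen arbitrarily): one policy spends an extra cycle and some merge logic to hide the misalignment, the other simply rejects it.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define CELLS 16
static uint32_t ram[CELLS];                 /* each cell holds one 32-bit word */

static uint32_t read_cell(unsigned index) { /* one read cycle per cell */
    return ram[index % CELLS];
}

/* Policy A: additional logic in the controller hides the misalignment. */
static uint32_t read_splitting(unsigned byte_addr) {
    unsigned shift = (byte_addr & 3) * 8;
    uint32_t lo = read_cell(byte_addr >> 2);
    if (shift == 0)
        return lo;                          /* aligned: one cycle */
    uint32_t hi = read_cell((byte_addr >> 2) + 1);
    return (lo >> shift) | (hi << (32 - shift));  /* two cycles plus a merge */
}

/* Policy B: reduced complexity, aligned accesses only. */
static bool read_strict(unsigned byte_addr, uint32_t *out) {
    if (byte_addr & 3)
        return false;                       /* signal an alignment fault */
    *out = read_cell(byte_addr >> 2);
    return true;
}

int main(void) {
    ram[1] = 0x2211ABCDu;
    ram[2] = 0xEF664433u;
    uint32_t v;
    printf("splitting controller: 0x%08X\n", (unsigned)read_splitting(6));
    printf("strict controller:    %s\n", read_strict(6, &v) ? "ok" : "alignment fault");
    return 0;
}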
