Aligned and unaligned memory access?

Posted 2024-07-25 18:01:07 · 266 words · 8 views · 0 comments

What is the difference between aligned and unaligned memory access?

I work on a TMS320C64x DSP, and I want to use its intrinsic functions (C functions that map to assembly instructions). Among them are

ushort & _amem2(void *ptr);
ushort & _mem2(void *ptr);

where _amem2 does an aligned access of 2 bytes and _mem2 does unaligned access.

When should I use which?

Comments (6)

一梦浮鱼 2024-08-01 18:01:07

Many computer architectures organize memory into "words" of several bytes each. For example, the 32-bit Intel architecture uses 32-bit words of 4 bytes each. Memory is addressed at the single-byte level, however, so an address can be "aligned", meaning it starts at a word boundary, or "unaligned", meaning it doesn't.

On certain architectures, certain memory operations may be slower on unaligned addresses, or not allowed at all.

So, if you know your addresses are properly aligned, you can use _amem2() for speed. Otherwise, you should use _mem2().

謸气贵蔟 2024-08-01 18:01:07

An aligned memory access means that the pointer (as an integer) is a multiple of a type-specific value called the alignment. The alignment is the natural address multiple at which the type must be stored, or should be stored (e.g. for performance reasons), on a given CPU. For example, a CPU might require that all two-byte loads or stores go through addresses that are multiples of two. For small primitive types (under 4 bytes), the alignment is almost always the size of the type. For structs, the alignment is usually the maximum alignment of any member.

The C compiler always puts variables that you declare at addresses which satisfy the "correct" alignment. So if ptr points to e.g. a uint16_t variable, it will be aligned and you can use _amem2. You need to use _mem2 only if you are accessing e.g. a packed byte array received via I/O, or bytes in the middle of a string.
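To make the two cases above concrete, here is a minimal sketch in portable C11 (not TMS320C64x-specific). The struct sample layout and the helper names read_u16_le and len_member_is_aligned are made up for illustration, and little-endian byte order in the packed buffer is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout, used only for illustration. */
struct sample {
    uint8_t  tag;
    uint16_t len;
    uint32_t crc;
};

/* A struct is at least as strictly aligned as each of its members. */
static_assert(_Alignof(struct sample) >= _Alignof(uint32_t),
              "struct alignment covers its strictest member");

/* Returns nonzero if the compiler placed the len member on a multiple
 * of its alignment (it inserts padding after tag if needed). */
static int len_member_is_aligned(void)
{
    return offsetof(struct sample, len) % _Alignof(uint16_t) == 0;
}

/* Reads a 16-bit field from any offset in a packed byte buffer.
 * It touches memory one byte at a time, so an odd address is fine.
 * Assumption: the buffer stores the field in little-endian order. */
static uint16_t read_u16_le(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((unsigned)p[1] << 8));
}
```

A field that the compiler laid out itself, like len, is where an aligned access is appropriate; a field at an arbitrary offset in a received byte buffer is where the unaligned style of access is needed.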

老旧海报 2024-08-01 18:01:07

I know this is an old question with a selected answer, but I didn't see anyone explain what the difference between aligned and unaligned memory access actually is...

Be it DRAM, SRAM, flash, or something else, memory is built out of bits. Take an SRAM as a simple example: a specific SRAM is a fixed number of bits wide and a fixed number of rows deep. Let's say 32 bits wide and many rows deep.

If I do a 32-bit write to address 0x0000 in this SRAM, the memory controller around it can simply do a single write cycle to row 0.

If I do a 32-bit write to address 0x0001, assuming that is allowed, the controller needs to read row 0, modify three of its bytes while preserving one, and write the result back to row 0; then read row 1, modify one byte while leaving the other three as found, and write that back. Which bytes get modified depends on the endianness of the system.

The former is aligned and the latter unaligned. There is clearly a performance difference, plus the extra logic needed to do the four memory cycles and merge the byte lanes.

If I read 32 bits from address 0x0000, it takes a single read of row 0 and I'm done. But a read from 0x0001 takes two reads, row 0 and row 1. Depending on the system design, the controller either sends those 64 bits back to the processor, possibly in two bus clocks instead of one, or has extra logic so that the 32 bits are aligned on the data bus in one bus cycle.

16-bit reads are a little better. A read from 0x0000, 0x0001, or 0x0002 only reads row 0. Depending on the system/processor design, the controller can send those 32 bits back and let the processor extract the halfword, or shift them in the memory controller so that they land on specific byte lanes and the processor doesn't have to rotate them. One or the other has to do it, if not both. A read from 0x0003, though, is like the case above: you have to read row 0 and row 1, as one of your bytes is in each, and then either send 64 bits back for the processor to extract or have the memory controller combine the bits into one 32-bit bus response (assuming for these examples that the bus between the processor and the memory controller is 32 bits wide).

A 16-bit write, though, always ends up with at least one read-modify-write in this example SRAM. Addresses 0x0000, 0x0001, and 0x0002 read row 0, modify two bytes, and write back. Address 0x0003 reads two rows, modifies one byte in each, and writes both back.

For 8 bits you only need to read the one row containing that byte. Writes, though, are a read-modify-write of one row.
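The cycle counts in this walkthrough can be reproduced with a small model. This is only a sketch of the simplified 32-bit-wide SRAM described above, not of any real controller; the names ROW_BYTES, rows_touched, read_cycles, and write_cycles are invented for the example, and it assumes every partial-row write costs one read plus one write.

```c
#include <assert.h>
#include <stdint.h>

#define ROW_BYTES 4u  /* the example SRAM is 32 bits (4 bytes) wide */

/* Number of rows that an access of `size` bytes starting at `addr` touches. */
static unsigned rows_touched(uint32_t addr, unsigned size)
{
    assert(size >= 1 && size <= ROW_BYTES);
    uint32_t first = addr / ROW_BYTES;
    uint32_t last  = (addr + size - 1) / ROW_BYTES;
    return (unsigned)(last - first + 1);
}

/* Memory cycles for a read: one read cycle per row touched. */
static unsigned read_cycles(uint32_t addr, unsigned size)
{
    return rows_touched(addr, size);
}

/* Memory cycles for a write: a full, aligned row is a single write cycle;
 * anything else is a read-modify-write (2 cycles) on each row touched. */
static unsigned write_cycles(uint32_t addr, unsigned size)
{
    if (size == ROW_BYTES && addr % ROW_BYTES == 0)
        return 1;
    return 2 * rows_touched(addr, size);
}
```

In this model write_cycles(0x0000, 4) is 1 and write_cycles(0x0001, 4) is 4, which are the single write cycle and the four memory cycles mentioned above.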

The ARMv4 didn't like unaligned accesses. You could disable the trap, but the result was not what you would expect from the description above (not important here). Current ARMs allow unaligned accesses and give you the behavior above; you can change a bit in a control register and then they will abort unaligned transfers. MIPS used to not allow it; I'm not sure what they do now. x86, 68K, etc. allowed it, and the memory controller may have had to do most of the work.

The designs that don't permit it clearly do so for performance and less logic, at what some would say is a burden on programmers; others might say it is no extra work for the programmer, or even easier. Aligned or not, you can also see why it can be better not to try to save memory by making 8-bit variables, and instead go ahead and burn a 32-bit word, or whatever the natural size of a register or the bus is. It may help your performance at the small cost of a few bytes. Not to mention the extra code the compiler would need to add to make, let's say, a 32-bit register mimic an 8-bit variable: masking and sometimes sign extension. When you use register-native sizes, those additional instructions are not required. You can also pack multiple things into a bus- or memory-wide location, do one memory cycle to collect or write them, and then use some extra instructions to move them between registers, which costs no RAM and is possibly a wash on the number of instructions.
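As a rough illustration of the packing idea and of the masking and sign extension mentioned above, here is a sketch in portable C. The helper names pack4, lane_u8, and lane_s8 are hypothetical, and putting lane 0 in the low byte is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Packs four 8-bit values into one 32-bit word; lane 0 is the low byte. */
static uint32_t pack4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 | ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* Extracts lane 0..3 as an unsigned value: one shift and one mask. */
static uint32_t lane_u8(uint32_t word, unsigned lane)
{
    assert(lane < 4);
    return (word >> (8 * lane)) & 0xFFu;
}

/* Extracts lane 0..3 as a signed value, sign-extended to 32 bits. */
static int32_t lane_s8(uint32_t word, unsigned lane)
{
    uint32_t v = lane_u8(word, lane);
    return (int32_t)(v ^ 0x80u) - 0x80;  /* portable sign extension */
}
```

One 32-bit load or store moves all four values at once; the shifts and masks afterwards work on registers only, which is the trade described above.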

I don't agree that the compiler will always align the data correctly for the target; there are ways to break that. And if the target doesn't support unaligned accesses, you will hit the fault. Programmers would never need to talk about this if the compiler always got it right for any legal code you could come up with; there would be no reason for this question unless it was about performance. If you don't control whether the void pointer's address is aligned, then you have to use the _mem2() unaligned access all the time, or do an if-then-else in your code based on the value of ptr, as nik pointed out. By declaring the parameter as void *, the C compiler has no way to deal with your alignment correctly, and it won't be guaranteed. If you take a char *ptr and feed it to these functions, all bets are off on the compiler getting it right without extra code, either buried in the _mem2() function or added outside these two functions. So, as written in your question, _mem2() is the only correct answer.
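The if-then-else on the pointer value could look roughly like the sketch below. It is portable C, so load16_aligned and load16_unaligned are hypothetical stand-ins for the aligned and unaligned accesses (they are not the TI intrinsics), and load16 is an invented wrapper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for an aligned 16-bit load (hypothetical name).
 * The caller must pass an even address that refers to a 16-bit object. */
static uint16_t load16_aligned(const void *ptr)
{
    assert(((uintptr_t)ptr & 1u) == 0);
    return *(const uint16_t *)ptr;
}

/* Stand-in for an unaligned 16-bit load (hypothetical name).
 * memcpy makes no alignment assumption about ptr. */
static uint16_t load16_unaligned(const void *ptr)
{
    uint16_t value;
    memcpy(&value, ptr, sizeof value);
    return value;
}

/* The if-then-else: pick the access from the pointer's value at run time. */
static uint16_t load16(const void *ptr)
{
    if ((uintptr_t)ptr & 1u)
        return load16_unaligned(ptr);
    return load16_aligned(ptr);
}
```

Whether the run-time test is worth it depends on how much cheaper the aligned access really is on the target; if the difference is small, always using the unaligned access is the simpler choice.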

DRAM, say the kind used in your desktop or laptop, tends to be 64 or 72 (with ECC) bits wide, and every access to it is aligned, even though the memory sticks are actually made up of chips that are 8, 16, or 32 bits wide. (This may be changing with phones/tablets for various reasons.) The memory controller and ideally at least one cache sit in front of this DRAM, so that unaligned accesses, and even aligned accesses smaller than the bus width, have their read-modify-writes dealt with in the cache SRAM, which is way faster, and the DRAM accesses are all aligned, full-bus-width accesses. If you have no cache in front of the DRAM and the controller is designed for full-width accesses, that is the worst performance. If it is designed to light up the byte lanes separately (assuming 8-bit-wide chips), you don't have the read-modify-writes but you do have a more complicated controller. If the typical use case is with a cache (if there is one in the design), it may not make sense to put that additional work in the controller for each byte lane; just have it know how to do transfers of the full bus width or multiples of it.

晨曦慕雪 2024-08-01 18:01:07

Aligned addresses are those which are multiples of the access size in question.

  • Access of 4-byte words at addresses that are multiples of 4 will be aligned
  • Access of 4 bytes from (say) address 3 will be an unaligned access

It is very likely that the _mem2 function, which also has to work for unaligned accesses, needs extra work in its code to handle every alignment correctly. This means that _mem2 is likely to be costlier than its _amem2 counterpart.

So, when you need performance (particularly when you know that the access latency is high) it would be prudent to identify when you can use the aligned access. The _amem2 exists for this very purpose -- to give you performance when you know the access is aligned.

When it comes to 2 byte accesses, identifying aligned operations is very simple.
If all the access addresses for the operation are 'even' (that is, their LSB is zero), you have 2-byte alignment. This can be easily checked with,

if (address & 1) {
    /* we have an odd address; not aligned */
} else {
    /* we have an even address; it's aligned to 2 bytes */
}
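The same test generalizes to any power-of-two access size. A minimal sketch, assuming the usual flat address model where converting a pointer to uintptr_t yields its byte address; is_aligned is a made-up helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if ptr is a multiple of `alignment`, which must be a
 * power of two (2 for a 16-bit access, 4 for a 32-bit access, ...). */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

With alignment set to 2 this reduces to the LSB check above.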

失而复得 2024-08-01 18:01:07

_mem2 is more general: it works whether or not ptr is aligned. _amem2 is stricter: it requires that ptr be aligned (though it is presumably slightly more efficient). So use _mem2 unless you can guarantee that ptr is always aligned.

始终不够爱げ你 2024-08-01 18:01:07

Many processors have alignment restrictions on memory access. Unaligned access either generates an exception interrupt (e.g. ARM), or is just slower (e.g. x86).

_mem2 is probably implemented as fetching two bytes and using shift and bitwise OR operations to make a 16-bit ushort out of them.

_amem2 probably just reads the 16-bit ushort from the specified ptr.

I don't know the TMS320C64x specifically, but I'd guess it requires 16-bit alignment for 16-bit memory accesses. So you can always use _mem2, at a performance penalty, and use _amem2 when you can guarantee that ptr is an even address.
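In portable C, the two guessed implementations would look something like the sketch below. This only illustrates the guess above, not how the TI intrinsics are actually implemented; the names load16_bytes_le and load16_direct are invented, and little-endian byte order is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Guess at an unaligned load: fetch two bytes, then shift and OR.
 * Assumption: the 16-bit value is stored in little-endian order. */
static uint16_t load16_bytes_le(const void *ptr)
{
    const uint8_t *p = (const uint8_t *)ptr;
    return (uint16_t)(p[0] | ((unsigned)p[1] << 8));
}

/* Guess at an aligned load: a single 16-bit read from an even address. */
static uint16_t load16_direct(const uint16_t *ptr)
{
    assert(((uintptr_t)ptr & 1u) == 0);
    return *ptr;
}
```

The byte-wise version does two loads plus a shift and an OR where the direct version does one load, which is where the expected performance penalty comes from.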
