Aligning along 4-byte boundaries

Posted on 2024-07-30 11:43:06

I recently got thinking about alignment... It's something that we don't ordinarily have to consider, but I've realized that some processors require objects to be aligned along 4-byte boundaries. What exactly does this mean, and which specific systems have alignment requirements?

Suppose I have an arbitrary pointer:

unsigned char* ptr

Now, I'm trying to retrieve a double value from a memory location:

double d = *((double*)ptr);

Is this going to cause problems?

Comments (9)

醉生梦死 2024-08-06 11:43:06

It can definitely cause problems on some systems.

For example, on some ARM-based systems (notably older cores) you cannot address a 32-bit word that is not aligned to a 4-byte boundary; doing so results in an access violation exception. On x86 you can access such unaligned data, though performance suffers a little, since two words have to be fetched from memory instead of just one.

浪推晚风 2024-08-06 11:43:06

Here's what the Intel x86/x64 Reference Manual says about alignments:

4.1.1 Alignment of Words, Doublewords, Quadwords, and Double Quadwords

Words, doublewords, and quadwords do not need to be aligned in memory on natural boundaries. The natural boundaries for words, double words, and quadwords are even-numbered addresses, addresses evenly divisible by four, and addresses evenly divisible by eight, respectively. However, to improve the performance of programs, data structures (especially stacks) should be aligned on natural boundaries whenever possible. The reason for this is that the processor requires two memory accesses to make an unaligned memory access; aligned accesses require only one memory access. A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access.

Some instructions that operate on double quadwords require memory operands to be aligned on a natural boundary. These instructions generate a general-protection exception (#GP) if an unaligned operand is specified. A natural boundary for a double quadword is any address evenly divisible by 16. Other instructions that operate on double quadwords permit unaligned access (without generating a general-protection exception). However, additional memory bus cycles are required to access unaligned data from memory.

Don't forget, reference manuals are the ultimate source of information for the responsible developer and engineer, so if you're dealing with something well documented such as Intel CPUs, just look up what the reference manual says about the issue.

楠木可依 2024-08-06 11:43:06

Yes, that can cause a number of problems. The C++ standard doesn't actually guarantee that it'll work. You can't just arbitrarily cast between pointer types.

When you cast a char pointer to a double pointer, it uses a reinterpret_cast, which applies an implementation-defined mapping. You're not guaranteed that the resulting pointer will contain the same bit pattern, or that it will point to the same address or, well, anything else. In more practical terms, you're also not guaranteed that the value you're reading is aligned properly. If the data was written as a series of chars, then they will use char's alignment requirements.

As for what alignment means: essentially, the starting address of the value should be divisible by the alignment size. Address 16 is aligned on 1-, 2-, 4-, 8- and 16-byte boundaries, for example, so on typical CPUs, values of these sizes can be stored there.

Address 6 isn't aligned on a 4-byte boundary, so we should not store 4-byte values there.
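
To make the divisibility rule concrete, here is a minimal sketch of an alignment check. The helper names `isAlignedAddress` and `isAlignedPointer` are made up for illustration; they are not part of any standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `address` is a multiple of `alignment`.
inline bool isAlignedAddress(std::uintptr_t address, std::size_t alignment) {
    assert(alignment != 0);  // an alignment of zero makes no sense
    return address % alignment == 0;
}

// Same check for a real pointer: convert it to an integer first.
inline bool isAlignedPointer(const void* p, std::size_t alignment) {
    return isAlignedAddress(reinterpret_cast<std::uintptr_t>(p), alignment);
}
```

With these, `isAlignedAddress(16, 4)` is true and `isAlignedAddress(6, 4)` is false, while `isAlignedAddress(6, 2)` is true, since 6 is still a valid 2-byte boundary.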

It's worth noting that even on CPUs that don't enforce or require alignment, you typically still get a significant slowdown from accessing unaligned values.

与酒说心事 2024-08-06 11:43:06

Alignment affects the layout of structs. Consider this struct:

struct S {
  char a;
  long b;
};

On a 32-bit CPU the layout of this struct will often be:

a _ _ _ b b b b

The requirement is that a 32-bit value has to be aligned on a 32-bit boundary. If the struct is changed like this:

struct S {
  char a;
  short b;
  long c;
};

the layout will be this:

a _ b b c c c c

The 16-bit value is aligned on a 16-bit boundary.
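
If you want to see these layouts for yourself, `offsetof` and `sizeof` will report them. This is only a sketch: it swaps `short` and `long` for the fixed-width types `std::int16_t` and `std::int32_t` (my substitution, since `long` is 8 bytes on many 64-bit platforms), and the struct names are made up. The numbers in the comments are what mainstream compilers typically produce, not something the standard guarantees.

```cpp
#include <cstddef>
#include <cstdint>

// Typically laid out as: a _ _ _ b b b b
struct TwoFields {
    char a;
    std::int32_t b;
};

// Typically laid out as: a _ b b c c c c
struct ThreeFields {
    char a;
    std::int16_t b;
    std::int32_t c;
};

// Offsets and sizes as chosen by the compiler for this platform.
const std::size_t twoFieldsOffsetB   = offsetof(TwoFields, b);    // usually 4
const std::size_t threeFieldsOffsetB = offsetof(ThreeFields, b);  // usually 2
const std::size_t threeFieldsOffsetC = offsetof(ThreeFields, c);  // usually 4
const std::size_t twoFieldsSize      = sizeof(TwoFields);         // usually 8
const std::size_t threeFieldsSize    = sizeof(ThreeFields);       // usually 8
```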

Sometimes you want to pack the structs, for example to match a struct to a data format. By using a compiler option, or perhaps a #pragma, you can remove the excess space:

a b b b b
a b b c c c c

However, accessing an unaligned member of a packed struct will often be much slower on modern CPUs, or may even result in an exception.
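
As a rough illustration of packing, the sketch below uses `#pragma pack(push, 1)`, which MSVC, GCC and Clang all accept; other compilers may need a different spelling. The struct name `PackedThree` and the helper `readPackedC` are made up, and fixed-width integer types stand in for `short` and `long`.

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)  // no padding between members from here on
struct PackedThree {
    char a;
    std::int16_t b;  // now at offset 1, not 2
    std::int32_t c;  // now at offset 3, not 4
};
#pragma pack(pop)      // restore the default packing

// Reading a member through the struct is fine: the compiler knows the
// member may be misaligned and emits a suitable (possibly slower) access.
inline std::int32_t readPackedC(const PackedThree& s) {
    return s.c;
}
```

What you should avoid is taking the address of `c` and handing it to code that expects an ordinary, aligned `std::int32_t*`.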

毁梦 2024-08-06 11:43:06

Yes, that could cause problems.

4-alignment simply means that the pointer, when considered as a numeric address, is a multiple of 4. If the pointer is not a multiple of the required alignment, then it is unaligned. There are two reasons why compilers place alignment restrictions on certain types:

  1. Because the hardware cannot load that datatype from an unaligned pointer (at least, not using the instructions which the compiler wants to emit for loads and stores).
  2. Because the hardware loads that datatype more quickly from aligned pointers.

If you're in case (1), and double is 4-aligned, and you try your code with a char * pointer which is not 4-aligned, then you'll most likely get a hardware trap. Some hardware does not trap. It just loads a nonsense value and continues. However, the C++ standard doesn't define what can happen (undefined behavior), so this code could set your computer on fire.

On x86, you're never in case (1), because the standard load instructions can handle unaligned pointers. On many ARM cores there are no unaligned loads, and if you attempt one then your program crashes if you're lucky (some ARMs silently fail).

Coming back to your example, the question is why you're trying this with a char * that isn't 4-aligned. If you successfully wrote a double there via a double *, then you'll be able to read it back. So if you originally had a "proper" pointer to double, which you cast to char * and you're now casting back, you don't have to worry about alignment.

But you said arbitrary char *, so I guess that's not what you have. If you read a chunk of data out of a file, which contains a serialized double, then you must ensure that the alignment requirements for your platform are met in order to do this cast. If you have 8 bytes representing a double in some file format, then you cannot just read it willy-nilly into a char* buffer at any offset and then cast to double *.

The easiest way to do this is to make sure that you read the file data into a suitable struct. You're also helped by the fact that memory allocations are always aligned to the maximum alignment requirement of any type they're big enough to contain. So if you allocate a buffer big enough to contain a double, then the start of that buffer has whatever alignment is required by double. So then you can read the 8 bytes representing the double into the start of the buffer, cast (or use a union) and read the double out.
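
A quick way to convince yourself of the allocation guarantee is to check the address that `malloc` hands back. This is just a sketch, and `mallocResultIsDoubleAligned` is a made-up name:

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical check: allocate room for one double and test whether the
// returned address is a multiple of alignof(double).
inline bool mallocResultIsDoubleAligned() {
    void* p = std::malloc(sizeof(double));
    if (p == nullptr) {
        return false;  // allocation failed, nothing to check
    }
    const bool aligned =
        reinterpret_cast<std::uintptr_t>(p) % alignof(double) == 0;
    std::free(p);
    return aligned;
}
```

On a conforming implementation this should always return true (barring allocation failure). The guarantee only covers the start of the allocation, not arbitrary offsets into it.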

Alternatively, you could do something like this:

#include <cstring>  // std::memcpy

double readUnalignedDouble(const char *un_ptr) {
    double d;
    // Copy the bytes; std::copy(un_ptr, un_ptr + sizeof(d),
    // reinterpret_cast<char *>(&d)) from <algorithm> is equivalent.
    std::memcpy(&d, un_ptr, sizeof(d));
    return d;
}

This is guaranteed to be valid (assuming un_ptr really points to the bytes of a valid double representation for your platform), because double is POD and hence can be copied byte-by-byte. It may not be the fastest solution, if you have a lot of doubles to load.
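
For illustration, here is a small round trip through a deliberately misaligned position. The names `loadDoubleFromBytes` and `roundTripAtOddOffset` are made up for this sketch; the buffer is aligned with `alignas(double)` so that offset 1 is certainly not a valid position for a double on platforms where double has an alignment greater than 1.

```cpp
#include <cstring>

// Byte-wise load, valid for any address.
inline double loadDoubleFromBytes(const char* bytes) {
    double d;
    std::memcpy(&d, bytes, sizeof(d));
    return d;
}

// Stores `value` at offset 1 of an aligned buffer, then reads it back
// without ever forming a misaligned double*.
inline double roundTripAtOddOffset(double value) {
    alignas(double) char buffer[sizeof(double) + 1];
    std::memcpy(buffer + 1, &value, sizeof(value));
    return loadDoubleFromBytes(buffer + 1);
}
```

Compilers generally turn a fixed-size `memcpy` like this into a plain load on targets that tolerate unaligned access, so the cost is often smaller than it looks.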

If you are reading from a file, there's actually a bit more to it than that if you're worried about platforms with non-IEEE double representations, or with 9 bit bytes, or some other unusual properties, where there might be non-value bits in the stored representation of a double. But you didn't actually ask about files, I just made it up as an example, and in any case those platforms are much rarer than the issue you're asking about, which is for double to have an alignment requirement.

Finally, nothing at all to do with alignment, you also have strict aliasing to worry about if you got that char * via a cast from a pointer which is not alias-compatible with double *. Aliasing is valid between char * itself and anything else, though.

潦草背影 2024-08-06 11:43:06

On x86 it's always going to run, though of course more efficiently when aligned.

But if you're multithreading, then watch for read/write tearing. With a 64-bit value you need an x64 machine to give you atomic reads and writes between threads.
If, say, you read the value from another thread while it is being incremented from 0x00000000.FFFFFFFF to 0x00000001.00000000, then the reading thread might in theory see either 0 or 0x00000001.FFFFFFFF, especially if the value straddles a cache-line boundary.
I recommend Duffy's "Concurrent Programming on Windows" for its nice discussion of memory models; it even mentions alignment gotchas on multiprocessors when .NET does a GC. You want to stay away from the Itanium!
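
To show where those two torn values come from, the sketch below splices together 32-bit halves of the old and new values, which is what a reader could observe if the two halves were written separately. It is plain single-threaded arithmetic, not a reproduction of the race, and the helper names are made up.

```cpp
#include <cstdint>

inline std::uint32_t highHalf(std::uint64_t v) {
    return static_cast<std::uint32_t>(v >> 32);
}

inline std::uint32_t lowHalf(std::uint64_t v) {
    return static_cast<std::uint32_t>(v & 0xFFFFFFFFu);
}

// Builds a 64-bit value from two independently read 32-bit halves.
inline std::uint64_t combineHalves(std::uint32_t high, std::uint32_t low) {
    return (static_cast<std::uint64_t>(high) << 32) | low;
}

// Old high half with new low half, and the other way round.
inline std::uint64_t tornOldHighNewLow(std::uint64_t before, std::uint64_t after) {
    return combineHalves(highHalf(before), lowHalf(after));
}

inline std::uint64_t tornNewHighOldLow(std::uint64_t before, std::uint64_t after) {
    return combineHalves(highHalf(after), lowHalf(before));
}
```

In real code, `std::atomic<std::uint64_t>` is the usual way to rule this out.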

猫七 2024-08-06 11:43:06

SPARC (Solaris machines) is another architecture (or at least some models were in times past) that will choke (give a SIGBUS error) if you try to use an unaligned value.

As an addendum to Martin York's answer: malloc also returns memory aligned for the largest possible type, i.e. it's safe for everything, like 'new'. In fact, 'new' frequently just uses malloc.

彩扇题诗 2024-08-06 11:43:06

An example of an alignment requirement is when using vectorization (SIMD) instructions. (SIMD can be used without alignment, but it is much faster if you use the kind of instruction that requires alignment.)
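
As a sketch of how you might get data into a shape that such instructions accept, C++11 `alignas` can request 16-byte alignment for a type. The 16-byte figure matches 128-bit SIMD registers such as SSE; wider vector units want more. The struct name `Float4` is made up, and no actual SIMD intrinsics are used here.

```cpp
#include <cstdint>

// Four floats that always start on a 16-byte boundary.
struct alignas(16) Float4 {
    float v[4];
};

// Because sizeof(Float4) is a multiple of its alignment, every element
// of an array of Float4 is 16-byte aligned as well.
inline bool startsOnSixteenByteBoundary(const Float4* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```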

生生漫 2024-08-06 11:43:06

Enforced memory alignment is much more common in RISC-based architectures such as MIPS.
The main thinking for these types of processors, AFAIK, is really a speed issue.
The RISC methodology was all about having a set of simple and fast instructions (usually one memory cycle per instruction). This does not necessarily mean that it has fewer instructions than a CISC processor, more that it has simpler, faster instructions.
Many MIPS processors, although byte addressable, would do a word-aligned load (typically 32 bits, but not always) and then mask off the appropriate bits.
The idea is that it is faster to do an aligned load plus a bit mask than to try to do an unaligned load.
Typically (and of course this really depends on the chipset), doing an unaligned load would generate a bus error, so RISC processors would offer an 'unaligned load/store' instruction, but this would often be much slower than the corresponding aligned load/store.
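
The "aligned load plus mask" idea can be sketched in software. The function below pretends memory can only be read as whole 32-bit words and extracts one byte from the word that contains it. The name `loadByteViaAlignedWord` is made up, and numbering bytes from the least significant end of each word (little-endian style) is an assumption of this sketch, not a statement about any particular MIPS part.

```cpp
#include <cstddef>
#include <cstdint>

// Emulates a byte load using only aligned 32-bit word loads.
inline std::uint8_t loadByteViaAlignedWord(const std::uint32_t* memory,
                                           std::size_t byteAddress) {
    // Word-aligned load: drop the low two address bits.
    const std::uint32_t word = memory[byteAddress / 4];
    // Position of the byte inside the word (assumed little-endian numbering).
    const unsigned shift = static_cast<unsigned>(byteAddress % 4) * 8;
    // Shift the wanted byte down and mask off the rest.
    return static_cast<std::uint8_t>((word >> shift) & 0xFFu);
}
```

A value that spans two words needs two such loads plus some stitching, which is roughly why unaligned accesses cost more.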

Of course this still doesn't answer the question of why they do this, i.e. what advantage does having memory word-aligned give you?
I'm no hardware expert and I'm sure someone on here can give a better answer, but my two best guesses are:
1. It can be much faster to fetch from the cache when word aligned, because many caches are organised into cache lines (anything from 8 to 512 bytes), and as cache memory is typically much more expensive than RAM, you want to make the most of it.
2. It may be much faster to access each memory address, as it allows you to read through 'burst mode' (i.e. fetching the next sequential address before it's needed).

Note that none of the above is strictly impossible with non-aligned stores; I'm guessing (though I don't know) that a lot of it comes down to hardware design choices and cost.
