Are there performance issues when using pragma pack(1)?
Our headers use #pragma pack(1) around most of our structs (used for net and file I/O). I understand that it changes the alignment of structs from the default of 8 bytes to an alignment of 1 byte.
Assuming that everything is run in 32-bit Linux (perhaps Windows too), is there any performance hit that comes from this packing alignment?
I'm not concerned about portability for libraries, but more with compatibility of file and network I/O with different #pragma packs, and performance issues.
Memory access is fastest when it can take place at word-aligned memory addresses. The simplest example is the following struct (which @Didier also used):
By default, GCC inserts padding, so a is at offset 0, and b is at offset 4 (word-aligned). Without padding, b isn't word-aligned, and access is slower.
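The struct itself did not survive in this copy of the thread. From the offsets described, it is presumably a char followed by a 32-bit integer. A minimal sketch under that assumption follows; the names sample_padded, sample_packed and the helper functions are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Default layout: the compiler may insert padding after 'a'. */
struct sample_padded {
    char    a;
    int32_t b;
};

/* Same members under 1-byte packing: no padding is inserted. */
#pragma pack(push, 1)
struct sample_packed {
    char    a;
    int32_t b;
};
#pragma pack(pop)

/* Hypothetical helpers that expose the layout so it can be printed or checked. */
size_t padded_offset_of_b(void) { return offsetof(struct sample_padded, b); }
size_t packed_offset_of_b(void) { return offsetof(struct sample_packed, b); }

/* With packing, b starts immediately after a, at an odd offset. */
void check_packed_layout(void)
{
    assert(offsetof(struct sample_packed, b) == 1);
    assert(sizeof(struct sample_packed) == 5);   /* 1 + 4, no padding */
}
```

On common 32-bit and 64-bit x86 toolchains padded_offset_of_b() is expected to return 4, but that value is chosen by the compiler and ABI, so it is worth printing rather than assuming.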
How much slower?
As with most performance questions, you'd have to benchmark your application to see how much of an issue this is in practice.
Regarding portability: I assume that you're using #pragma pack(1) so that you can send structs across the wire and to and from disk without worrying about different compilers or platforms packing structs differently. This is valid; however, there are a couple of issues to keep in mind:
Yes. There absolutely are.
For instance, if you define a struct:
then whenever you access the member i, the CPU is slowed, because the 32-bit value i is not accessible in a native, aligned way. To make it simple, imagine that the CPU has to get 3 bytes from memory, and then 1 more byte from the next location, to transfer the value from memory to the CPU registers.
When you declare a struct, most compilers insert padding bytes between members to ensure that they are aligned to appropriate addresses in memory (usually each member is aligned to a multiple of its type's size). This lets the compiler generate optimized code for accessing these members.
#pragma pack(1) instructs the compiler to pack structure members with a particular alignment. The 1 here tells the compiler not to insert any padding between members.
So yes, there is a definite performance penalty, since you force the compiler to do something beyond what it would naturally do for performance optimization. Also, some platforms demand that objects be aligned at specific boundaries, and using unaligned structures might give you segmentation faults.
Ideally, it is best to avoid changing the default natural alignment rules. But if the 'pragma pack' directive cannot be avoided at all (as in your case), then the original packing scheme must be restored after the definition of the structures that require tight packing.
For example:
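The answer's own snippet is missing from this copy. A sketch of the save-and-restore pattern using the push/pop form of the pragma, which GCC, Clang and MSVC accept, could look like this. The struct names and fields are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Save the current packing setting, then switch to 1-byte packing. */
#pragma pack(push, 1)
struct wire_header {            /* hypothetical on-the-wire record */
    uint8_t  type;
    uint32_t length;
    uint16_t checksum;
};
#pragma pack(pop)               /* restore the packing that was in effect before */

/* Declared after the pop, so it gets the default alignment rules again. */
struct in_memory_record {
    uint8_t  type;
    uint32_t length;
    uint16_t checksum;
};

void check_pack_scope(void)
{
    assert(sizeof(struct wire_header) == 7);   /* 1 + 4 + 2, no padding */
    assert(sizeof(struct in_memory_record) >= sizeof(struct wire_header));
}
```

Keeping the push and pop in the same header, tightly around the affected structs, avoids silently changing the layout of every struct declared in files that include it later.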
It depends on the underlying architecture and the way it handles unaligned addresses.
x86 handles unaligned addresses gracefully, although at a performance cost, while other architectures such as ARM may invoke an alignment fault (SIGBUS), or even "round" the misaligned address to the closest boundary, in which case your code will fail in a hideous way.
Bottom line: pack it only if you are sure that the underlying architecture will handle unaligned addresses, and if the cost of network I/O is higher than the processing cost.
Absolutely. In January 2020, Microsoft's Raymond Chen posted concrete examples of how using #pragma pack(1) can produce bloated executables that take many, many more instructions to perform operations on packed structures, especially on non-x86 hardware that doesn't directly support misaligned accesses in hardware.
Anybody who writes #pragma pack(1) may as well just wear a sign on their forehead that says “I hate RISC”
#pragma pack(1) and its variations are also subtly dangerous, even on x86 systems where they supposedly "work".
Technically, yes, it would affect performance, but only with regard to internal processing. If you need the structures packed for network/file I/O, there's a balance between the packing requirement and internal processing. By internal processing, I mean the work you do on the data between I/O operations. If you do very little processing, you won't lose much in terms of performance. Otherwise, you may wish to do internal processing on properly aligned structures and only "pack" the results when doing I/O. Or you could switch to using only default-aligned structures, but you'll need to ensure everyone aligns them the same way (network and file clients).
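One way to read the "process aligned, pack only for I/O" suggestion is to keep two struct types and convert at the I/O boundary. The sketch below assumes that reading. The type and function names are invented for illustration, and byte order is deliberately not handled.

```c
#include <stdint.h>

/* Naturally aligned form used for all internal processing. */
struct record {
    uint8_t  kind;
    uint32_t length;
    uint16_t flags;
};

/* Packed form that is only used when reading from or writing to I/O. */
#pragma pack(push, 1)
struct record_wire {
    uint8_t  kind;
    uint32_t length;
    uint16_t flags;
};
#pragma pack(pop)

/* Member-by-member copies: the compiler knows record_wire is packed,
   so it emits whatever access sequence the target needs. */
void record_to_wire(const struct record *in, struct record_wire *out)
{
    out->kind   = in->kind;
    out->length = in->length;
    out->flags  = in->flags;
}

void record_from_wire(const struct record_wire *in, struct record *out)
{
    out->kind   = in->kind;
    out->length = in->length;
    out->flags  = in->flags;
}
```

The cost of unaligned access is then paid once per record at the boundary instead of on every field access during processing.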
There are certain machine code instructions that operate on 32-bit or 64-bit (or even wider) values but expect the data to be aligned on memory addresses. If it is not, they have to do more than one read/write cycle on memory to perform their task.
How big that performance hit is depends heavily on what you are doing with the data. If you build large arrays of structs and perform extensive calculations on them, it might become big. But if you only store data once just to read it back at some other time, converting it to a byte stream anyway, then it might be barely noticeable.
On some platforms such as the ARM Cortex-M0, the 16-bit load/store instructions will fail if used on an odd address, and the 32-bit instructions will fail if used on addresses that are not multiples of four. Loading or storing a 16-bit object from/to an address which might be odd will require using three instructions rather than one; for a 32-bit object, seven instructions would be required.
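The multi-instruction sequences mentioned above amount to assembling the value from individual byte loads. A rough C equivalent is sketched below. It assumes the bytes are stored in little-endian order, and the function names are hypothetical.

```c
#include <stdint.h>

/* Load a 16-bit little-endian value from an address that may be odd,
   using only byte loads, which have no alignment requirement. */
uint16_t load_u16_le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Same idea for 32 bits: four byte loads combined with shifts and ORs. */
uint32_t load_u32_le(const unsigned char *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The exact instruction count a compiler produces for these depends on the target and optimization level, so the figures quoted above should be checked against the generated assembly.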
On clang or gcc, taking the address of a packed structure member will yield a pointer that will often be unusable for purposes of accessing that member. On the more useful Keil compiler, taking the address of a __packed structure member will yield a __packed qualified pointer which can only be stored in pointer objects that are qualified likewise. Accesses made via such pointers will use the multi-instruction sequence necessary to support unaligned accesses.
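On gcc and clang, a common way around the pointer problem is to avoid forming a plain pointer to the member at all. You can either read the member through the struct or copy its bytes out with memcpy. A sketch follows, with a hypothetical packed_msg type.

```c
#include <stddef.h>   /* offsetof */
#include <stdint.h>
#include <string.h>   /* memcpy */

#pragma pack(push, 1)
struct packed_msg {   /* hypothetical packed record */
    uint8_t  tag;
    uint32_t value;   /* sits at offset 1, so it is misaligned */
};
#pragma pack(pop)

/* Risky pattern, shown only as a comment:
     uint32_t *p = &m->value;   // plain pointer loses the "packed" information
     use(*p);                   // may fault on strict-alignment targets
   Recent gcc and clang versions warn about it (-Waddress-of-packed-member). */

/* Safe: access through the struct, so the compiler knows about the packing. */
uint32_t read_value(const struct packed_msg *m)
{
    return m->value;
}

/* Also safe: copy the bytes into an aligned local variable. */
uint32_t read_value_via_bytes(const struct packed_msg *m)
{
    uint32_t v;
    memcpy(&v, (const unsigned char *)m + offsetof(struct packed_msg, value), sizeof v);
    return v;
}
```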