Are there performance issues when using pragma pack(1)?

Posted on 2024-12-10 20:43:40

Our headers use #pragma pack(1) around most of our structs (used for net and file I/O). I understand that it changes the alignment of structs from the default of 8 bytes, to an alignment of 1 byte.

Assuming that everything is run in 32-bit Linux (perhaps Windows too), is there any performance hit that comes from this packing alignment?

I'm not concerned about portability for libraries, but more with compatibility of file and network I/O with different #pragma packs, and performance issues.

Comments (8)

半世蒼涼 2024-12-17 20:43:40

Memory access is fastest when it can take place at word-aligned memory addresses. The simplest example is the following struct (which @Didier also used):

struct sample {
   char a;
   int b;
};

By default, GCC inserts padding, so a is at offset 0, and b is at offset 4 (word-aligned). Without padding, b isn't word-aligned, and access is slower.
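
The offsets described above can be checked directly. This is only a minimal sketch: it assumes a compiler that accepts #pragma pack (GCC, Clang and MSVC do) and a C11 compiler for static_assert. The name sample_packed and the two helper functions are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Default layout: the compiler may pad after 'a' so that 'b' is aligned. */
struct sample {
    char a;
    int  b;
};

/* Same members, packed to 1-byte alignment (hypothetical name). */
#pragma pack(push, 1)
struct sample_packed {
    char a;
    int  b;
};
#pragma pack(pop)

/* Packing removes the padding, so 'b' directly follows the 1-byte 'a'. */
static_assert(offsetof(struct sample_packed, b) == 1, "b follows a");
static_assert(sizeof(struct sample_packed) == 1 + sizeof(int), "no padding");

/* Offset of 'b' in each layout, for printing or comparing at run time. */
size_t padded_offset_of_b(void) { return offsetof(struct sample, b); }
size_t packed_offset_of_b(void) { return offsetof(struct sample_packed, b); }
```

On a typical 32-bit or 64-bit x86 build, padded_offset_of_b() is expected to return 4, but the exact value depends on the platform ABI.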

How much slower?

  • For 32-bit x86, according to the Intel 64 and IA32 Architectures Software Developer's Manual:

    The processor requires two memory
    accesses to make an unaligned memory access; aligned accesses require only one
    memory access. A word or doubleword operand that crosses a 4-byte boundary or a
    quadword operand that crosses an 8-byte boundary is considered unaligned and
    requires two separate memory bus cycles for access.

    As with most performance questions, you'd have to benchmark your application to see how much of an issue this is in practice.

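  • According to Wikipedia, x86 extensions like SSE2 have stricter requirements: the aligned load/store instructions (for example MOVDQA) require 16-byte-aligned memory operands.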
  • Many other architectures require word alignment (and will generate SIGBUS errors if data structures aren't word-aligned).

Regarding portability: I assume that you're using #pragma pack(1) so that you can send structs across the wire and to and from disk without worrying about different compilers or platforms packing structs differently. This is valid, but there are a couple of issues to keep in mind:

  • This does nothing to handle big-endian versus little-endian issues. You can handle these by calling the htons family of functions (htons, htonl, ntohs, ntohl) on the multi-byte integer fields in your structs.
  • In my experience, working with packed, serializable structs in application code isn't a lot of fun. They're very difficult to modify and extend without breaking backwards compatibility, and as already noted, there are performance penalties. Consider transferring your packed, serializable structs' contents into equivalent non-packed, extensible structs for processing, or consider using a full-fledged serialization library like Protocol Buffers (which has C bindings).
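
One way to follow both suggestions at once is to keep an ordinary, unpacked struct in memory and serialize it field by field. The sketch below is an illustration, not a prescribed format: the record type, its fields and the 7-byte wire layout are all hypothetical, and htonl/htons come from the POSIX header <arpa/inet.h>.

```c
#include <arpa/inet.h>  /* htonl, htons, ntohl, ntohs (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory record: naturally aligned, never sent as-is. */
struct record {
    uint8_t  kind;
    uint32_t id;
    uint16_t length;
};

/* Assumed wire layout: kind (1 byte), id (4 bytes), length (2 bytes). */
enum { RECORD_WIRE_SIZE = 1 + 4 + 2 };

/* Write the record into a byte buffer in network (big-endian) order. */
void record_to_wire(const struct record *r, unsigned char out[RECORD_WIRE_SIZE])
{
    uint32_t id;
    uint16_t length;
    assert(r != NULL && out != NULL);
    id = htonl(r->id);
    length = htons(r->length);
    out[0] = r->kind;
    memcpy(out + 1, &id, sizeof id);         /* memcpy has no alignment needs */
    memcpy(out + 5, &length, sizeof length);
}

/* Read the record back, converting to host byte order. */
void record_from_wire(const unsigned char in[RECORD_WIRE_SIZE], struct record *r)
{
    uint32_t id;
    uint16_t length;
    assert(in != NULL && r != NULL);
    r->kind = in[0];
    memcpy(&id, in + 1, sizeof id);
    memcpy(&length, in + 5, sizeof length);
    r->id = ntohl(id);
    r->length = ntohs(length);
}
```

Because the wire layout is defined by explicit byte offsets rather than by the compiler, it stays the same regardless of the packing settings or endianness of either side.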
ˇ宁静的妩媚 2024-12-17 20:43:40

Yes. There absolutely are.

For instance, if you define a struct:

struct dumb {
    char c;
    int  i;
};

then, with #pragma pack(1) in effect, every access to the member i is slowed down, because the 32-bit value i is no longer stored at a naturally aligned address. To keep it simple, imagine that the CPU has to fetch 3 bytes from one memory word and then 1 more byte from the next in order to load the value into a register.

写给空气的情书 2024-12-17 20:43:40

When you declare a struct, most compilers insert padding bytes between members to ensure that they are aligned to appropriate addresses in memory (usually each member is placed at an offset that is a multiple of its type's size). This lets the compiler generate optimized code for accessing these members.

#pragma pack(1) instructs the compiler to pack structure members with a particular alignment. The 1 here tells the compiler not to insert any padding between members.

So yes, there is a definite performance penalty, since you force the compiler to do something other than what it would naturally do for performance. Also, some platforms require objects to be aligned at specific boundaries, and using unaligned structures might give you segmentation faults.

Ideally, it is best to avoid changing the default natural alignment rules. But if the #pragma pack directive cannot be avoided at all (as in your case), then the original packing scheme must be restored after the definition of the structures that require tight packing.

For example:

//push current alignment rules to internal stack and force 1-byte alignment boundary
#pragma pack(push,1)  

/*   definition of structures that require tight packing go in here   */

//restore original alignment rules from stack    
#pragma pack(pop)
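
To make the effect of the push/pop pair concrete, here is a small sketch with one struct inside the packed region and one after it. Both struct names and their fields are hypothetical, and static_assert and _Alignof assume a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)
struct wire_header {      /* hypothetical on-disk / on-wire layout */
    uint8_t  version;
    uint32_t length;
    uint16_t checksum;
};
#pragma pack(pop)

/* Declared after the pop, so the default alignment rules apply again. */
struct app_state {
    uint8_t  flag;
    uint32_t counter;
};

/* Inside the packed region: 1 + 4 + 2 bytes, no padding. */
static_assert(sizeof(struct wire_header) == 7, "packed header is 7 bytes");
static_assert(offsetof(struct wire_header, length) == 1, "no padding");

/* Outside it: 'counter' sits at its natural alignment again.
   If the pop were missing, it would be packed too. */
static_assert(offsetof(struct app_state, counter) == _Alignof(uint32_t),
              "default alignment restored");
```

If these assertions are placed in a header, a forgotten pop that leaks into later declarations is caught at compile time rather than at run time.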
怎樣才叫好 2024-12-17 20:43:40

It depends on the underlying architecture and the way it handles unaligned addresses.

x86 handles unaligned addresses gracefully, although at a performance cost, while other architectures such as ARM may invoke an alignment fault (SIGBUS), or even "round" the misaligned address to the closest boundary, in which case your code will fail in a hideous way.

Bottom line is, pack it only if you are sure that the underlying architecture will handle unaligned addresses, and if the cost of network I/O is higher than the processing cost.
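
If the code has to run on architectures that fault on misaligned addresses, a common way to stay safe is to copy through memcpy instead of dereferencing a possibly misaligned pointer. The helper names load_u32 and store_u32 below are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value stored at any byte address, aligned or not.
   memcpy makes no alignment assumption; compilers commonly reduce it
   to a single load on targets that permit unaligned access. */
uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 32-bit value to any byte address, aligned or not. */
void store_u32(unsigned char *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```

For example, store_u32(buf + 1, x) followed by load_u32(buf + 1) round-trips x even though buf + 1 is an odd address. Byte order is left untouched here.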

冷月断魂刀 2024-12-17 20:43:40

Are there performance issues when using pragma pack(1)?

Absolutely. In January 2020, Microsoft's Raymond Chen posted concrete examples of how using #pragma pack(1) can produce bloated executables that take many, many more instructions to perform operations on packed structures, especially on non-x86 hardware that doesn't directly support misaligned accesses.

Anybody who writes #pragma pack(1) may as well just wear a sign on their forehead that says “I hate RISC”

When you use #pragma pack(1), this changes the default structure packing to byte packing, removing all padding bytes normally inserted to preserve alignment.

...

The possibility that any P structure could be misaligned has significant consequences for code generation, because all accesses to members must handle the case that the address is not properly aligned.

void UpdateS(S* s)
{
 s->total = s->a + s->b;
}

void UpdateP(P* p)
{
 p->total = p->a + p->b;
}

Despite the structures S and P having exactly the same layout, the
code generation is different because of the alignment.

UpdateS                       UpdateP
Intel Itanium

adds  r31 = r32, 4            adds  r31 = r32, 4
adds  r30 = r32, 8 ;;         adds  r30 = r32, 8 ;;
ld4   r31 = [r31]             ld1   r29 = [r31], 1
ld4   r30 = [r30] ;;          ld1   r28 = [r30], 1 ;;
                              ld1   r27 = [r31], 1
                              ld1   r26 = [r30], 1 ;;
                              dep   r29 = r27, r29, 8, 8
                              dep   r28 = r26, r28, 8, 8
                              ld1   r25 = [r31], 1
                              ld1   r24 = [r30], 1 ;;
                              dep   r29 = r25, r29, 16, 8
                              dep   r28 = r24, r28, 16, 8
                              ld1   r27 = [r31]
                              ld1   r26 = [r30] ;;
                              dep   r29 = r27, r29, 24, 8
                              dep   r28 = r26, r28, 24, 8 ;;
add   r31 = r30, r31 ;;       add   r31 = r28, r29 ;;
st4   [r32] = r31             st1   [r32] = r31
                              adds  r30 = r32, 1
                              adds  r29 = r32, 2 
                              extr  r28 = r31, 8, 8
                              extr  r27 = r31, 16, 8 ;;
                              st1   [r30] = r28
                              st1   [r29] = r27, 1
                              extr  r26 = r31, 24, 8 ;;
                              st1   [r29] = r26
br.ret.sptk.many rp           br.ret.sptk.many rp

...
[examples from other hardware]
...

Observe that for some RISC processors, the code size explosion is quite significant. This may in turn affect inlining decisions.

Moral of the story: Don’t apply #pragma pack(1) to structures unless absolutely necessary. It bloats your code and inhibits optimizations.

#pragma pack(1) and its variations are also subtly dangerous - even on x86 systems where they supposedly "work"
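
The definitions of S and P are elided in the excerpt above. The sketch below is a guess that matches the offsets visible in the listing (total at 0, a at 4, b at 8), and it is not necessarily what the quoted article uses. It shows how two structs can share a layout and still differ in alignment. static_assert and _Alignof assume a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed definitions, inferred from the offsets in the listing. */
struct S {
    int32_t total;
    int32_t a;
    int32_t b;
};

#pragma pack(push, 1)
struct P {
    int32_t total;
    int32_t a;
    int32_t b;
};
#pragma pack(pop)

/* Identical size and member offsets... */
static_assert(sizeof(struct S) == sizeof(struct P), "same size");
static_assert(offsetof(struct S, a) == offsetof(struct P, a), "same offset of a");
static_assert(offsetof(struct S, b) == offsetof(struct P, b), "same offset of b");

/* ...but a P may start at any byte address, so the compiler has to
   assume that every member of a P can be misaligned. */
static_assert(_Alignof(struct P) == 1, "packed struct has alignment 1");

void UpdateS(struct S *s) { s->total = s->a + s->b; }
void UpdateP(struct P *p) { p->total = p->a + p->b; }
```

Compiling both functions for a strict-alignment target and comparing the generated assembly is a direct way to see the difference on your own toolchain.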

静若繁花 2024-12-17 20:43:40

Technically, yes, it would affect performance, but only with regard to internal processing. If you need the structures packed for network/file I/O, there's a balance to strike between the packing requirement and internal processing. By internal processing, I mean the work you do on the data between I/O operations. If you do very little processing, you won't lose much in terms of performance. Otherwise, you may wish to do internal processing on properly aligned structures and only "pack" the results when doing I/O. Or you could switch to using only default-aligned structures, but you'll need to ensure everyone aligns them the same way (network and file clients).
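
A sketch of the "process aligned, pack only for I/O" approach follows. The two struct types and the pack_sample/unpack_sample helpers are hypothetical, and byte-order conversion is deliberately left out to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#pragma pack(push, 1)
struct wire_sample {          /* layout used only at the I/O boundary */
    uint8_t  channel;
    uint32_t timestamp;
    uint16_t value;
};
#pragma pack(pop)

struct sample_native {        /* layout used for all internal processing */
    uint32_t timestamp;
    uint16_t value;
    uint8_t  channel;
};

/* Copy one packed sample out of an I/O buffer into the native form. */
void unpack_sample(const unsigned char *buf, struct sample_native *out)
{
    struct wire_sample w;
    assert(buf != NULL && out != NULL);
    memcpy(&w, buf, sizeof w);
    out->timestamp = w.timestamp;
    out->value     = w.value;
    out->channel   = w.channel;
}

/* Copy a native sample into an I/O buffer in the packed form. */
void pack_sample(const struct sample_native *in, unsigned char *buf)
{
    struct wire_sample w;
    assert(in != NULL && buf != NULL);
    w.channel   = in->channel;
    w.timestamp = in->timestamp;
    w.value     = in->value;
    memcpy(buf, &w, sizeof w);
}
```

The misaligned accesses are then paid once per record at the boundary, and every later computation works on naturally aligned fields.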

濫情▎り 2024-12-17 20:43:40

There are certain machine code instructions that operate on 32-bit or 64-bit (or even wider) values but expect the data to be aligned in memory. If it is not, they have to perform more than one read/write cycle on memory to do their job.
How big that performance hit is depends heavily on what you are doing with the data. If you build large arrays of structs and perform extensive calculations on them, it might become big. But if you only store data once, just to read it back at some other time and convert it to a byte stream anyway, then it might be barely noticeable.

月亮是我掰弯的 2024-12-17 20:43:40

On some platforms such as the ARM Cortex-M0, the 16-bit load/store instructions will fail if used on an odd address, and the 32-bit instructions will fail if used on addresses that are not multiples of four. Loading or storing a 16-bit object from/to an address which might be odd will require three instructions rather than one; for a 32-bit object, seven instructions would be required.

On clang or gcc, taking the address of a packed structure member will yield a pointer that will often be unusable for purposes of accessing that member. On the more useful Keil compiler, taking the address of a __packed structure member will yield a __packed qualified pointer which can only be stored in pointer objects that are qualified likewise. Accesses made via such pointers will use the multi-instruction sequence necessary to support unaligned accesses.
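
With gcc or clang, one way around the pointer problem is to access the member by name, so the compiler knows it is packed, and to hand only a local copy to code that expects an ordinary pointer. The packet type and both functions below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)
struct packet {
    uint8_t  tag;
    uint32_t value;   /* at offset 1, so usually misaligned */
};
#pragma pack(pop)

/* Stand-in for an API that expects an ordinary, aligned pointer. */
void increment(uint32_t *n) { ++*n; }

void increment_packet_value(struct packet *pkt)
{
    uint32_t tmp;
    assert(pkt != NULL);
    /* Avoided on purpose: increment(&pkt->value);
       the resulting uint32_t* no longer carries the "packed" information. */
    tmp = pkt->value;   /* access by name: the compiler handles misalignment */
    increment(&tmp);    /* the callee only ever sees an aligned local */
    pkt->value = tmp;   /* write the result back by name */
}
```

Recent versions of gcc and clang can warn about the avoided pattern via -Waddress-of-packed-member.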
