Why is data structure alignment important for performance?

Posted 2024-08-17 04:57:02

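Can someone give me a short and plausible explanation of why the compiler adds padding to data structures in order to align their members? I know it is done so that the CPU can access the data more efficiently, but I don't understand why that is so.

And if this is only CPU related, why is a double 4-byte aligned on Linux and 8-byte aligned on Windows?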

我的鱼塘能养鲲 2024-08-24 04:57:02

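Alignment helps the CPU fetch data from memory efficiently: fewer cache misses/flushes, fewer bus transactions, and so on.

Some memory types (e.g. RDRAM, DRAM) need to be accessed in a structured manner (aligned "words", and in "burst transactions", i.e. many words at one time) in order to yield efficient results. This is due to many things, among which are:

Setup time: the time it takes the memory device to access a memory location
Bus arbitration overhead: many devices might want access to the memory device

"Padding" is used to correct the alignment of data structures so as to optimize transfer efficiency.

In other words, accessing a "misaligned" structure yields lower overall performance. A good example of such a pitfall: suppose a data structure is misaligned and the CPU/memory controller needs 2 bus transactions (instead of 1) to fetch it. Performance is lower as a result.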
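To make the padding concrete, here is a minimal sketch in C. The struct `sample` and the two helper functions are made up for illustration; the exact numbers depend on the compiler and ABI, so the code only asserts what alignment guarantees.

```c
#include <assert.h>
#include <stddef.h>  /* offsetof, size_t */

/* Hypothetical example struct: two 1-byte fields around an int. */
struct sample {
    char tag;    /* offset 0 */
    int  value;  /* padding is inserted before this so it is int-aligned */
    char flag;   /* followed by tail padding so arrays of sample stay aligned */
};

/* The compiler must place value on an address that suits an int. */
static_assert(offsetof(struct sample, value) % _Alignof(int) == 0,
              "value must be aligned for int");

/* Padding bytes inserted between tag and value. */
size_t padding_before_value(void) {
    return offsetof(struct sample, value) - sizeof(char);
}

/* Total padding: struct size minus the bytes the fields themselves need. */
size_t total_padding(void) {
    return sizeof(struct sample) - (sizeof(char) + sizeof(int) + sizeof(char));
}
```

On a typical ABI with a 4-byte, 4-byte-aligned `int`, `padding_before_value()` returns 3 and `total_padding()` returns 6, so the 6 bytes of fields occupy a 12-byte struct.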

薯片软お妹 2024-08-24 04:57:02

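The CPU fetches data from memory in groups of 4 bytes (this actually depends on the hardware: for some hardware it is 8 or another value, but let's stick with 4 for simplicity).
If the data starts at an address divisible by 4, all is well: the CPU goes to that memory address and loads the data.

Now suppose the data starts at an address that is not divisible by 4, say address 1 for simplicity. The CPU must fetch the word at address 0 and then shift out the byte at address 0 to get at the actual data, which begins at byte 1. A 4-byte value starting there also spills into the next word, so a second fetch is needed as well. This takes time and therefore lowers performance, so it is much more efficient to keep all data addresses aligned.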
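The extra work can be sketched with a toy model in C. This is only an illustration under stated assumptions: `toy_memory`, `fetch_word` and `load_u32` are hypothetical names, the "memory" can only be read one aligned 4-byte word at a time, and values are little-endian. Real hardware does the equivalent in the load unit or memory controller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy memory that can only be read one aligned 4-byte word at a time. */
typedef struct {
    const uint32_t *words;   /* word i covers byte addresses 4*i .. 4*i+3 */
    size_t          nwords;
    unsigned        fetches; /* counts aligned word reads performed */
} toy_memory;

static uint32_t fetch_word(toy_memory *m, size_t index) {
    assert(index < m->nwords);
    m->fetches++;
    return m->words[index];
}

/* Read a 4-byte little-endian value starting at any byte address. */
uint32_t load_u32(toy_memory *m, size_t addr) {
    size_t   index = addr / 4;
    unsigned shift = (unsigned)(addr % 4) * 8;
    uint32_t lo = fetch_word(m, index);
    if (shift == 0)
        return lo;                        /* aligned: a single fetch */
    uint32_t hi = fetch_word(m, index + 1);
    /* Misaligned: drop the unwanted low bytes and splice in the next word. */
    return (lo >> shift) | (hi << (32 - shift));
}
```

In this model a load from address 0 costs one fetch, while a load from address 1 costs two fetches plus the shifting and merging.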

转身泪倾城 2024-08-24 04:57:02

A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.

Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.

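Consider a loop that reads 4 bytes from a new pseudo-random address on each iteration:

myrandomreader:
  ; ...
  ; ten instructions to generate next pseudo-random
  ; address in ESI from previous address
  ; ...
  MOV EAX, DS:[ESI]   ; X: 4-byte read that may straddle a cache line
  LOOP myrandomreader ; decrement ECX and repeat while it is non-zero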

For a memory read that straddles two cache lines:

(for L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2->L1 into the processor before it can request the second cache line, causing a short execution stall

(for L2 cache miss) the processor must wait for two burst reads from L3 cache (if present) or main memory to complete rather than one
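Whether a given read straddles a line is simple address arithmetic. The sketch below is an illustration, not taken from any particular processor manual; `straddles_line` and `straddling_offsets` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if an access of `size` bytes starting at `addr` touches more than
   one cache line of `line` bytes. */
bool straddles_line(uintptr_t addr, size_t size, size_t line) {
    assert(size >= 1 && line >= 1);
    uintptr_t first = addr / line;               /* line holding the first byte */
    uintptr_t last  = (addr + size - 1) / line;  /* line holding the last byte */
    return first != last;
}

/* Number of start offsets within one line (0 .. line-1) at which an access
   of `size` bytes spills into the next line. */
size_t straddling_offsets(size_t size, size_t line) {
    size_t count = 0;
    for (size_t off = 0; off < line; off++)
        if (straddles_line(off, size, line))
            count++;
    return count;
}
```

For a 4-byte read, 3 start offsets per line straddle regardless of line size, so a uniformly random byte address straddles with probability 3/64 (about 4.7%), 3/32 (about 9.4%) or 3/16 (about 18.8%), which matches the rounded 5%, 10% and 20% figures in this answer.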

Processor stalls

A random 4-byte read straddles a cache-line boundary about 5% of the time for 64-byte cache lines (3 of the 64 possible start offsets), about 10% for 32-byte lines and about 20% for 16-byte lines.
Some instructions may incur additional execution overhead on misaligned data even when it lies within a single cache line. Intel's website discusses this for some SSE instructions.
If you are defining the structures yourself, it may make sense to list all the fields smaller than 32 bits together in the struct so that padding overhead is reduced, or alternatively to review whether it is better to turn packing on or off for a particular structure.
On MIPS and many other platforms you don't get the choice and must align: a misaligned access raises a kernel exception.
Alignment may also matter especially to you if you are doing I/O on the bus, using atomic operations such as atomic increment/decrement, or if you want to be able to port your code to non-Intel platforms.
In Intel-only (!) code, a common practice is to define one set of packed structures for network and disk and another, padded set for in-memory use, with routines to convert data between the two formats (also consider "endianness" for the disk and network formats).
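As an illustration of the advice on grouping small fields, the sketch below compares two orderings of the same fields. Both structs and `bytes_saved` are hypothetical, and the sizes in the comments assume a typical ABI where each fixed-width integer is aligned to its own size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small fields scattered between 32-bit ones: padding after a, b and c. */
struct scattered {
    uint8_t  a;
    uint32_t x;
    uint8_t  b;
    uint32_t y;
    uint16_t c;
};  /* typically 20 bytes */

/* Same fields, largest first, with the sub-32-bit fields grouped together. */
struct grouped {
    uint32_t x;
    uint32_t y;
    uint16_t c;
    uint8_t  a;
    uint8_t  b;
};  /* typically 12 bytes */

/* Grouping should never make the struct larger under natural alignment. */
static_assert(sizeof(struct grouped) <= sizeof(struct scattered),
              "grouping small fields must not increase the size");

/* Bytes saved per instance by reordering the fields. */
size_t bytes_saved(void) {
    return sizeof(struct scattered) - sizeof(struct grouped);
}
```

Reordering changes the memory layout, so it is only safe for structures whose layout is not dictated by a file format, a wire protocol or another program.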
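The packed-versus-padded practice might look roughly like the following. This is a sketch under assumptions: the record layout, the field names and the choice of big-endian for the external format are all invented for illustration, and `#pragma pack` is a compiler extension (GCC, Clang and MSVC accept it) rather than standard C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* External (disk/network) layout: no padding, multi-byte fields big-endian. */
#pragma pack(push, 1)
struct wire_record {
    uint8_t  kind;
    uint32_t length_be;  /* raw big-endian bytes */
    uint16_t flags_be;   /* raw big-endian bytes */
};
#pragma pack(pop)

static_assert(sizeof(struct wire_record) == 7, "wire layout must have no padding");

/* In-memory layout: natural alignment, host byte order. */
struct record {
    uint32_t length;
    uint16_t flags;
    uint8_t  kind;
};

/* Interpret the bytes of raw as big-endian, whatever the host byte order. */
static uint32_t load_be32(uint32_t raw) {
    unsigned char b[4];
    memcpy(b, &raw, sizeof b);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

static uint16_t load_be16(uint16_t raw) {
    unsigned char b[2];
    memcpy(b, &raw, sizeof b);
    return (uint16_t)((b[0] << 8) | b[1]);
}

/* Produce a value whose in-memory bytes are v in big-endian order. */
static uint32_t store_be32(uint32_t v) {
    unsigned char b[4] = { (unsigned char)(v >> 24), (unsigned char)(v >> 16),
                           (unsigned char)(v >> 8),  (unsigned char)v };
    uint32_t raw;
    memcpy(&raw, b, sizeof raw);
    return raw;
}

static uint16_t store_be16(uint16_t v) {
    unsigned char b[2] = { (unsigned char)(v >> 8), (unsigned char)v };
    uint16_t raw;
    memcpy(&raw, b, sizeof raw);
    return raw;
}

/* Convert the packed external form into the padded in-memory form. */
struct record unpack_record(const struct wire_record *w) {
    struct record r;
    r.length = load_be32(w->length_be);
    r.flags  = load_be16(w->flags_be);
    r.kind   = w->kind;
    return r;
}

/* Convert the padded in-memory form back into the packed external form. */
struct wire_record pack_record(const struct record *r) {
    struct wire_record w;
    w.kind      = r->kind;
    w.length_be = store_be32(r->length);
    w.flags_be  = store_be16(r->flags);
    return w;
}
```

Accessing the packed fields by value lets the compiler generate whatever unaligned-safe loads the target needs, while the rest of the program works only with the aligned `struct record`.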
