Why does data structure alignment matter for performance?

Posted on 2024-08-17 04:57:02

Can someone give me a short and plausible explanation of why the compiler adds padding to data structures in order to align their members? I know that it is done so that the CPU can access the data more efficiently, but I don't understand why this is so.

And if this is only CPU-related, why is a double 4-byte aligned on Linux and 8-byte aligned on Windows?

Comments (4)

我的鱼塘能养鲲 2024-08-24 04:57:02

Alignment helps the CPU fetch data from memory efficiently: fewer cache misses and flushes, fewer bus transactions, and so on.

Some memory types (e.g. RDRAM, DRAM) need to be accessed in a structured manner (aligned "words", in "burst transactions" of many words at a time) in order to yield efficient results. This is due to several factors, including:

  1. setup time: the time it takes the memory device to access a memory location
  2. bus arbitration overhead: many devices may want access to the memory device at the same time

"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.

In other words, accessing a misaligned structure yields lower overall performance. A good example of this pitfall: if a data structure is misaligned so that the CPU/memory controller needs 2 bus transactions (instead of 1) to fetch it, performance is correspondingly lower.
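To make the padding visible, here is a minimal C sketch. The struct `Sample` and the two helper functions are hypothetical names invented for this illustration; the code only relies on the standard `offsetof` macro and the C11 `_Alignof` operator.

```c
#include <stddef.h>   /* offsetof, size_t */

/* Hypothetical record: a 1-byte tag, a double, and another 1-byte flag. */
struct Sample {
    char   tag;     /* 1 byte; the compiler normally pads after it */
    double value;   /* must start at a multiple of its alignment */
    char   flag;    /* 1 byte; tail padding normally follows */
};

/* Bytes of padding the compiler placed between `tag` and `value`. */
size_t padding_before_value(void) {
    return offsetof(struct Sample, value) - sizeof(char);
}

/* Nonzero when `value` starts at a multiple of the alignment of double. */
int value_is_aligned(void) {
    return offsetof(struct Sample, value) % _Alignof(double) == 0;
}
```

Printing `offsetof(struct Sample, value)` and `sizeof(struct Sample)` on different targets shows that the numbers come from the platform ABI rather than from the CPU alone. As far as I know, the 32-bit x86 System V ABI used on Linux only requires 4-byte alignment for a double inside a struct, while Microsoft's compilers align it to 8 bytes, which would explain the difference the question asks about.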

薯片软お妹 2024-08-24 04:57:02

The CPU fetches data from memory in groups of 4 bytes (it actually depends on the hardware; it is 8 or some other value on some hardware, but let's stick with 4 to keep it simple). All is well if the data begins at an address that is divisible by 4: the CPU goes to that memory address and loads the data.

Now suppose the data begins at an address that is not divisible by 4, say, for the sake of simplicity, at address 1. The CPU must fetch the group at address 0 and then do extra work to discard the byte at address 0 in order to get at the actual data starting at byte 1 (a 4-byte value at address 1 also spills into the next group, which has to be fetched as well). This takes time and therefore lowers performance, so it is much more efficient to have all data addresses aligned.
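The simplified model above can be sketched in C by counting how many aligned groups an access overlaps. The function name `words_touched` and its parameters are assumptions made for this illustration, and real hardware is more complicated than this arithmetic suggests.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of aligned `word_size`-byte groups overlapped by an access of
   `size` bytes starting at byte address `addr`.
   Assumes size > 0 and word_size > 0. */
size_t words_touched(uintptr_t addr, size_t size, size_t word_size) {
    uintptr_t first = addr / word_size;               /* group holding the first byte */
    uintptr_t last  = (addr + size - 1) / word_size;  /* group holding the last byte */
    return (size_t)(last - first + 1);
}
```

With 4-byte groups, a 4-byte value at address 0 or 4 touches one group, while the same value at address 1 touches two, so the CPU has roughly twice the fetching to do.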

转身泪倾城 2024-08-24 04:57:02

A cache line is the basic unit of caching. Typically it is 16-64 bytes or more.

Pentium 4: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.

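Consider a loop that performs a 4-byte read from a pseudo-random address on each iteration:

myrandomreader:
  ; ...
  ; ten instructions to generate next pseudo-random
  ; address in ESI from previous address
  ; ...
  MOV EAX, DS:[ESI]   ; X: 4-byte read from the address in ESI
  LOOP myrandomreader ; decrement ECX and repeat until it reaches zero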

For a memory read that straddles two cache lines:

(on an L1 cache miss) the processor must wait for the whole of the first cache line to be read from L2 into L1 before it can request the second cache line, causing a short execution stall

(on an L2 cache miss) the processor must wait for two burst reads from the L3 cache (if present) or main memory to complete, rather than one

Either way, the processor stalls.
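Whether a given read straddles two cache lines is simple address arithmetic. The following C sketch uses hypothetical helper names (`straddles_line`, `straddling_offsets`) and assumes byte-granular addresses chosen uniformly at random.

```c
#include <stddef.h>
#include <stdint.h>

/* 1 if an access of `size` bytes at `addr` crosses a cache line boundary,
   0 otherwise. Assumes size > 0 and line_size > 0. */
int straddles_line(uintptr_t addr, size_t size, size_t line_size) {
    return addr / line_size != (addr + size - 1) / line_size;
}

/* Counts how many starting offsets within one cache line make an access
   of `size` bytes spill into the next line. */
size_t straddling_offsets(size_t size, size_t line_size) {
    size_t count = 0;
    for (size_t off = 0; off < line_size; ++off) {
        if (straddles_line(off, size, line_size)) {
            ++count;
        }
    }
    return count;
}
```

For a 4-byte read, 3 starting offsets per line straddle the boundary regardless of line size, which gives 3/64 (about 5%), 3/32 (about 10%) and 3/16 (about 20%) for 64-, 32- and 16-byte lines respectively.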

  • A random 4-byte read will straddle a cache line boundary about 5% of the time for 64-byte cache lines, 10% for 32-byte lines and 20% for 16-byte lines.

  • Some instructions may incur additional execution overhead on misaligned data even when it lies within a single cache line. The Intel website discusses this for some SSE instructions.

  • If you are defining the structures yourself, it may make sense to group all the fields smaller than 32 bits together in the struct so that padding overhead is reduced, or alternatively to review whether it is better to turn packing on or off for a particular structure.

  • On MIPS and many other platforms you don't get the choice and must align: you get a kernel exception if you don't!

  • Alignment may also matter especially if you are doing I/O on the bus, using atomic operations such as atomic increment/decrement, or if you want to be able to port your code to non-Intel platforms.

  • In Intel-only (!) code, a common practice is to define one set of packed structures for network and disk and another, padded set for in-memory use, with routines to convert data between the two formats (also consider "endianness" for the disk and network formats).
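The advice above about grouping small fields can be illustrated with a short C sketch. Both structs and the helper `padding_saved` are hypothetical and exist only for this example; the exact sizes depend on the compiler and ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record with small fields scattered between 32-bit ones. */
struct Scattered {
    uint8_t  a;
    uint32_t x;   /* padding usually precedes this field */
    uint8_t  b;
    uint32_t y;   /* and this one */
    uint16_t c;   /* tail padding usually follows */
};

/* The same fields, with the sub-32-bit members grouped at the end. */
struct Grouped {
    uint32_t x;
    uint32_t y;
    uint16_t c;
    uint8_t  a;
    uint8_t  b;
};

/* Bytes saved per record by grouping the small fields. */
size_t padding_saved(void) {
    return sizeof(struct Scattered) - sizeof(struct Grouped);
}
```

On common ABIs where `uint32_t` is 4-byte aligned, I would expect `struct Scattered` to take 20 bytes and `struct Grouped` 12, although this is worth checking with `sizeof` on the target in question.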

白馒头 2024-08-24 04:57:02

In addition to jldupont's answer, some architectures have load and store instructions (those used to read from and write to memory) that only operate on word-aligned boundaries. Loading a non-aligned word from memory would therefore take two load instructions, a shift instruction and then a mask instruction: much less efficient!
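The extra work can be sketched in C as a software emulation. This is an illustrative model only: the function name `load_unaligned_u32` is made up, memory is modelled as an array of 32-bit words with little-endian byte order, and the combining step uses two shifts and an OR rather than the exact shift-and-mask sequence any particular architecture would use.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads the 32-bit value that starts at byte address `byte_addr` of a
   memory modelled as aligned 32-bit words (little-endian byte order).
   The caller must ensure words[byte_addr / 4 + 1] exists whenever
   byte_addr is not a multiple of 4. */
uint32_t load_unaligned_u32(const uint32_t *words, size_t byte_addr) {
    size_t   index = byte_addr / 4;                  /* first aligned word */
    unsigned shift = (unsigned)(byte_addr % 4) * 8;  /* misalignment in bits */
    uint32_t lo = words[index];                      /* first aligned load */
    if (shift == 0) {
        return lo;                                   /* aligned: one load is enough */
    }
    uint32_t hi = words[index + 1];                  /* second aligned load */
    /* Drop the unwanted low bytes of `lo`, then splice in the low bytes of `hi`. */
    return (lo >> shift) | (hi << (32 - shift));
}
```

An aligned address takes the single-load path, while any other address needs two loads plus the shifting and combining steps, which is the overhead the answer describes.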
