Why is data structure alignment important for performance?
Can someone give me a short and plausible explanation of why the compiler adds padding to data structures in order to align their members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.
And if this is only CPU related, why is a double 4-byte aligned on Linux and 8-byte aligned on Windows?
Comments (4)
Alignment helps the CPU fetch data from memory efficiently: fewer cache misses and flushes, fewer bus transactions, and so on.
Some memory types (e.g. RDRAM, DRAM) need to be accessed in a structured manner (in aligned "words" and in "burst transactions", i.e. many words at a time) in order to yield efficient results. Several hardware factors contribute to this.
"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.
In other words, accessing a misaligned structure yields lower overall performance. A good example of this pitfall: if a data structure is misaligned and the CPU/memory controller needs 2 bus transactions (instead of 1) to fetch it, performance is correspondingly lower.
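To see the padding for yourself, something like the following C sketch may help. The struct and function names are made up for illustration, and the offsets in the comments are what common ABIs produce, not something the C standard guarantees.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical example struct: a 1-byte field followed by an int. */
struct sample {
    char tag;    /* offset 0 */
    int  value;  /* usually placed at offset 4, after 3 padding bytes */
};

/* Number of padding bytes the compiler put between tag and value. */
size_t padding_before_value(void) {
    return offsetof(struct sample, value) - sizeof(char);
}

/* Print the layout the current compiler and ABI actually chose. */
void print_layout(void) {
    printf("sizeof(struct sample) = %zu\n", sizeof(struct sample));
    printf("offsetof(value)       = %zu\n", offsetof(struct sample, value));
    printf("padding before value  = %zu\n", padding_before_value());
}
```

Calling print_layout() from main on your own platform shows whether, and how much, the compiler padded the struct.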
The CPU fetches data from memory in groups of 4 bytes (this actually depends on the hardware; it is 8 or some other value on some hardware, but let's stick with 4 to keep it simple).
All is well if the data begins at an address that is divisible by 4: the CPU goes to the memory address and loads the data.
Now suppose the data begins at an address that is not divisible by 4, say, for the sake of simplicity, at address 1. The CPU must fetch the group starting at address 0 and then do extra work to discard the byte at address 0 in order to get at the actual data starting at byte 1. If the value is 4 bytes long, its last byte lies in the next group, so a second fetch is needed as well. This takes time and therefore lowers performance, so it is much more efficient to have all data addresses aligned.
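The simplified model above can be put into a few lines of C. This is only a sketch of the counting argument, not a description of any real memory controller, and groups_touched is a name invented for this example.

```c
#include <stdint.h>

/* Number of fetch_size-byte groups that an access of len bytes starting
   at addr overlaps. Assumes fetch_size > 0 and len > 0. */
unsigned groups_touched(uintptr_t addr, unsigned len, unsigned fetch_size) {
    uintptr_t first = addr / fetch_size;             /* group holding the first byte */
    uintptr_t last  = (addr + len - 1) / fetch_size; /* group holding the last byte */
    return (unsigned)(last - first + 1);
}
```

In this model a 4-byte value at address 0 or 4 overlaps one group, while the same value at address 1 overlaps two.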
A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.
Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.
For a memory read that straddles two cache lines:
(on an L1 cache miss) the processor must wait for the whole of the first cache line to be read from L2 into L1 before it can request the second cache line, causing a short execution stall;
(on an L2 cache miss) the processor must wait for two burst reads from the L3 cache (if present) or main memory to complete, rather than one.
Processor stalls
A 4-byte read at a random address will straddle a cache line boundary about 5% of the time for 64-byte cache lines, about 10% for 32-byte ones and about 20% for 16-byte ones.
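These figures can be checked with a small counting sketch, assuming the read starts at a uniformly random byte offset within a cache line. The function names are made up for this example.

```c
/* Count the starting offsets within one cache line at which a read of
   read_size bytes would run past the end of the line. */
unsigned straddling_offsets(unsigned line_size, unsigned read_size) {
    unsigned count = 0;
    for (unsigned off = 0; off < line_size; off++) {
        if (off + read_size > line_size) {
            count++;
        }
    }
    return count;
}

/* Fraction of starting offsets that straddle, assuming all offsets are
   equally likely. */
double straddle_fraction(unsigned line_size, unsigned read_size) {
    return (double)straddling_offsets(line_size, read_size) / line_size;
}
```

For a 4-byte read, 3 starting offsets per line straddle, giving 3/64 (about 4.7%), 3/32 (about 9.4%) and 3/16 (18.75%), which matches the rounded figures quoted above.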
Some instructions may incur additional execution overhead on misaligned data even when it lies within a single cache line. Intel's documentation discusses this for some SSE instructions.
If you are defining the structures yourself, it may make sense to group all the fields smaller than 32 bits together in a struct so that padding overhead is reduced, or alternatively to review whether it is better to turn packing on or off for a particular structure. On MIPS and many other platforms you don't get the choice and must align: you get a kernel exception if you don't!
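As a rough illustration of the grouping advice, here are two hypothetical structs with the same fields in a different order. The sizes in the comments are what a typical ABI with 4-byte-aligned uint32_t gives; your compiler may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Small fields interleaved with 32-bit ones: typically 20 bytes. */
struct interleaved {
    uint8_t  a;   /* 1 byte, usually followed by 3 padding bytes */
    uint32_t x;
    uint8_t  b;   /* 1 byte, usually followed by 3 padding bytes */
    uint32_t y;
    uint16_t c;   /* 2 bytes, usually followed by 2 bytes of tail padding */
};

/* Same fields, largest first, small ones grouped: typically 12 bytes. */
struct grouped {
    uint32_t x;
    uint32_t y;
    uint16_t c;
    uint8_t  a;
    uint8_t  b;
};

/* Bytes saved per object by the grouped ordering on this platform. */
size_t bytes_saved(void) {
    return sizeof(struct interleaved) - sizeof(struct grouped);
}
```

Checking sizeof on both structs with your own compiler is the reliable way to see what the reordering buys you.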
Alignment may also matter especially to you if you are doing I/O on the bus, using atomic operations such as atomic increment/decrement, or want to be able to port your code to non-Intel platforms.
In Intel-only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory use, with routines to convert data between these formats (also consider "endianness" for the disk and network formats).
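A conversion routine of this kind might look like the sketch below. Note that it swaps in a different technique from the packed structures mentioned above: it writes the wire format byte by byte into a buffer, which avoids compiler-specific packing directives and handles byte order explicitly. The record layout, the 7-byte wire size and the big-endian choice are all assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical in-memory record: naturally aligned, the compiler may pad it. */
struct record {
    uint8_t  kind;
    uint32_t length;
    uint16_t flags;
};

#define RECORD_WIRE_SIZE 7  /* 1 + 4 + 2 bytes, no padding on the wire */

/* Serialize to the wire format; multi-byte fields are written big-endian. */
void record_to_wire(const struct record *r, uint8_t out[RECORD_WIRE_SIZE]) {
    out[0] = r->kind;
    out[1] = (uint8_t)(r->length >> 24);
    out[2] = (uint8_t)(r->length >> 16);
    out[3] = (uint8_t)(r->length >> 8);
    out[4] = (uint8_t)(r->length);
    out[5] = (uint8_t)(r->flags >> 8);
    out[6] = (uint8_t)(r->flags);
}

/* Rebuild the in-memory record from the 7 wire bytes. */
void record_from_wire(const uint8_t in[RECORD_WIRE_SIZE], struct record *r) {
    r->kind   = in[0];
    r->length = ((uint32_t)in[1] << 24) | ((uint32_t)in[2] << 16) |
                ((uint32_t)in[3] << 8)  |  (uint32_t)in[4];
    r->flags  = (uint16_t)(((uint32_t)in[5] << 8) | (uint32_t)in[6]);
}
```

Because the routines only ever touch single bytes of the buffer, they behave the same whether or not the target CPU tolerates misaligned accesses.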
In addition to jldupont's answer, some architectures have load and store instructions (those used to read from and write to memory) that only operate on word-aligned addresses. On such a machine, loading a non-aligned word from memory takes two load instructions, a shift instruction, and then a mask instruction, which is much less efficient!
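In C terms, the extra work looks roughly like the sketch below. It models memory as an array of aligned 32-bit words with little-endian byte numbering inside each word; that model and the function name are assumptions for illustration. This version combines the two halves with two shifts and an OR, where a real instruction sequence might use a shift and a mask instead.

```c
#include <stdint.h>

/* Read a 32-bit value that starts at byte_offset, using only aligned
   32-bit loads from words. Byte k of the modeled memory is bits
   8*(k%4)..8*(k%4)+7 of words[k/4]. When byte_offset is not a multiple
   of 4, words[byte_offset/4 + 1] must exist. */
uint32_t load_unaligned_le(const uint32_t *words, unsigned byte_offset) {
    unsigned index = byte_offset / 4;
    unsigned shift = (byte_offset % 4) * 8;
    uint32_t lo = words[index];        /* first aligned load */
    if (shift == 0) {
        return lo;                     /* aligned case: one load is enough */
    }
    uint32_t hi = words[index + 1];    /* second aligned load */
    /* Drop the unwanted low bytes of lo, then fill in from hi. */
    return (lo >> shift) | (hi << (32 - shift));
}
```

The aligned case returns after a single load, while every misaligned offset pays for a second load plus the shifting and combining.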