Optimizing member variable order in C++
I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to
"re-order the member variables of a
class into most used and least used."
I'm not familiar with C++, nor with how it compiles, but I was wondering:
- Is this statement accurate?
- How/Why?
- Does it apply to other (compiled/scripting) languages?
I'm aware that the amount of (CPU) time saved by this trick would be minimal; it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.
10 Answers
Two issues here:
- whether and when keeping certain fields together is an optimization at all;
- how to actually do it.
The reason that it might help is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking, the more cache lines loaded for your object, the longer it takes. Also, more other data gets thrown out of the cache to make room, which slows down other code in an unpredictable way.
The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
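As a hypothetical sketch of that grouping (the class and field names are invented for illustration), the commonly-used fields can simply be declared together at the front:

```cpp
struct Particle
{
    // "Hot" fields, touched every frame: grouped at the front so the common
    // operations tend to hit a single cache line.
    float px, py, pz;
    float vx, vy, vz;

    // "Cold" fields, touched rarely: pushed towards the end.
    int  spawnFrame;
    char debugName[32];
};
```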
The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.
Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised because they all lock the same memory location even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
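A hedged sketch of that pitfall, assuming a 64-byte cache line (the real figure is processor-specific): two atomics packed together contend, while padding each out to its own line lets different threads update them in parallel:

```cpp
#include <atomic>
#include <cstdint>

// Both counters share one cache line: concurrent updates from different
// cores are effectively serialised, as described above.
struct SharedLine
{
    std::atomic<std::uint64_t> a;
    std::atomic<std::uint64_t> b;
};

// Each counter is forced onto its own (assumed 64-byte) cache line, so
// threads on different cores can update them without false sharing.
struct SeparateLines
{
    alignas(64) std::atomic<std::uint64_t> a;
    alignas(64) std::atomic<std::uint64_t> b;
};
```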
Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order which fields must appear in an object, based on the order they are declared. This is to make low-level programming easier. So, if your object contains:
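```cpp
// A sketch of the layout described: int, char, int, char, in that order
// (assuming a 4-byte, 4-aligned int).
struct Example
{
    int  i;
    char c;
    int  j;
    char d;
};
```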
then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.
In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
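For comparison, the reordered version mentioned there might look like this (same 4-byte int assumption):

```cpp
struct Reordered
{
    int  i;
    int  j;
    char c;
    char d;
};  // likely 12 bytes: 4 + 4 + 1 + 1, plus 2 bytes of tail padding
```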
I said that whether int is 4-aligned is platform-dependent: on ARM it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.
The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next largest, and so on down to members with no alignment requirement. For example, if I'm trying to write portable code I might come up with something like this (the alignments noted in the comments are guesses, not guarantees):
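```cpp
#include <cstdint>

struct PortableLayout
{
    double         d;      // assume 8-aligned
    std::uint64_t  big;    // assume 8-aligned
    void          *ptr;    // assume pointer-sized alignment (4 or 8)
    float          f;      // assume 4-aligned
    std::uint32_t  i;      // assume 4-aligned
    std::uint16_t  s;      // assume 2-aligned
    char           c;      // no alignment requirement
    char           buf[7]; // no alignment requirement
};
```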
If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.
Depending on the type of program you're running this advice may result in increased performance or it may slow things down drastically.
Doing this in a multi-threaded program means you're going to increase the chances of 'false-sharing'.
Check out Herb Sutter's articles on the subject here
I've said it before and I'll keep saying it. The only real way to get a real performance increase is to measure your code, and use tools to identify the real bottleneck instead of arbitrarily changing stuff in your code base.
It is one of the ways of optimizing the working set size. There is a good article by John Robbins on how you can speed up application performance by optimizing the working set size. Of course it involves careful selection of the most frequent use cases the end user is likely to perform with the application.
We have slightly different guidelines for members here (ARM architecture target, mostly THUMB 16-bit codegen for various reasons): group by alignment, and keep frequently-accessed members near the start of the object.
"Group by alignment" is somewhat obvious, and outside the scope of this question; it avoids padding, uses less memory, etc.
The second guideline, though, derives from the small 5-bit "immediate" field size on the THUMB LDRB (Load Register Byte), LDRH (Load Register Halfword), and LDR (Load Register) instructions.
5 bits means offsets of 0-31 can be encoded. Effectively, assuming "this" is handy in a register (which it usually is), and given that the offset is scaled by the access size:
- 8-bit bytes can be loaded in a single instruction if they sit at this+0 through this+31;
- 16-bit halfwords, at this+0 through this+62;
- 32-bit words, at this+0 through this+124.
If they're outside this range, multiple instructions have to be generated: either a sequence of ADDs with immediates to accumulate the appropriate address in a register, or worse yet, a load from the literal pool at the end of the function.
If we do hit the literal pool, it hurts: the literal pool goes through the d-cache, not the i-cache; this means at least a cacheline worth of loads from main memory for the first literal pool access, and then a host of potential eviction and invalidation issues between the d-cache and i-cache if the literal pool doesn't start on its own cache line (i.e. if the actual code doesn't end at the end of a cache line).
(If I had a few wishes for the compiler we're working with, a way to force literal pools to start on cacheline boundaries would be one of them.)
(Unrelatedly, one of the things we do to avoid literal pool usage is keep all of our "globals" in a single table. This means one literal pool lookup for the "GlobalTable", rather than multiple lookups for each global. If you're really clever you might be able to keep your GlobalTable in some sort of memory that can be accessed without loading a literal pool entry -- was it .sbss?)
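A rough sketch of that idea (all names here are invented for illustration):

```cpp
// Gather the program's globals into a single table so that code only ever
// needs one base address (one literal-pool lookup, or none if the table
// lives in a directly-addressable section) plus a small offset.
struct GlobalTable
{
    int      frameCount;
    float    timeStep;
    void    *currentLevel;
    unsigned flags;
};

extern GlobalTable g_globals;   // the one symbol everything goes through
```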
While locality of reference to improve the cache behavior of data accesses is often a relevant consideration, there are a couple other reasons for controlling layout when optimization is required - particularly in embedded systems, even though the CPUs used on many embedded systems do not even have a cache.
- Memory alignment of the fields in structures
Alignment considerations are pretty well understood by many programmers, so I won't go into too much detail here.
On most CPU architectures, fields in a structure must be accessed at a native alignment for efficiency. This means that if you mix various sized fields the compiler has to add padding between the fields to keep the alignment requirements correct. So to optimize the memory used by a structure it's important to keep this in mind and lay out the fields such that the largest fields are followed by smaller fields to keep the required padding to a minimum. If a structure is to be 'packed' to prevent padding, accessing unaligned fields comes at a high runtime cost, as the compiler has to use a series of accesses to smaller parts of the field, along with shifts and masks, to assemble the field value in a register.
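A hedged sketch of that trade-off (the packed attribute shown is the GCC/Clang spelling; other compilers use #pragma pack):

```cpp
#include <cstdint>

// Natural layout: the compiler inserts 3 bytes of padding after 'tag' so
// that 'value' stays 4-aligned; typically sizeof(Aligned) == 8.
struct Aligned
{
    std::uint8_t  tag;
    std::uint32_t value;
};

// Packed layout: no padding (typically sizeof(Packed) == 5), but 'value'
// may now be unaligned, so the compiler may have to read it in smaller
// pieces and reassemble it in a register.
struct __attribute__((packed)) Packed
{
    std::uint8_t  tag;
    std::uint32_t value;
};
```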
- Offset of frequently used fields in a structure
Another consideration that can be important on many embedded systems is to have frequently accessed fields at the start of a structure.
Some architectures have a limited number of bits available in an instruction to encode an offset to a pointer access, so if you access a field whose offset exceeds that number of bits the compiler will have to use multiple instructions to form a pointer to the field. For example, the ARM's Thumb architecture has 5 bits to encode an offset, so it can access a word-sized field in a single instruction only if the field is within 124 bytes from the start. So if you have a large structure, an optimization that an embedded engineer might want to keep in mind is to place frequently used fields at the beginning of a structure's layout.
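A minimal sketch of that rule (the field names are made up; 124 bytes is the Thumb word-load limit mentioned above):

```cpp
#include <cstddef>   // offsetof

struct Enemy
{
    // Frequently accessed every frame: keep these near offset 0.
    float x, y, z;
    int   health;

    // Rarely accessed: safe to leave at larger offsets.
    char  name[64];
    int   statistics[32];
};

// Optional sanity check that the hot field stays within single-instruction reach.
static_assert(offsetof(Enemy, health) <= 124,
              "hot field drifted out of short-offset range");
```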
Well, the first member doesn't need an offset added to the pointer to access it.
In C#, the order of the member is determined by the compiler unless you put the attribute [LayoutKind.Sequential/Explicit] which forces the compiler to lay out the structure/class the way you tell it to.
As far as I can tell, the compiler seems to minimize padding while aligning the data types on their natural boundaries (i.e. a 4-byte int starts on a 4-byte address).
I'm focusing on performance, execution speed, not memory usage.
The compiler, without any optimizing switch, will lay the variables out in storage in the same order they are declared in the code.
Imagine something like this:
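```cpp
// The declarations the answer has in mind: two unsigned chars followed by
// a long, in that order.
unsigned char a;
unsigned char b;
long c;
```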
Big mess-up? Without alignment switches, low-memory options and the like, we're going to have one unsigned char taking up a whole 64-bit word on your DDR3 DIMM, another 64-bit word for the other one, and yet another, unavoidable, one for the long.
So that's a fetch per variable.
However, packing them, or re-ordering them, will allow one fetch plus one AND mask to get at the unsigned chars.
So speed-wise, on a current machine with 64-bit word memory, alignments, reorderings and so on are no-nos. I do microcontroller work, and there the differences between packed and non-packed are really noticeable (talking about <10 MIPS processors with 8-bit word memories).
As an aside, it has long been known that the engineering effort required to tweak code for performance, beyond what a good algorithm dictates and what the compiler is able to optimize, often results in burning rubber with no real effect, plus a write-only piece of syntactically dubious code.
The last step forward in optimization I saw (for microcontrollers; I don't think it's doable for PC apps) is to compile your program as a single module, have the compiler optimize it (a much more general view of speed, pointer resolution, memory packing, etc.) and have the linker discard uncalled library functions, methods, etc.
In theory, it could reduce cache misses if you have big objects. But it's usually better to group members of the same size together so you have tighter memory packing.
I highly doubt that would have any bearing on CPU improvements; maybe readability. You can optimize the executable code if the commonly executed basic blocks within a given frame are in the same set of pages. This is the same idea, but I don't know how you would create basic blocks within the code. My guess is that the compiler places functions in the order it sees them, with no optimization here, so you could try to keep common functionality together.
Try running a profiler/optimizer. First you compile with some profiling option, then run your program. Once the profiled exe is complete, it will dump some profiling information. Take this dump and run it through the optimizer as input.
I have been away from this line of work for years, but not much has changed in how they work.