Bit-fields: set vs. test-and-set (for performance)
I have a large number of instances of a C structure like this:
struct mystruct
{
/* ... */
unsigned flag: 1;
/* ... */
};
flag is initially 0 but must be 1 on exit from a certain function.
The simplest implementation is:
void set_flag(struct mystruct *sp)
{
sp->flag = 1U;
}
But what is the likely effect on performance of doing this instead:
void set_flag(struct mystruct *sp)
{
if (sp->flag == 0U)
{
sp->flag = 1U;
}
}
I am hoping to avoid a write to main memory. The first version always does the write, the second version only performs the write if the flag was not already set, but in the vast majority of cases, the flag will already be set.
What other factors (e.g. branch prediction) are likely to affect performance?
I have seen a small speed increase so far, which I hope will become more significant as the data set becomes larger.
Is there a risk of this change making the program slower for large data sets, and if so in what circumstances might this happen?
The test before set does make a difference, but how much depends on your use-case.
The data will end up in a cache-line in either case (whether you just write, or test and set).
However, there is a difference between a cache line tagged as dirty (i.e. modified) and a clean one. Dirty cache-lines have to be written back to main memory, while clean cache-lines can just be forgotten and filled with new data.
Now consider that your code mangles huge amounts of data, and you access each chunk of data only once or twice. If so, you can assume that most of the memory accesses are cache misses. What happens if the majority of your cache-lines are dirty at the point where a cache miss occurs?
They have to be written back to main memory before new data can be loaded into the line. This is slower than just forgetting the content of a cache-line, and it also doubles the memory bandwidth used between the cache and main memory.
That may not make a difference for one CPU core, since memory is fast these days, but another CPU core will (hopefully) be doing some other work as well. You can be sure that the other core will execute everything a tad faster if the bus is not busy moving cache-lines in and out.
In short: keeping your cache-lines clean will halve the bandwidth requirement and make cache-misses a tad cheaper.
Regarding the branch: sure, it's costly, but a cache-miss is much worse! Also, if you're lucky, the CPU will use its out-of-order execution features to offset the cost of the branch against the cache misses.
If you really want to get the best possible performance out of this code, and if most of your accesses are cache-misses, you have two options:
Bypass the cache: the x86 architecture has non-temporal loads and stores for this purpose. They're hidden somewhere in the SSE instruction sets and can be used from C via intrinsics.
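A minimal sketch of that idea, assuming the flag is widened from a 1-bit field to a plain int (a non-temporal store needs an addressable, word-sized target); the struct and function names here are made up for illustration:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86/x86-64 only) */

/* Hypothetical variant of the original struct: the 1-bit field is
 * widened to a plain int so it can be targeted by an intrinsic. */
struct mystruct_nt
{
    int flag;
};

/* _mm_stream_si32 writes around the cache, so the line holding the flag
 * is never pulled in and dirtied just to set it. Only worth it when the
 * structure will not be read again soon. */
void set_flag_nt(struct mystruct_nt *sp)
{
    _mm_stream_si32(&sp->flag, 1);
}
```

The trade-off: if the structure *is* read again shortly afterwards, bypassing the cache forces a full round-trip to main memory, so this only pays off for the streaming access pattern described above.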
(Only for experts): use a few lines of inline assembler to replace the test-and-set function with one that uses the CMOV (conditional move) instruction. This will not only keep your cache-lines clean but also avoid the branch. Note that CMOV is a slow instruction and will only outperform a branch if the branch cannot be predicted, so you had better benchmark your code.
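Since CMOV can only write to a register, not to memory, one portable way to express the same branchless idea in plain C is to select the store *address* without a branch (compilers typically lower the ternary below to a CMOV on x86). This is only a sketch under assumptions: the flag is widened to a plain unsigned (a bit-field cannot be addressed), and the shared scratch variable makes it single-threaded only:

```c
/* Hypothetical variant of the original struct with an addressable flag. */
struct mystruct_bl
{
    unsigned flag;
};

/* Branchless test-and-set: the store always happens, but is redirected
 * to a dummy location when the flag is already set, so the structure's
 * cache line is read yet never dirtied. Not thread-safe. */
void set_flag_branchless(struct mystruct_bl *sp)
{
    static unsigned sink;  /* scratch target for the "no-op" store */
    unsigned *dst = sp->flag ? &sink : &sp->flag;
    *dst = 1U;
}
```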
This is an interesting question, and Nils' answer about cache lines is definitely great advice.
I'd like to emphasize the importance of profiling code to measure real performance -- can you measure how frequently that flag will already be set in the data that you encounter? Performance could change a lot depending on the answer.
Just for fun, I used your code to run a little comparison of set versus test-then-set on a 50-million element array filled with various proportions of 1's. Here's a graph:
(graph not shown; source: natekohl.net)
This is just a toy example, of course. But note the non-linear performance -- which I wasn't expecting -- and that test-then-set becomes faster than plain set when the array is almost entirely filled with 1's.
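A minimal harness along the lines of that comparison might look like the sketch below (the element count is scaled down from the 50 million used above to keep it quick, and all names here are my own invention, not the original benchmark code):

```c
#include <stdlib.h>
#include <time.h>

#define N 5000000  /* scaled down from the 50-million-element array */

struct mystruct { unsigned flag: 1; };

static void plain_set(struct mystruct *p)     { p->flag = 1U; }
static void test_then_set(struct mystruct *p) { if (p->flag == 0U) p->flag = 1U; }

/* Refill the array so roughly `percent` of the flags start as 1, then
 * time one full pass of the given setter over all N elements. */
static double time_pass(void (*setter)(struct mystruct *),
                        struct mystruct *a, int percent)
{
    for (size_t i = 0; i < N; i++)
        a[i].flag = (rand() % 100 < percent) ? 1U : 0U;

    clock_t t0 = clock();
    for (size_t i = 0; i < N; i++)
        setter(&a[i]);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Sweeping `percent` from 0 to 100 and printing both timings for each step reproduces the shape of the graph described above.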
These are my interpretations of your requirement,
Assuming that,
I suggest the following things.
This optimization will likely not cause a slowdown when moving to a larger dataset.
Cache thrashing when reading values will be the same, and branch prediction penalties will also be the same; these are the key factors to optimize for here.
Branch prediction stores history per branch instruction, so it doesn't matter how many structure instances you have, as long as you branch on them with instructions at different addresses (an inlined function, for example). If you have a single function entity (not inlined), you will have one branch instruction shared by all instances; this will hamper branch prediction, making it miss more often and increasing the penalty.
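The inlining point above might be sketched like this (assuming the setter from the question; `static inline` is one common way to get a per-call-site copy, though the compiler may inline on its own anyway):

```c
struct mystruct { unsigned flag: 1; };

/* With `static inline`, each call site gets its own copy of the
 * compare-and-branch, so the branch predictor keeps separate history
 * per site instead of folding every caller's pattern into a single
 * shared branch instruction. */
static inline void set_flag(struct mystruct *sp)
{
    if (sp->flag == 0U)
        sp->flag = 1U;
}
```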
You could always profile, but I am pretty sure the first version is both faster and less obscure.
Either approach will require that the data is loaded into the cache, so your only saving will be a difference between a read/write and a write.
I don't see how this change could make your code slower with larger data sets, so you're likely safe enough on that front.
It smells a little like premature optimisation to me (unless your profiling has identified this as a bottleneck).
As with all things performance related the best way to be sure of the effect of a code change is to measure it. You should be able to create a large amount of test data relatively easily.
If you're really worried about the time performance, change the flag to a full int instead of a bitfield. Then setting it is just a write and not a read-write as with bitfields.
But as already pointed out, this smells of micro-optimization.
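The suggested change might look like the sketch below; the two struct names are made up for the comparison:

```c
/* Original: 1-bit field. Setting it compiles to a read-modify-write
 * (load the word, OR in the bit, store it back). */
struct mystruct_bf  { unsigned flag: 1; };

/* Suggested: full word. Setting it is a single plain store. */
struct mystruct_int { unsigned flag; };

void set_bf(struct mystruct_bf *sp)   { sp->flag = 1U; }
void set_int(struct mystruct_int *sp) { sp->flag = 1U; }
```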
The test before set doesn't make any sense - code without the test is cleaner and also a bit faster.
As a side note - it makes sense to inline functions like this, because the overhead of the function call is bigger than the function body itself, although an optimising compiler should do it without a second thought.
Since nobody else said it, I will.
Why are you using a bit-field at all? The layout will differ from compiler to compiler, so they are useless for interfaces. They may, or may not, be more space efficient; the compiler might just decide to shove them into a 32-bit field so as to pad things efficiently. There are no guarantees they are faster, and in fact they are likely to be slower.
I've banned their usage at work. Unless someone can give me a convincing reason that they provide any extra capability, it's not worth playing with them.
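The space-efficiency point can be checked with `sizeof`, bearing in mind that the result is implementation-defined (the struct names below are illustrative):

```c
#include <stddef.h>

/* Bit-field layout and size are implementation-defined: a typical
 * compiler packs the three 1-bit fields into a single unsigned, but
 * the standard guarantees neither the packing nor where each bit
 * lands, which is why bit-fields are unreliable for interfaces. */
struct with_bitfields { unsigned a: 1, b: 1, c: 1; };
struct with_ints      { unsigned a, b, c; };

size_t bitfield_size(void) { return sizeof(struct with_bitfields); }
size_t plain_size(void)    { return sizeof(struct with_ints); }
```

On common 32/64-bit compilers the bit-field version occupies one word and the plain version three, but comparing the two sizes on your own toolchain is the only way to be sure.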