C++ 中的无分支 if 是什么意思?看起来像?
我意识到我在该领域非常缺乏知识(这是一种奇怪的说法,我不了解杰克)。
是否有一些关于如何以及何时使用它们的文档?
I realize I've got a great lack of knowledge in that area (fancy way of saying I don't know jack about it).
Is there some documentation as to how and when to use them?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
除了所有基于旋转的无分支代码(不会涵盖所有内容,例如 FP)之外,您还会获得专门用于创建无分支代码的说明,这些指令将是
SETcc
, x86 下的FCMOVcc
和CMOVcc
,它们根据比较中的条件标志执行操作。一个非常简单的例子是(是的,这个例子太简单了,人们可能永远不会写这样的东西,它只是为了清楚地证明一点):
现在一个简单的x86编译器可以将其编译为:
一个优化的x86编译器将需要将其分解为以下无分支代码(或类似代码):
可以在此处找到更复杂的示例。
然而,这是编译器将执行的事情,而不是您自己应该担心的事情(至少在不分析编译器输出的情况下)。但是,如果要求代码无分支且不会失败,则 C++ 无法提供足够的控制来实现此目的,因此您需要使用(内联)汇编。
Apart from all the twiddling based branchless code (which won't cover everything, such as FP), you get instructions specifically meant for branchless code creation, these would be
SETcc
,FCMOVcc
andCMOVcc
under x86, which perform operations based on the condition flags from a comparison.a really simple example would be (yes, the example is so simple that one would probably never write something like this, its only to demonstrated a point clearly):
now a simple x86 compiler might compile that down to:
an optimizing x86 compiler would take it down into the following branchless code (or similar):
A little more complex example can be found here.
However, this is something that a compiler will perform, and not some you should worry about yourself (at least not without analyzing the output of your compiler). but if the code is required to be branchless without fail, C++ doesn't provide enough control to do so, so you need to use (inline) assembly.
我不久前写过三元逻辑模拟器,这个问题对我来说是可行的,因为它直接影响我的解释器执行速度;我需要尽快模拟大量的三元逻辑门。
在二进制编码三进制系统中,一个 trit 被打包在两位中。最高有效位表示负数,最低有效位表示正数。情况“11”不应发生,但必须正确处理并威胁为 0。
考虑内联 int bct_decoder( unsigned bctData ) 函数,该函数应将格式化的 trit 返回为常规整数 -1、0 或1;据我观察,有 4 种方法:我将它们称为“cond”、“mod”、“math”和“lut”;让我们研究一下它们
首先是基于 jz|jnz 和 jl|jb 条件跳转,因此 cond。它的性能一点也不好,因为依赖于分支预测器。更糟糕的是 - 它会变化,因为不知道是否会有一个或两个先验分支。这是一个例子:
这是最慢的版本,在最坏的情况下它可能涉及 2 个分支,这是二进制逻辑失败的地方。在我的 3770k 上,它对随机数据的平均速度约为 200MIPS。 (这里和之后 - 每个测试都是随机填充的 2mb 数据集上 1000 次尝试的平均值)
下一个依赖于模运算符,其速度介于第一和第三之间,但绝对更快 - 600 MIPS:
下一个是无分支方法,它只涉及数学,因此数学;它根本不假设跳转指令:
这做了应该做的事情,并且表现得非常好。相比之下,性能估计为 1000 MIPS,比分支版本快 5 倍。由于缺乏本机 2 位有符号 int 支持,分支版本可能会变慢。但在我的应用程序中,它本身就是一个很好的版本。
如果这还不够,那么我们可以更进一步,有一些特别的东西。接下来是查找表方法:
在我的例子中,一个 trit 仅占用 2 位,因此 lut 表只有 2b*4 = 8 字节,值得尝试。它适合缓存并以 1400-1600 MIPS 的速度快速运行,这是我的测量精度下降的地方。这就是快速数学方法的 1.5 倍加速。这是因为您只有预先计算的结果和单个
AND
指令。遗憾的是,缓存很小,并且(如果您的索引长度大于几位)您根本无法使用它。所以我想我回答了你的问题,关于分支/无分支代码是什么样的。答案要好得多,并且有详细的示例、真实世界的应用和真实的性能测量结果。
I had written ternary logic simulator not so long ago, and this question was viable to me, as it directly affects my interpretator execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.
In a binary-coded-ternary system one trit is packed in two bits. Most significant bit means negative and least significant means positive one. Case "11" should not occur, but it must be handled properly and threated as 0.
Consider
inline int bct_decoder( unsigned bctData )
function, which should return our formatted trit as regular integer -1, 0 or 1; As i observed there are 4 approaches: i called them "cond", "mod", "math" and "lut"; Lets investigate themFirst is based on jz|jnz and jl|jb conditional jumps, thus cond. Its performance is not good at all, because relies on a branch predictor. And even worse - it varies, because it is unknown if there will be one branch or two a priori. And here is an example:
This is slowest version, it could involve 2 branches in worst case and this is something where binary logic fails. On my 3770k it prodices around 200MIPS on average on random data. (here and after - each test is average from 1000 tries on randomly filled 2mb dataset)
Next one relies on modulo operator and its speed is somewhere in between first and third, but is definetely faster - 600 MIPS:
Next one is branchless approach, which involves only maths, thus math; it does not assume jump instrunctions at all:
This does what is should, and behaves really great. To compare, performance estimate is 1000 MIPS, and it is 5x faster than branched version. Probably branched version is slowed down due to lack of native 2-bit signed int support. But in my application it is quite good version in itself.
If this is not enough then we can go futher, having something special. Next is called lookup table approach:
In my case one trit occupied only 2 bits, so lut table was only 2b*4 = 8 bytes, and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, here is where my measurement accuracy is going down. And that is is 1.5x speedup from fast math approach. That's because you just have precalculated result and single
AND
instruction. Sadly caches are small and (if your index length is greater than several bits) you simply cannot use it.So i think i answered your question, on what what could branched/branchless code be like. Answer is much better and with detailed samples, real world application and real performance measurements results.
http://hbfs.wordpress.com/2008/ 08/05/branchless-equivalents-of-simple-functions/
我认为(虽然我不知道比我在页面上读到的更多)这是一个在没有分支的情况下获得 if 功能的方法(基于“branchless if”一词是有意义的;))。不知道更多了。
感谢谷歌先生。
http://hbfs.wordpress.com/2008/08/05/branchless-equivalents-of-simple-functions/
I think (though I don't know more than what I read on the page) it is a way of getting if functionality without the branching (which makes sense based on the words branchless if ;)). Don't know more.
Thank Mr. Google.