当前位置：文江博客话题详情

C++ 中的无分支 if 是什么意思？看起来像？

发布于 2024-11-09 22:20:38 字数 71 浏览 1 评论 0原文

我意识到我在该领域非常缺乏知识（这是一种奇怪的说法，我不了解杰克）。

是否有一些关于如何以及何时使用它们的文档？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

乖乖 2024-11-16 22:20:38

除了所有基于旋转的无分支代码（不会涵盖所有内容，例如 FP）之外，您还会获得专门用于创建无分支代码的说明，这些指令将是 SETcc, x86 下的FCMOVcc 和 CMOVcc，它们根据比较中的条件标志执行操作。

一个非常简单的例子是（是的，这个例子太简单了，人们可能永远不会写这样的东西，它只是为了清楚地证明一点）：

bool CheckZero(int x)
{
    if(x == 0)
       return true;

    return false;
    //OR we can use: return (x == 0) ? true : false; 
}

现在一个简单的x86编译器可以将其编译为：

    MOV EAX,[ESP + 4]
    TEXT EAX,EAX
    JE _TRUE
    XOR EAX,EAX
    RETN 4

_TRUE:
    MOV EAX,1
    RETN 4

一个优化的x86编译器将需要将其分解为以下无分支代码（或类似代码）：

MOV ECX,[ESP + 4]
XOR EAX,EAX
TEST ECX,ECX
SETE AL
RETN 4

可以在此处找到更复杂的示例。

然而，这是编译器将执行的事情，而不是您自己应该担心的事情（至少在不分析编译器输出的情况下）。但是，如果要求代码无分支且不会失败，则 C++ 无法提供足够的控制来实现此目的，因此您需要使用（内联）汇编。

Apart from all the twiddling based branchless code (which won't cover everything, such as FP), you get instructions specifically meant for branchless code creation, these would be SETcc, FCMOVcc and CMOVcc under x86, which perform operations based on the condition flags from a comparison.

a really simple example would be (yes, the example is so simple that one would probably never write something like this, its only to demonstrated a point clearly):

bool CheckZero(int x)
{
    if(x == 0)
       return true;

    return false;
    //OR we can use: return (x == 0) ? true : false; 
}

now a simple x86 compiler might compile that down to:

    MOV EAX,[ESP + 4]
    TEXT EAX,EAX
    JE _TRUE
    XOR EAX,EAX
    RETN 4

_TRUE:
    MOV EAX,1
    RETN 4

an optimizing x86 compiler would take it down into the following branchless code (or similar):

MOV ECX,[ESP + 4]
XOR EAX,EAX
TEST ECX,ECX
SETE AL
RETN 4

A little more complex example can be found here.

However, this is something that a compiler will perform, and not some you should worry about yourself (at least not without analyzing the output of your compiler). but if the code is required to be branchless without fail, C++ doesn't provide enough control to do so, so you need to use (inline) assembly.

回复收藏 0 原文

忆沫 2024-11-16 22:20:38

我不久前写过三元逻辑模拟器，这个问题对我来说是可行的，因为它直接影响我的解释器执行速度；我需要尽快模拟大量的三元逻辑门。

在二进制编码三进制系统中，一个 trit 被打包在两位中。最高有效位表示负数，最低有效位表示正数。情况“11”不应发生，但必须正确处理并威胁为 0。

考虑内联 int bct_decoder( unsigned bctData ) 函数，该函数应将格式化的 trit 返回为常规整数 -1、0 或1;据我观察，有 4 种方法：我将它们称为“cond”、“mod”、“math”和“lut”；让我们研究一下它们

首先是基于 jz|jnz 和 jl|jb 条件跳转，因此 cond。它的性能一点也不好，因为依赖于分支预测器。更糟糕的是 - 它会变化，因为不知道是否会有一个或两个先验分支。这是一个例子：

inline int bct_decoder_cond( unsigned bctData ) {
   unsigned lsB = bctData & 1;
   unsigned msB = bctData >> 1;
   return
       ( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
       ( lsB > msB ) ? 1 : -1;
}

这是最慢的版本，在最坏的情况下它可能涉及 2 个分支，这是二进制逻辑失败的地方。在我的 3770k 上，它对随机数据的平均速度约为 200MIPS。（这里和之后 - 每个测试都是随机填充的 2mb 数据集上 1000 次尝试的平均值）

下一个依赖于模运算符，其速度介于第一和第三之间，但绝对更快 - 600 MIPS：

inline int bct_decoder_mod( unsigned bctData ) {
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}

下一个是无分支方法，它只涉及数学，因此数学；它根本不假设跳转指令：

inline int bct_decoder_math( unsigned bctData ) {
    return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}

这做了应该做的事情，并且表现得非常好。相比之下，性能估计为 1000 MIPS，比分支版本快 5 倍。由于缺乏本机 2 位有符号 int 支持，分支版本可能会变慢。但在我的应用程序中，它本身就是一个很好的版本。

如果这还不够，那么我们可以更进一步，有一些特别的东西。接下来是查找表方法：

inline int bct_decoder_lut( unsigned bctData ) {
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[ bctData & 0x3 ];
}

在我的例子中，一个 trit 仅占用 2 位，因此 lut 表只有 2b*4 = 8 字节，值得尝试。它适合缓存并以 1400-1600 MIPS 的速度快速运行，这是我的测量精度下降的地方。这就是快速数学方法的 1.5 倍加速。这是因为您只有预先计算的结果和单个 AND 指令。遗憾的是，缓存很小，并且（如果您的索引长度大于几位）您根本无法使用它。

所以我想我回答了你的问题，关于分支/无分支代码是什么样的。答案要好得多，并且有详细的示例、真实世界的应用和真实的性能测量结果。

I had written ternary logic simulator not so long ago, and this question was viable to me, as it directly affects my interpretator execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.

In a binary-coded-ternary system one trit is packed in two bits. Most significant bit means negative and least significant means positive one. Case "11" should not occur, but it must be handled properly and threated as 0.

Consider inline int bct_decoder( unsigned bctData ) function, which should return our formatted trit as regular integer -1, 0 or 1; As i observed there are 4 approaches: i called them "cond", "mod", "math" and "lut"; Lets investigate them

First is based on jz|jnz and jl|jb conditional jumps, thus cond. Its performance is not good at all, because relies on a branch predictor. And even worse - it varies, because it is unknown if there will be one branch or two a priori. And here is an example:

inline int bct_decoder_cond( unsigned bctData ) {
   unsigned lsB = bctData & 1;
   unsigned msB = bctData >> 1;
   return
       ( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
       ( lsB > msB ) ? 1 : -1;
}

This is slowest version, it could involve 2 branches in worst case and this is something where binary logic fails. On my 3770k it prodices around 200MIPS on average on random data. (here and after - each test is average from 1000 tries on randomly filled 2mb dataset)

Next one relies on modulo operator and its speed is somewhere in between first and third, but is definetely faster - 600 MIPS:

inline int bct_decoder_mod( unsigned bctData ) {
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}

Next one is branchless approach, which involves only maths, thus math; it does not assume jump instrunctions at all:

inline int bct_decoder_math( unsigned bctData ) {
    return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}

This does what is should, and behaves really great. To compare, performance estimate is 1000 MIPS, and it is 5x faster than branched version. Probably branched version is slowed down due to lack of native 2-bit signed int support. But in my application it is quite good version in itself.

If this is not enough then we can go futher, having something special. Next is called lookup table approach:

inline int bct_decoder_lut( unsigned bctData ) {
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[ bctData & 0x3 ];
}

In my case one trit occupied only 2 bits, so lut table was only 2b*4 = 8 bytes, and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, here is where my measurement accuracy is going down. And that is is 1.5x speedup from fast math approach. That's because you just have precalculated result and single AND instruction. Sadly caches are small and (if your index length is greater than several bits) you simply cannot use it.

So i think i answered your question, on what what could branched/branchless code be like. Answer is much better and with detailed samples, real world application and real performance measurements results.

回复收藏 0 原文