How efficient is an if statement compared to a test that doesn't use if? (C++)

Posted 2024-09-04 23:50:09


I need a program to get the smaller of two numbers, and I'm wondering if using a standard "if x is less than y"

int a, b, low;
if (a < b) low = a;
else low = b;

is more or less efficient than this:

int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));

(or the variation of putting int delta = a - b at the top and replacing instances of a - b with that).

I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.


爱要勇敢去追 2024-09-11 23:50:10


Updated answer taking into account the current (2018) state of compiler vectorization. Please see danben's answer for the general case where vectorization is not a concern.

TLDR summary: avoiding ifs can help with vectorization.

Because SIMD would be too complex to allow branching on some elements but not others, any code containing an if statement will fail to be vectorized unless the compiler knows a "superoptimization" technique that can rewrite it into a branchless set of operations. I don't know of any compilers that do this as an integrated part of the vectorization pass (Clang does some of this independently, but not specifically toward helping vectorization, AFAIK).

Using the OP's provided example:

int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));

Many compilers can vectorize this to be something approximately equivalent to:

__m128i low128i(__m128i a, __m128i b){
  __m128i diff, tmp;
  diff = _mm_sub_epi32(a,b);
  tmp = _mm_srai_epi32(diff, 31);
  tmp = _mm_and_si128(tmp,diff);
  return _mm_add_epi32(tmp,b);
}

This optimization would require the data to be laid out in a fashion that would allow for it, but it could be extended to __m256i with AVX2 or __m512i with AVX-512 (and even unroll loops further to take advantage of additional registers), or to other SIMD instructions on other architectures. Another plus is that these instructions are all low-latency, high-throughput instructions (latencies of ~1 and reciprocal throughputs in the range of 0.33 to 0.5 - so really fast relative to non-vectorized code).

I see no reason why compilers couldn't optimize an if statement to a vectorized conditional move (except that the corresponding x86 operations only work on memory locations and have low throughput, and other architectures like ARM may lack them entirely), but it could be done by doing something like:

void lowhi128i(__m128i *a, __m128i *b){ // does both low and high
  __m128i _a=*a, _b=*b;
  __m128i lomask =  _mm_cmpgt_epi32(_a,_b);
  __m128i himask =  _mm_cmpgt_epi32(_b,_a);
  _mm_maskmoveu_si128(_b,lomask,a);
  _mm_maskmoveu_si128(_a,himask,b);
}

However this would have a much higher latency due to memory reads and writes and lower throughput (higher/worse reciprocal throughput) than the example above.

万劫不复 2024-09-11 23:50:10


Unless you're really trying to buckle down on efficiency, I don't think this is something you need to worry about.

My simple thought though is that the if would be quicker because it's comparing one thing, while the other code is doing several operations. But again, I imagine that the difference is minuscule.

旧话新听 2024-09-11 23:50:10


If it is for Gnu C++, try this

int min = i <? j;

I have not profiled it but I think it is definitely the one to beat.

GRAY°灰色天空 2024-09-11 23:50:09


(Disclaimer: the following deals with very low-level optimizations that are most often not necessary. If you keep reading, you waive your right to complain that computers are fast and there is never any reason to worry about this sort of thing.)

One advantage of eliminating an if statement is that you avoid branch prediction penalties.

Branch prediction penalties are generally only a problem when the branch is not easily predicted. A branch is easily predicted when it is almost always taken/not taken, or it follows a simple pattern. For example, the branch in a loop statement is taken every time except the last one, so it is easily predicted. However, if you have code like

a = random() % 10
if (a < 5)
  print "Less"
else
  print "Greater"

then this branch is not easily predicted, and will frequently incur the prediction penalty associated with clearing the cache and rolling back instructions that were executed in the wrong part of the branch.

One way to avoid these kinds of penalties is to use the ternary (?:) operator. In simple cases, the compiler will generate conditional move instructions rather than branches.

So

int a, b, low;
if (a < b) low = a;
else low = b;

becomes

int a, b, low;
low = (a < b) ? a : b;

and in the second case a branching instruction is not necessary. Additionally, it is much clearer and more readable than your bit-twiddling implementation.

Of course, this is a micro-optimization which is unlikely to have significant impact on your code.

囚你心 2024-09-11 23:50:09


Compiling this on gcc 4.3.4, amd64 (core 2 duo), Linux:

int foo1(int a, int b)
{
    int low;
    if (a < b) low = a;
    else low = b;
    return low;
}

int foo2(int a, int b)
{
    int low;
    low = b + ((a - b) & ((a - b) >> 31));
    return low;
}

I get:

foo1:
    cmpl    %edi, %esi
    cmovle  %esi, %edi
    movl    %edi, %eax
    ret

foo2:
    subl    %esi, %edi
    movl    %edi, %eax
    sarl    $31,  %eax
    andl    %edi, %eax
    addl    %esi, %eax
    ret

...which I'm pretty sure won't incur branch prediction penalties, since the code doesn't jump. Also, the non-if-statement version is 2 instructions longer. I think I will continue coding, and let the compiler do its job.

蝶舞 2024-09-11 23:50:09


Simple answer: One conditional jump is going to be more efficient than two subtractions, one addition, a bitwise and, and a shift operation combined. I've been sufficiently schooled on this point (see the comments) that I'm no longer even confident enough to say that it's usually more efficient.

Pragmatic answer: Either way, you're not paying nearly as much for the extra CPU cycles as you are for the time it takes a programmer to figure out what that second example is doing. Program for readability first, efficiency second.

虫児飞 2024-09-11 23:50:09


The biggest problem is that your second example won't work on 64-bit machines.

However, even neglecting that, modern compilers are smart enough to consider branchless alternatives in every case possible and compare the estimated speeds. So your second example will most likely actually be slower.

There will be no difference between the if statement and using a ternary operator, as even the dumbest compilers are smart enough to recognize this special case.


[Edit] Because I think this is such an interesting topic, I've written a blog post on it.

南汐寒笙箫 2024-09-11 23:50:09


Like with any low-level optimization, test it on the target CPU/board setup.

On my compiler (gcc 4.5.1 on x86_64), the first example becomes

cmpl    %ebx, %eax
cmovle  %eax, %esi

The second example becomes

subl    %eax, %ebx
movl    %ebx, %edx
sarl    $31, %edx
andl    %ebx, %edx
leal    (%rdx,%rax), %esi

Not sure if the first one is faster in all cases, but I would bet it is.

一生独一 2024-09-11 23:50:09


Either way, the assembly will only be a few instructions and either way it'll take picoseconds for those instructions to execute.

I would profile the application and concentrate your optimization efforts on something more worthwhile.

Also, the time saved by this type of optimization will not be worth the time wasted by anyone trying to maintain it.

For simple statements like this, I find the ternary operator very intuitive:

low = (a < b) ? a : b;

Clear and concise.

独孤求败 2024-09-11 23:50:09


For something as simple as this, why not just experiment and try it out?

Generally, you'd profile first, identify this as a hotspot, experiment with a change, and view the result.

I wrote a simple program that compares both techniques, passing in random numbers (so that we don't see perfect branch prediction), with Visual C++ 2010. The difference between the approaches on my machine for 100,000,000 iterations? Less than 50 ms total, and the if version tended to be faster. Looking at the codegen, the compiler successfully converted the simple if to a cmovl instruction, avoiding a branch altogether.

诺曦 2024-09-11 23:50:09


One thing to be wary of when you get into really bit-fiddly kinds of hacks is how they may interact with compiler optimizations that take place after inlining. For example, the readable procedure

int foo (int a, int b) {
   return ((a < b) ? a : b);
}

is likely to be compiled into something very efficient in any case, but in some cases it may be even better. Suppose, for example, that someone writes

int bar = foo (x, x+3);

After inlining, the compiler will recognize that 3 is positive, and may then make use of the fact that signed overflow is undefined to eliminate the test altogether, to get

int bar = x;

It's much less clear how the compiler should optimize your second implementation in this context. This is a rather contrived example, of course, but similar optimizations actually are important in practice. Of course you shouldn't accept bad compiler output when performance is critical, but it's likely wise to see if you can find clear code that produces good output before you resort to code that the next, amazingly improved, version of the compiler won't be able to optimize to death.

自由如风 2024-09-11 23:50:09


One thing I will point out, which I haven't seen mentioned: an optimization like this can easily be overwhelmed by other issues. For example, if you are running this routine on two large arrays of numbers (or worse yet, pairs of numbers scattered in memory), the cost of fetching the values on today's CPUs can easily stall the CPU's execution pipelines.

锦上情书 2024-09-11 23:50:09


I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.

Desktop/server CPUs are optimized for pipelining. The second is theoretically faster because the CPU doesn't have to branch and can utilize multiple ALUs to evaluate parts of the expression in parallel. Non-branching code with intermixed independent operations is best for such CPUs. (But even that is negated now by modern "conditional" CPU instructions, which allow making the first version branch-less too.)

On embedded CPUs branching is often less expensive (relative to everything else), nor do they have many spare ALUs to evaluate operations out of order (that's if they support out-of-order execution at all). Less code/data is better - caches are small too. (I have even seen bubble sort used in embedded applications: the algorithm uses the least memory/code and is fast enough for small amounts of information.)

Important: do not forget about compiler optimizations. Using many tricks, compilers can sometimes remove the branching themselves: inlining, constant propagation, refactoring, etc.

But in the end I would say that yes, the difference is too minuscule to be relevant. In the long term, readable code wins.

The way things are going on the CPU front, it is more rewarding to invest time now in making the code multi-threaded and OpenCL-capable.

永不分离 2024-09-11 23:50:09


Why low = a; in the if and low = b; in the else? And why 31? If 31 has anything to do with the CPU word size, what if the code is to be run on a CPU of a different size?

The if..else way looks more readable. I like programs to be as readable to humans as they are to the compilers.

冧九 2024-09-11 23:50:09


profile results with gcc -o foo -g -p -O0, Solaris 9 v240

 %Time Seconds Cumsecs  #Calls   msec/call  Name
  36.8    0.21    0.21 8424829      0.0000  foo2
  28.1    0.16    0.37       1    160.      main
  17.5    0.10    0.47 16850667      0.0000  _mcount
  17.5    0.10    0.57 8424829      0.0000  foo1
   0.0    0.00    0.57       4      0.      atexit
   0.0    0.00    0.57       1      0.      _fpsetsticky
   0.0    0.00    0.57       1      0.      _exithandle
   0.0    0.00    0.57       1      0.      _profil
   0.0    0.00    0.57    1000      0.000   rand
   0.0    0.00    0.57       1      0.      exit

code:

int
foo1 (int a, int b, int low)        
{
   if (a < b) 
      low = a;   
   else 
      low = b;         
   return low;
}

int                       
foo2 (int a, int b, int low)
{
   low = (a < b) ? a : b;
   return low;
}


int main()
{
    int low=0;
    int a=0;
    int b=0;
    int i=500;
    while (i--)
    {
       for(a=rand(), b=rand(); a; a--)
       {
           low=foo1(a,b,low);
           low=foo2(a,b,low);
       }        
    }
    return 0;
}

Based on the data, in the above environment, the exact opposite of several beliefs stated here was found to be true. Note the 'in this environment': the if construct was faster than the ternary ?: construct.

岁月流歌 2024-09-11 23:50:09


I wrote a ternary logic simulator not so long ago, and this question was vital to me, as it directly affects my interpreter's execution speed; I needed to simulate tons and tons of ternary logic gates as fast as possible.

In a binary-coded-ternary system one trit is packed into two bits. The most significant bit means negative and the least significant means positive. The case "11" should not occur, but it must be handled properly and treated as 0.

Consider the inline int bct_decoder( unsigned bctData ) function, which should return our formatted trit as a regular integer -1, 0, or 1. As I observed, there are 4 approaches; I called them "cond", "mod", "math" and "lut". Let's investigate them.

The first is based on jz|jnz and jl|jb conditional jumps, thus "cond". Its performance is not good at all, because it relies on the branch predictor. Even worse, it varies, because it is unknown a priori whether there will be one branch or two. Here is an example:

inline int bct_decoder_cond( unsigned bctData ) {
   unsigned lsB = bctData & 1;
   unsigned msB = bctData >> 1;
   return
       ( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
       ( lsB > msB ) ? 1 : -1;
}

This is the slowest version; it can involve 2 branches in the worst case, and this is where binary logic fails. On my 3770k it produces around 200 MIPS on average on random data. (Here and below, each test is the average of 1000 tries on a randomly filled 2 MB dataset.)

The next one relies on the modulo operator; its speed is somewhere in between the first and the third, but it is definitely faster - 600 MIPS:

inline int bct_decoder_mod( unsigned bctData ) {
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}

The next one is the branchless approach, which involves only math, thus "math"; it does not use jump instructions at all:

inline int bct_decoder_math( unsigned bctData ) {
    return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}

This does what it should, and behaves really well. For comparison, the performance estimate is 1000 MIPS, 5x faster than the branched version. The branched version is probably slowed down by the lack of native 2-bit signed int support. But in my application it is quite a good version in itself.

If this is not enough, then we can go further with something special. Next is the lookup-table approach:

inline int bct_decoder_lut( unsigned bctData ) {
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[ bctData & 0x3 ];
}

In my case one trit occupied only 2 bits, so the lut table was only 2b*4 = 8 bytes and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, which is where my measurement accuracy starts to degrade. That is a 1.5x speedup over the fast math approach, because you just have a precalculated result and a single AND instruction. Sadly, caches are small, and (if your index length is greater than several bits) you simply cannot use this approach.

So I think I have answered your question about what branched/branchless code can look like, with detailed samples, a real-world application, and real performance measurement results.
