C 中 If-Else 和三元运算符之间的速度差异...？

发布于 2024-11-25 06:52:22 字数 650 浏览 4 评论 0原文

因此，在同事的建议下，我刚刚测试了三元运算符和等效的 If-Else 块之间的速度差异......并且似乎三元运算符生成的代码比 If-Else 快 1 倍到 2 倍。我的代码是：（

  gettimeofday(&tv3, 0);
  for(i = 0; i < N; i++)
  {
     a = i & 1;
     if(a) a = b; else a = c;
  }
  gettimeofday(&tv4, 0);


  gettimeofday(&tv1, 0);
  for(i = 0; i < N; i++)
  {
     a = i & 1;
     a = a ? b : c;
  }
  gettimeofday(&tv2, 0);

抱歉使用 gettimeofday 而不是clock_gettime...我会努力改善自己。）

我尝试更改对块进行计时的顺序，但结果似乎仍然存在。什么给？此外，If-Else 在执行速度方面表现出更大的可变性。我应该检查 gcc 生成的程序集吗？

顺便说一句，这都是优化级别零（-O0）。

这是我想象的吗，还是有什么我没有考虑到的，或者这是一个依赖于机器的事情，或者什么？任何帮助表示赞赏。

原文

So at the suggestion of a colleague, I just tested the speed difference between the ternary operator and the equivalent If-Else block... and it seems that the ternary operator yields code that is between 1x and 2x faster than If-Else. My code is:

  gettimeofday(&tv3, 0);
  for(i = 0; i < N; i++)
  {
     a = i & 1;
     if(a) a = b; else a = c;
  }
  gettimeofday(&tv4, 0);


  gettimeofday(&tv1, 0);
  for(i = 0; i < N; i++)
  {
     a = i & 1;
     a = a ? b : c;
  }
  gettimeofday(&tv2, 0);

(Sorry for using gettimeofday and not clock_gettime... I will endeavor to better myself.)

I tried changing the order in which I timed the blocks, but the results seem to persist. What gives? Also, the If-Else shows much more variability in terms of execution speed. Should I be examining the assembly that gcc generates?

By the way, this is all at optimization level zero (-O0).

Am I imagining this, or is there something I'm not taking into account, or is this a machine-dependent thing, or what? Any help is appreciated.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

近箐 2024-12-02 06:52:22

三元运算符很有可能被编译为 cmov，而 if/else 则生成 cmp+jmp。只需查看一下程序集（使用 -S）即可确定。启用优化后，无论如何都不再重要了，因为任何好的编译器都应该在两种情况下生成相同的代码。

回复收藏 0 原文

深海里的那抹蓝 2024-12-02 06:52:22

您还可以完全无分支并衡量它是否有任何区别：

int m = -(i & 1);
a = (b & m) | (c & ~m);

在当今的架构中，这种编程风格已经有点过时了。

You could also go completely branchless and measure if it makes any difference:

int m = -(i & 1);
a = (b & m) | (c & ~m);

On today's architectures, this style of programming has grown a bit out of fashion.

回复收藏 0 原文

孤云独去闲 2024-12-02 06:52:22

这是一个很好的解释：http://www.nynaeve.net/?p=178

基本上，存在“条件集”处理器指令，这比单独指令中的分支和设置更快。

回复收藏 0 原文

宁愿没拥抱 2024-12-02 06:52:22

如果有的话，就换你的编译器吧！

对于此类问题，我使用试用 LLVM 页面。它是 LLVM 的旧版本（仍然使用 gcc 前端），但这些都是旧技巧。

这是我的小示例程序（你的简化版本）：

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main (int argc, char* argv[]) {
  int N = atoi(argv[0]);

  int a = 0, d = 0, b = atoi(argv[1]), c = atoi(argv[2]);

  int i;
  for(i = 0; i < N; i++)
  {
     a = i & 1;
     if(a) a = b+i; else a = c+i;
  }

  for(i = 0; i < N; i++)
  {
     d = i & 1;
     d = d ? b+i : c+i;
  }

  printf("%d %d", a, d);

  return 0;
}

并且生成了相应的 LLVM IR：

define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
  %0 = load i8** %argv, align 8                   ; <i8*> [#uses=1]
  %N = tail call i32 @atoi(i8* %0) nounwind readonly ; <i32> [#uses=5]

  %2 = getelementptr inbounds i8** %argv, i64 1   ; <i8**> [#uses=1]
  %3 = load i8** %2, align 8                      ; <i8*> [#uses=1]
  %b = tail call i32 @atoi(i8* %3) nounwind readonly ; <i32> [#uses=2]

  %5 = getelementptr inbounds i8** %argv, i64 2   ; <i8**> [#uses=1]
  %6 = load i8** %5, align 8                      ; <i8*> [#uses=1]
  %c = tail call i32 @atoi(i8* %6) nounwind readonly ; <i32> [#uses=2]

  %8 = icmp sgt i32 %N, 0                         ; <i1> [#uses=2]
  br i1 %8, label %bb, label %bb11

bb:                                               ; preds = %bb, %entry
  %9 = phi i32 [ %10, %bb ], [ 0, %entry ]        ; <i32> [#uses=2]
  %10 = add nsw i32 %9, 1                         ; <i32> [#uses=2]
  %exitcond22 = icmp eq i32 %10, %N               ; <i1> [#uses=1]
  br i1 %exitcond22, label %bb10.preheader, label %bb

bb10.preheader:                                   ; preds = %bb
  %11 = and i32 %9, 1                             ; <i32> [#uses=1]
  %12 = icmp eq i32 %11, 0                        ; <i1> [#uses=1]
  %.pn13 = select i1 %12, i32 %c, i32 %b          ; <i32> [#uses=1]
  %tmp21 = add i32 %N, -1                         ; <i32> [#uses=1]
  %a.1 = add i32 %.pn13, %tmp21                   ; <i32> [#uses=2]
  br i1 %8, label %bb6, label %bb11

bb6:                                              ; preds = %bb6, %bb10.preheader
  %13 = phi i32 [ %14, %bb6 ], [ 0, %bb10.preheader ] ; <i32> [#uses=2]
  %14 = add nsw i32 %13, 1                        ; <i32> [#uses=2]
  %exitcond = icmp eq i32 %14, %N                 ; <i1> [#uses=1]
  br i1 %exitcond, label %bb10.bb11_crit_edge, label %bb6

bb10.bb11_crit_edge:                              ; preds = %bb6
  %15 = and i32 %13, 1                            ; <i32> [#uses=1]
  %16 = icmp eq i32 %15, 0                        ; <i1> [#uses=1]
  %.pn = select i1 %16, i32 %c, i32 %b            ; <i32> [#uses=1]
  %tmp = add i32 %N, -1                           ; <i32> [#uses=1]
  %d.1 = add i32 %.pn, %tmp                       ; <i32> [#uses=1]
  br label %bb11

bb11:                                             ; preds = %bb10.bb11_crit_edge, %bb10.preheader, %entry
  %a.0 = phi i32 [ %a.1, %bb10.bb11_crit_edge ], [ %a.1, %bb10.preheader ], [ 0, %entry ] ; <i32> [#uses=1]
  %d.0 = phi i32 [ %d.1, %bb10.bb11_crit_edge ], [ 0, %bb10.preheader ], [ 0, %entry ] ; <i32> [#uses=1]
  %17 = tail call i32 (i8*, ...)* @printf(i8* noalias getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %a.0, i32 %d.0) nounwind ; <i32> [#uses=0]
  ret i32 0
}

好的，所以它可能是中文的，尽管我继续重命名了一些变量以使其更易于阅读。

重要的是这两个块：

  %.pn13 = select i1 %12, i32 %c, i32 %b          ; <i32> [#uses=1]
  %tmp21 = add i32 %N, -1                         ; <i32> [#uses=1]
  %a.1 = add i32 %.pn13, %tmp21                   ; <i32> [#uses=2]

  %.pn = select i1 %16, i32 %c, i32 %b            ; <i32> [#uses=1]
  %tmp = add i32 %N, -1                           ; <i32> [#uses=1]
  %d.1 = add i32 %.pn, %tmp                       ; <i32> [#uses=1]

分别设置a和d。

结论是：没有区别

注意：在一个更简单的示例中，两个变量实际上合并了，这里优化器似乎没有检测到相似性...

If there is any, change your compiler!

For this kind of questions I use the Try Out LLVM page. It's an old release of LLVM (still using the gcc front-end), but those are old tricks.

Here is my little sample program (simplified version of yours):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main (int argc, char* argv[]) {
  int N = atoi(argv[0]);

  int a = 0, d = 0, b = atoi(argv[1]), c = atoi(argv[2]);

  int i;
  for(i = 0; i < N; i++)
  {
     a = i & 1;
     if(a) a = b+i; else a = c+i;
  }

  for(i = 0; i < N; i++)
  {
     d = i & 1;
     d = d ? b+i : c+i;
  }

  printf("%d %d", a, d);

  return 0;
}

And there is the corresponding LLVM IR generated:

define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
  %0 = load i8** %argv, align 8                   ; <i8*> [#uses=1]
  %N = tail call i32 @atoi(i8* %0) nounwind readonly ; <i32> [#uses=5]

  %2 = getelementptr inbounds i8** %argv, i64 1   ; <i8**> [#uses=1]
  %3 = load i8** %2, align 8                      ; <i8*> [#uses=1]
  %b = tail call i32 @atoi(i8* %3) nounwind readonly ; <i32> [#uses=2]

  %5 = getelementptr inbounds i8** %argv, i64 2   ; <i8**> [#uses=1]
  %6 = load i8** %5, align 8                      ; <i8*> [#uses=1]
  %c = tail call i32 @atoi(i8* %6) nounwind readonly ; <i32> [#uses=2]

  %8 = icmp sgt i32 %N, 0                         ; <i1> [#uses=2]
  br i1 %8, label %bb, label %bb11

bb:                                               ; preds = %bb, %entry
  %9 = phi i32 [ %10, %bb ], [ 0, %entry ]        ; <i32> [#uses=2]
  %10 = add nsw i32 %9, 1                         ; <i32> [#uses=2]
  %exitcond22 = icmp eq i32 %10, %N               ; <i1> [#uses=1]
  br i1 %exitcond22, label %bb10.preheader, label %bb

bb10.preheader:                                   ; preds = %bb
  %11 = and i32 %9, 1                             ; <i32> [#uses=1]
  %12 = icmp eq i32 %11, 0                        ; <i1> [#uses=1]
  %.pn13 = select i1 %12, i32 %c, i32 %b          ; <i32> [#uses=1]
  %tmp21 = add i32 %N, -1                         ; <i32> [#uses=1]
  %a.1 = add i32 %.pn13, %tmp21                   ; <i32> [#uses=2]
  br i1 %8, label %bb6, label %bb11

bb6:                                              ; preds = %bb6, %bb10.preheader
  %13 = phi i32 [ %14, %bb6 ], [ 0, %bb10.preheader ] ; <i32> [#uses=2]
  %14 = add nsw i32 %13, 1                        ; <i32> [#uses=2]
  %exitcond = icmp eq i32 %14, %N                 ; <i1> [#uses=1]
  br i1 %exitcond, label %bb10.bb11_crit_edge, label %bb6

bb10.bb11_crit_edge:                              ; preds = %bb6
  %15 = and i32 %13, 1                            ; <i32> [#uses=1]
  %16 = icmp eq i32 %15, 0                        ; <i1> [#uses=1]
  %.pn = select i1 %16, i32 %c, i32 %b            ; <i32> [#uses=1]
  %tmp = add i32 %N, -1                           ; <i32> [#uses=1]
  %d.1 = add i32 %.pn, %tmp                       ; <i32> [#uses=1]
  br label %bb11

bb11:                                             ; preds = %bb10.bb11_crit_edge, %bb10.preheader, %entry
  %a.0 = phi i32 [ %a.1, %bb10.bb11_crit_edge ], [ %a.1, %bb10.preheader ], [ 0, %entry ] ; <i32> [#uses=1]
  %d.0 = phi i32 [ %d.1, %bb10.bb11_crit_edge ], [ 0, %bb10.preheader ], [ 0, %entry ] ; <i32> [#uses=1]
  %17 = tail call i32 (i8*, ...)* @printf(i8* noalias getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %a.0, i32 %d.0) nounwind ; <i32> [#uses=0]
  ret i32 0
}

Okay, so it's likely to be chinese, even though I went ahead and renamed some variables to make it a bit easier to read.

The important bits are these two blocks:

  %.pn13 = select i1 %12, i32 %c, i32 %b          ; <i32> [#uses=1]
  %tmp21 = add i32 %N, -1                         ; <i32> [#uses=1]
  %a.1 = add i32 %.pn13, %tmp21                   ; <i32> [#uses=2]

  %.pn = select i1 %16, i32 %c, i32 %b            ; <i32> [#uses=1]
  %tmp = add i32 %N, -1                           ; <i32> [#uses=1]
  %d.1 = add i32 %.pn, %tmp                       ; <i32> [#uses=1]

Which respectively set a and d.

And the conclusion is: No difference

Note: in a simpler example the two variables actually got merged, it seems here that the optimizer did not detect the similarity...

回复收藏 0 原文

剪不断理还乱 2024-12-02 06:52:22

如果打开优化，任何像样的编译器都应该为这些生成相同的代码。

回复收藏 0 原文

深海夜未眠 2024-12-02 06:52:22

请理解，这完全取决于编译器如何解释三元表达式（除非您实际上强制它不使用（内联）asm）。它可以像内部表示语言中的“if..else”一样轻松地理解三元表达式，并且根据目标后端，它可以选择生成条件移动指令（在 x86 上，CMOVcc 就是这样的指令。还应该有最小/最大、绝对值等）。使用条件移动的主要动机是将分支错误预测的风险转移到内存/寄存器移动操作。该指令需要注意的是，几乎所有时候，将有条件加载的操作数寄存器都必须以寄存器形式进行计算，才能利用 cmov 指令。

这意味着无条件评估过程现在必须是无条件的，这似乎会增加程序无条件路径的长度。但请理解，分支错误预测通常被解决为“刷新”管道，这意味着将完成执行的指令被忽略（转向无操作指令）。这意味着由于停顿或 NOP，实际执行的指令数会更高，并且影响会随着处理器流水线的深度和错误预测率而变化。

这给确定正确的启发法带来了一个有趣的困境。首先，我们确信如果管道太浅或者分支预测完全能够从分支历史中学习模式，那么 cmov 就不值得做。如果条件论证的评估成本平均大于错误预测的成本，那么也不值得这样做。

这些可能是编译器难以利用 cmov 指令的核心原因，因为启发式确定很大程度上取决于运行时分析信息。在 JIT 编译器上使用它更有意义，因为它可以提供运行时检测反馈并构建更强大的启发式方法（“分支真的不可预测吗？”）。在没有训练数据或分析器的静态编译器方面，最难假设这何时有用。然而，如上所述，如果编译器知道数据集是完全随机的或强制条件，则一个简单的负面启发式是。放松。评估的成本很高（可能是由于像 fp 除法这样的不可约的、成本高昂的操作），不这样做会是很好的启发。

任何称职的编译器都会完成这一切。问题是，在所有可靠的启发式方法都用完之后，它会做什么......

Understand that it's entirely up to the compiler how it interprets ternary expression (unless you actually force it not to with (inline) asm). It could just as easily understand ternary expression as 'if..else' in its Internal Representation language, and depending on the target backend, it may choose to generate conditional move instruction (on x86, CMOVcc is such one. There should also be ones for min/max, abs, etc). The main motivation of using conditional move is to transfer the risk of branch mispredict to a memory/register move operation. The caveat to this instruction is that nearly all the time, the operand register that will be conditionally loaded will have to be evaluated down to register form to take advantage of the cmov instruction.

This means that the unconditional evaluation process now has to be unconditional, and this will appear to increase the length of the unconditional path of the program. But understand that branch mispredict is most often resolved as 'flushing' the pipeline, which means that the instructions that would have finished executing are ignored (turned to No Operation instructions). This means that the actual number of instructions executed is higher because of the stalls or NOPs, and the effect scales with the depth of the processor pipeline and the misprediction rate.

This brings an interesting dilemma in determining the right heuristics. First, we know for sure that if the pipeline is too shallow or the branch prediction is fully able to learn pattern from branch history, then cmov is not worth doing. It's also not worth doing if the cost of evaluation of conditional argument is greater on than the cost from misprediction on average.

These are perhaps the core reasons why compilers have difficulty exploiting cmov instruction, since the heuristics determination is largely dependent on the runtime profiling information. It makes more sense to use this on JIT compiler since it can provide runtime instrumentation feedback and build a stronger heuristics for using this ("Is the branch truly unpredictable?"). On static compiler side without training data or profiler, it's most difficult to assume when this will be useful. However, a simple negative heuristic is, as aforementioned, if the compiler knows that the dataset is completely random or forcing cond. to uncond. evaluation is costly (perhaps due to irreducible, costly operations like fp divides), it would make good heuristics not to do this.

Any compiler worth its salt will do all that. Question is, what will it do after all dependable heuristics have been used up...

回复收藏 0 原文

~没有更多了~