How efficient is an if statement compared to a test that doesn't use if? (C++)

Posted 2024-09-04 23:50:09


I need a program to get the smaller of two numbers, and I'm wondering if using a standard "if x is less than y"

int a, b, low;
if (a < b) low = a;
else low = b;

is more or less efficient than this:

int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));

(or the variation of putting int delta = a - b at the top and replacing instances of a - b with that).

I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.


爱要勇敢去追 2024-09-11 23:50:10


Updated answer taking into account the current (2018) state of compiler vectorization. Please see danben's answer for the general case where vectorization is not a concern.

TLDR summary: avoiding ifs can help with vectorization.

Because SIMD would be too complex to allow branching on some elements but not others, any code containing an if statement will fail to be vectorized unless the compiler knows a "superoptimization" technique that can rewrite it into a branchless set of operations. I don't know of any compilers that do this as an integrated part of the vectorization pass (Clang does some of this independently, but not specifically toward helping vectorization, AFAIK).

Using the OP's provided example:

int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));

Many compilers can vectorize this to be something approximately equivalent to:

__m128i low128i(__m128i a, __m128i b){
  __m128i diff, tmp;
  diff = _mm_sub_epi32(a,b);
  tmp = _mm_srai_epi32(diff, 31);
  tmp = _mm_and_si128(tmp,diff);
  return _mm_add_epi32(tmp,b);
}

This optimization would require the data to be laid out in a fashion that would allow for it, but it could be extended to __m256i with AVX2 or __m512i with AVX-512 (and even unroll loops further to take advantage of additional registers), or to other SIMD instructions on other architectures. Another plus is that these instructions are all low-latency, high-throughput instructions (latencies of ~1 and reciprocal throughputs in the range of 0.33 to 0.5 - so really fast relative to non-vectorized code).

I see no reason why compilers couldn't optimize an if statement to a vectorized conditional move (except that the corresponding x86 operations only work on memory locations and have low throughput, and other architectures like ARM may lack them entirely), but it could be done by doing something like:

void lowhi128i(__m128i *a, __m128i *b){ // does both low and high
  __m128i _a=*a, _b=*b;
  __m128i lomask =  _mm_cmpgt_epi32(_a,_b);
  __m128i himask =  _mm_cmpgt_epi32(_b,_a);
  _mm_maskmoveu_si128(_b,lomask,a);
  _mm_maskmoveu_si128(_a,himask,b);
}

However this would have a much higher latency due to memory reads and writes and lower throughput (higher/worse reciprocal throughput) than the example above.

万劫不复 2024-09-11 23:50:10


Unless you're really trying to buckle down on efficiency, I don't think this is something you need to worry about.

My simple thought though is that the if would be quicker because it's comparing one thing, while the other code is doing several operations. But again, I imagine that the difference is minuscule.

旧话新听 2024-09-11 23:50:10


If it is for Gnu C++, try this

int min = i <? j;

I have not profiled it but I think it is definitely the one to beat.

GRAY°灰色天空 2024-09-11 23:50:09


(Disclaimer: the following deals with very low-level optimizations that are most often not necessary. If you keep reading, you waive your right to complain that computers are fast and there is never any reason to worry about this sort of thing.)

One advantage of eliminating an if statement is that you avoid branch prediction penalties.

Branch prediction penalties are generally only a problem when the branch is not easily predicted. A branch is easily predicted when it is almost always taken/not taken, or it follows a simple pattern. For example, the branch in a loop statement is taken every time except the last one, so it is easily predicted. However, if you have code like

a = random() % 10
if (a < 5)
  print "Less"
else
  print "Greater"

then this branch is not easily predicted, and will frequently incur the prediction penalty associated with clearing the cache and rolling back instructions that were executed in the wrong part of the branch.

One way to avoid these kinds of penalties is to use the ternary (?:) operator. In simple cases, the compiler will generate conditional move instructions rather than branches.

So

int a, b, low;
if (a < b) low = a;
else low = b;

becomes

int a, b, low;
low = (a < b) ? a : b;

and in the second case a branching instruction is not necessary. Additionally, it is much clearer and more readable than your bit-twiddling implementation.

Of course, this is a micro-optimization which is unlikely to have significant impact on your code.

囚你心 2024-09-11 23:50:09


Compiling this on gcc 4.3.4, amd64 (core 2 duo), Linux:

int foo1(int a, int b)
{
    int low;
    if (a < b) low = a;
    else low = b;
    return low;
}

int foo2(int a, int b)
{
    int low;
    low = b + ((a - b) & ((a - b) >> 31));
    return low;
}

I get:

foo1:
    cmpl    %edi, %esi
    cmovle  %esi, %edi
    movl    %edi, %eax
    ret

foo2:
    subl    %esi, %edi
    movl    %edi, %eax
    sarl    $31,  %eax
    andl    %edi, %eax
    addl    %esi, %eax
    ret

...which I'm pretty sure won't incur branch prediction penalties, since the code doesn't jump. Also, the non-if-statement version is 2 instructions longer. I think I will continue coding, and let the compiler do its job.

蝶舞 2024-09-11 23:50:09


Simple answer: One conditional jump is going to be more efficient than two subtractions, one addition, a bitwise and, and a shift operation combined. I've been sufficiently schooled on this point (see the comments) that I'm no longer even confident enough to say that it's usually more efficient.

Pragmatic answer: Either way, you're not paying nearly as much for the extra CPU cycles as you are for the time it takes a programmer to figure out what that second example is doing. Program for readability first, efficiency second.

虫児飞 2024-09-11 23:50:09


The biggest problem is that your second example won't work on 64-bit machines.

However, even neglecting that, modern compilers are smart enough to consider branchless alternatives in every case possible and compare the estimated speeds. So your second example will most likely actually be slower.

There will be no difference between the if statement and using a ternary operator, as even the dumbest compilers are smart enough to recognize this special case.


[Edit] Because I think this is such an interesting topic, I've written a blog post on it.

南汐寒笙箫 2024-09-11 23:50:09


Like with any low-level optimization, test it on the target CPU/board setup.

On my compiler (gcc 4.5.1 on x86_64), the first example becomes

cmpl    %ebx, %eax
cmovle  %eax, %esi

The second example becomes

subl    %eax, %ebx
movl    %ebx, %edx
sarl    $31, %edx
andl    %ebx, %edx
leal    (%rdx,%rax), %esi

Not sure if the first one is faster in all cases, but I would bet it is.

一生独一 2024-09-11 23:50:09


Either way, the assembly will only be a few instructions and either way it'll take picoseconds for those instructions to execute.

I would profile the application and concentrate your optimization efforts on something more worthwhile.

Also, the time saved by this type of optimization will not be worth the time wasted by anyone trying to maintain it.

For simple statements like this, I find the ternary operator very intuitive:

low = (a < b) ? a : b;

Clear and concise.

独孤求败 2024-09-11 23:50:09


For something as simple as this, why not just experiment and try it out?

Generally, you'd profile first, identify this as a hotspot, experiment with a change, and view the result.

I wrote a simple program that compares both techniques, passing in random numbers (so that we don't see perfect branch prediction), with Visual C++ 2010. The difference between the approaches on my machine for 100,000,000 iterations? Less than 50 ms total, and the if version tended to be faster. Looking at the codegen, the compiler successfully converted the simple if to a cmovl instruction, avoiding a branch altogether.

诺曦 2024-09-11 23:50:09


One thing to be wary of when you get into really bit-fiddly kinds of hacks is how they may interact with compiler optimizations that take place after inlining. For example, the readable procedure

int foo (int a, int b) {
   return ((a < b) ? a : b);
}

is likely to be compiled into something very efficient in any case, but in some cases it may be even better. Suppose, for example, that someone writes

int bar = foo (x, x+3);

After inlining, the compiler will recognize that 3 is positive, and may then make use of the fact that signed overflow is undefined to eliminate the test altogether, to get

int bar = x;

It's much less clear how the compiler should optimize your second implementation in this context. This is a rather contrived example, of course, but similar optimizations actually are important in practice. Of course you shouldn't accept bad compiler output when performance is critical, but it's likely wise to see if you can find clear code that produces good output before you resort to code that the next, amazingly improved, version of the compiler won't be able to optimize to death.

自由如风 2024-09-11 23:50:09


One thing I will point out, which I haven't seen mentioned: an optimization like this can easily be overwhelmed by other issues. For example, if you are running this routine on two large arrays of numbers (or worse yet, pairs of numbers scattered in memory), the cost of fetching the values on today's CPUs can easily stall the CPU's execution pipelines.

锦上情书 2024-09-11 23:50:09


I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.

Desktop/server CPUs are optimized for pipelining. The second is theoretically faster because the CPU doesn't have to branch and can utilize multiple ALUs to evaluate parts of the expression in parallel. Non-branching code with intermixed independent operations is best for such CPUs. (But even that is negated now by modern "conditional" CPU instructions, which allow making the first version branch-less too.)

On embedded CPUs branching is often less expensive (relative to everything else), nor do they have many spare ALUs to evaluate operations out of order (that's if they support out-of-order execution at all). Less code/data is better - caches are small too. (I have even seen bubble sort used in embedded applications: the algorithm uses the least memory/code and is fast enough for small amounts of information.)

Important: do not forget about compiler optimizations. Using many tricks, compilers can sometimes remove the branching themselves: inlining, constant propagation, refactoring, etc.

But in the end I would say that yes, the difference is too minuscule to be relevant. In the long term, readable code wins.

The way things are going on the CPU front, it is more rewarding to invest time now in making the code multi-threaded and OpenCL-capable.

永不分离 2024-09-11 23:50:09


Why low = a; in the if and low = b; in the else? And why 31? If 31 has anything to do with the CPU word size, what if the code is to be run on a CPU of a different size?

The if..else way looks more readable. I like programs to be as readable to humans as they are to the compilers.

冧九 2024-09-11 23:50:09


profile results with gcc -o foo -g -p -O0, Solaris 9 v240

 %Time Seconds Cumsecs  #Calls   msec/call  Name
  36.8    0.21    0.21 8424829      0.0000  foo2
  28.1    0.16    0.37       1    160.      main
  17.5    0.10    0.47 16850667      0.0000  _mcount
  17.5    0.10    0.57 8424829      0.0000  foo1
   0.0    0.00    0.57       4      0.      atexit
   0.0    0.00    0.57       1      0.      _fpsetsticky
   0.0    0.00    0.57       1      0.      _exithandle
   0.0    0.00    0.57       1      0.      _profil
   0.0    0.00    0.57    1000      0.000   rand
   0.0    0.00    0.57       1      0.      exit

code:

int
foo1 (int a, int b, int low)        
{
   if (a < b) 
      low = a;   
   else 
      low = b;         
   return low;
}

int                       
foo2 (int a, int b, int low)
{
   low = (a < b) ? a : b;
   return low;
}


int main()
{
    int low=0;
    int a=0;
    int b=0;
    int i=500;
    while (i--)
    {
       for(a=rand(), b=rand(); a; a--)
       {
           low=foo1(a,b,low);
           low=foo2(a,b,low);
       }        
    }
    return 0;
}

Based on the data, in the above environment, the exact opposite of several beliefs stated here was found to be true. Note the 'in this environment': the if construct was faster than the ternary ?: construct.

岁月流歌 2024-09-11 23:50:09


I wrote a ternary logic simulator not so long ago, and this question was vital to me, as it directly affects my interpreter's execution speed; I needed to simulate tons and tons of ternary logic gates as fast as possible.

In a binary-coded-ternary system one trit is packed into two bits. The most significant bit means negative and the least significant means positive. The case "11" should not occur, but it must be handled properly and treated as 0.

Consider the inline int bct_decoder( unsigned bctData ) function, which should return our formatted trit as a regular integer -1, 0, or 1. As I observed, there are 4 approaches; I called them "cond", "mod", "math" and "lut". Let's investigate them.

The first is based on jz|jnz and jl|jb conditional jumps, thus "cond". Its performance is not good at all, because it relies on the branch predictor. Even worse, it varies, because it is unknown a priori whether there will be one branch or two. Here is an example:

inline int bct_decoder_cond( unsigned bctData ) {
   unsigned lsB = bctData & 1;
   unsigned msB = bctData >> 1;
   return
       ( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
       ( lsB > msB ) ? 1 : -1;
}

This is the slowest version; it can involve 2 branches in the worst case, and this is where binary logic fails. On my 3770k it produces around 200 MIPS on average on random data. (Here and below, each test is the average of 1000 tries on a randomly filled 2 MB dataset.)

The next one relies on the modulo operator; its speed is somewhere in between the first and the third, but it is definitely faster - 600 MIPS:

inline int bct_decoder_mod( unsigned bctData ) {
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}

The next one is the branchless approach, which involves only math, thus "math"; it does not use jump instructions at all:

inline int bct_decoder_math( unsigned bctData ) {
    return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}

This does what it should, and behaves really well. For comparison, the performance estimate is 1000 MIPS, 5x faster than the branched version. The branched version is probably slowed down by the lack of native 2-bit signed int support. But in my application it is quite a good version in itself.

If this is not enough, then we can go further with something special. Next is the lookup-table approach:

inline int bct_decoder_lut( unsigned bctData ) {
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[ bctData & 0x3 ];
}

In my case one trit occupied only 2 bits, so the lut table was only 2b*4 = 8 bytes and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, which is where my measurement accuracy starts to degrade. That is a 1.5x speedup over the fast math approach, because you just have a precalculated result and a single AND instruction. Sadly, caches are small, and (if your index length is greater than several bits) you simply cannot use this approach.

So I think I have answered your question about what branched/branchless code can look like, with detailed samples, a real-world application, and real performance measurement results.
