为什么“a=(b>0)?1:0”比“if-else”更好CUDA版本?
你能告诉我为什么
a =(b>0)?1:0
更好吗
if (b>0)a=1; else a =0;
比CUDA 中的版本 ?请提供详细信息。非常感谢。
易
Could you tell me why
a =(b>0)?1:0
is better than
if (b>0)a=1; else a =0;
version in CUDA? Please give details. Many thanks.
Yik
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
曾经有一段时间,NVIDIA 编译器使用惯用法测试为三元运算符生成比 if/then/else 结构更高效的代码。这是一个小测试的结果,看看情况是否仍然如此:
使用 CUDA 4.0 版本编译器编译这两个用于计算能力 2.0 的内核,比较部分产生以下结果:
三元
运算符被编译为单个浮点最大值指令,而 if/then/else 被编译成两条指令,一个比较,然后一个选择。两个代码都是有条件执行的——都不产生分支。汇编器发出的机器代码也不同,并且紧密复制 PTX:
因此
,至少对于具有此类构造的 CUDA 4.0 的 Fermi GPU,三元运算符确实产生的指令少于等效的 if /然后/否则。它们之间是否存在性能差异取决于我没有的微基准测试数据。
There was a time when the NVIDIA compiler used idiom testing to generate more efficient code for the ternary operator than if/then/else constructs. This is the results of a small test to see whether this is still the case:
Compiling these two kernels for compute capability 2.0 with the CUDA 4.0 release compiler, the comparison section produces this:
and
The ternary operator gets compiled into a single floating point maximum instruction, whereas the if/then/else gets compiled into two instructions, a compare followed by a select. Both codes are conditionally executed - neither produces branching. The machine code emitted by the assembler for these is also different and closely replicates the PTX:
and
So it would seem that, at least for Fermi GPUs with CUDA 4.0 with this sort of construct, the ternary operator does produce fewer instructions that an equivalent if/then/else. Whether there is a performance difference between them comes down to microbenchmarking data which I don't have.
一般来说,我建议以自然的风格编写 CUDA 代码,并让编译器担心本地分支。除了预测之外,GPU 硬件还实现“选择”类型指令。使用talonmies的框架并粘贴原始海报的代码,我发现使用sm_20的CUDA 4.0编译器为两个版本生成相同的机器代码。我使用 -keep 保留中间文件,并使用 cuobjdump 实用程序生成反汇编。三元运算符和if语句都被翻译成FCMP指令,这是一条“选择”指令。
talonmies检查的样本案例实际上是一个特例。编译器可以识别一些常见的源代码习惯用法,例如经常用于表达 max() 和 min() 操作的特定三元表达式,并相应地生成代码。等效的 if 语句不被识别为习语。
In general, I would recommend to write CUDA code in a natural style, and let the compiler worry about local branching. Besides predication, the GPU hardware also implements "select" type instructions. Using talonmies's framework and sticking in the original poster's code, I find that the same machine code is produced for both versions with the CUDA 4.0 compiler for sm_20. I used -keep to retain intermediate files, and the cuobjdump utility to produce the disassembly. Both the ternary operator and the if-statement are translated into an FCMP instruction, which is a "select" instruction.
The sample case examined by talonmies is actually a special case. The compiler recognizes some common source code idioms, such as the particular ternary expression frequently used to express max() and min() operations, and generates code accordingly. The equivalent if-statement is not recognized as an idiom.
一般来说,您需要避免 CUDA 代码中的分支,否则可能会出现扭曲发散,这可能会导致性能大幅下降。
if
/else
子句通常会产生基于表达式测试的分支。消除分支的一种方法是使用一个可以在没有分支的情况下实现的表达式(如果编译器足够智能的话)——这样,warp 中的所有线程都遵循相同的代码路径。In general you need to avoid branches in CUDA code, otherwise you may get warp divergence which can result in a big performance hit.
if
/else
clauses will normally result in branches based on a test of an expression. One way of eliminating branches is to use an expression which can be implemented without branches if the compiler is smart enough - that way all the threads in a warp follow the same code path.在这两种情况下,编译器都会尝试做同样的事情,它的目标是使用谓词执行。您可以在 CUDA C 编程指南(可通过 网站 获取)以及 < a href="http://en.wikipedia.org/wiki/Branch_predicate" rel="nofollow">维基百科。本质上,对于像这样的短分支,硬件能够为分支的两侧发出指令,并使用谓词来指示哪些线程应该实际执行指令。
换句话说,性能差异很小。对于较旧的编译器,三级运算符有时会有所帮助,但现在它们是等效的。
In both cases the compiler is going to try to do the same thing, it will aim to use predicated execution. You can find more information in the CUDA C Programming Guide (available via the website) and also on Wikipedia. Essentially for short branches such as this the hardware is able to emit instructions for both sides of the branch and use a predicate to indicate which threads should actually execute the instructions.
In other words, there would be minimal performance difference. With older compilers the tertiary operator sometimes helped, but nowadays they are equivalent.
不知道 CUDA,但在 C++ 和 C99 中,使用前者可以初始化 const 变量。
而对于后者,您不能将
a
变量设置为 const,因为您必须在if
之前声明它。请注意,它可以写得更短:
您甚至可以删除括号...但恕我直言,它不会提高阅读能力。
Don't know for CUDA, but in C++ and C99, using the former you can initialize a const variable.
Whereas with the latter, you cannot make your
a
variable const as you have to declare it before theif
.Note that it could be written even shorter:
And you could even remove the parenthesis ... but IMHO it does not improve reading.
我发现它更容易阅读。很明显,整个语句的目的是设置
a
的值。目的是将
a
分配给两个值之一,三元条件运算符语法允许您在语句中只有一个a =
。我认为标准的 if/else 都在一行上是丑陋的(无论它的用途是什么)。
I find it easier to read. It's immediately obvious that the purpose of the whole statement is to set the value of
a
.The intent is to assign
a
to one of two values, and the ternary conditional operator syntax lets you have only onea =
in your statement.I think a standard if/else all on one line is ugly (regardless of what it's used for).