Why is a = (b>0) ? 1 : 0 better than the if-else version in CUDA?

Posted 2024-11-30 02:35:58


Could you tell me why

a = (b>0) ? 1 : 0;

is better than

if (b>0) a = 1; else a = 0;

in CUDA? Please give details. Many thanks.

Yik


6 Answers

飘落散花 2024-12-07 02:35:58


There was a time when the NVIDIA compiler used idiom testing to generate more efficient code for the ternary operator than for if/then/else constructs. These are the results of a small test to see whether this is still the case:

__global__ void branchTest0(float *a, float *b, float *d)
{
        unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
        float aval = a[tidx], bval = b[tidx];
        float z0 = (aval > bval) ? aval : bval;

        d[tidx] = z0;
}

__global__ void branchTest1(float *a, float *b, float *d)
{
        unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
        float aval = a[tidx], bval = b[tidx];
        float z0;

        if (aval > bval) {
            z0 = aval;
        } else {
            z0 = bval;
        }
        d[tidx] = z0;
}

Compiling these two kernels for compute capability 2.0 with the CUDA 4.0 release compiler, the comparison section produces this:

branchTest0:
max.f32         %f3, %f1, %f2;

and

branchTest1:
setp.gt.f32     %p1, %f1, %f2;
selp.f32        %f3, %f1, %f2, %p1;

The ternary operator gets compiled into a single floating point maximum instruction, whereas the if/then/else gets compiled into two instructions, a compare followed by a select. Both codes are conditionally executed - neither produces branching. The machine code emitted by the assembler for these is also different and closely replicates the PTX:

branchTest0:
    /*0070*/     /*0x00201c00081e0000*/     FMNMX R0, R2, R0, !pt;

and

branchTest1:
    /*0070*/     /*0x0021dc00220e0000*/     FSETP.GT.AND P0, pt, R2, R0, pt;
    /*0078*/     /*0x00201c0420000000*/     SEL R0, R2, R0, P0;

So it would seem that, at least for Fermi GPUs with CUDA 4.0 and this sort of construct, the ternary operator does produce fewer instructions than an equivalent if/then/else. Whether there is a performance difference between them comes down to microbenchmarking data which I don't have.

够钟 2024-12-07 02:35:58


In general, I would recommend writing CUDA code in a natural style and letting the compiler worry about local branching. Besides predication, the GPU hardware also implements "select" type instructions. Using talonmies's framework and sticking in the original poster's code, I find that the same machine code is produced for both versions with the CUDA 4.0 compiler for sm_20. I used -keep to retain intermediate files, and the cuobjdump utility to produce the disassembly. Both the ternary operator and the if-statement are translated into an FCMP instruction, which is a "select" instruction.

The sample case examined by talonmies is actually a special case. The compiler recognizes some common source code idioms, such as the particular ternary expression frequently used to express max() and min() operations, and generates code accordingly. The equivalent if-statement is not recognized as an idiom.

__global__ void branchTest0(float *bp, float *d) 
{         
    unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
    float b = bp[tidx];
    float a = (b>0)?1:0;
    d[tidx] = a;
} 

__global__ void branchTest1(float *bp, float *d)
{
    unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
    float b = bp[tidx];
    float a;
    if (b>0)a=1; else a =0;
    d[tidx] = a;
}

code for sm_20
        Function : _Z11branchTest1PfS_
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x84009c042c000000*/     S2R R2, SR_Tid_X;
/*0010*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
/*0018*/     /*0x10019de218000000*/     MOV32I R6, 0x4;
/*0020*/     /*0x20009ca320044000*/     IMAD R2, R0, c [0x0] [0x8], R2;
/*0028*/     /*0x1020dc435000c000*/     IMUL.U32.U32.HI R3, R2, 0x4;
/*0030*/     /*0x80211c03200d8000*/     IMAD.U32.U32 R4.CC, R2, R6, c [0x0] [0x20];
/*0038*/     /*0x90315c4348004000*/     IADD.X R5, R3, c [0x0] [0x24];
/*0040*/     /*0xa0209c03200d8000*/     IMAD.U32.U32 R2.CC, R2, R6, c [0x0] [0x28];
/*0048*/     /*0x00401c8584000000*/     LD.E R0, [R4];
/*0050*/     /*0xb030dc4348004000*/     IADD.X R3, R3, c [0x0] [0x2c];
/*0058*/     /*0x03f01c003d80cfe0*/     FCMP.LEU R0, RZ, 0x3f800, R0;
/*0060*/     /*0x00201c8594000000*/     ST.E [R2], R0;
/*0068*/     /*0x00001de780000000*/     EXIT;
        ....................................


        Function : _Z11branchTest0PfS_
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x84009c042c000000*/     S2R R2, SR_Tid_X;
/*0010*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
/*0018*/     /*0x10019de218000000*/     MOV32I R6, 0x4;
/*0020*/     /*0x20009ca320044000*/     IMAD R2, R0, c [0x0] [0x8], R2;
/*0028*/     /*0x1020dc435000c000*/     IMUL.U32.U32.HI R3, R2, 0x4;
/*0030*/     /*0x80211c03200d8000*/     IMAD.U32.U32 R4.CC, R2, R6, c [0x0] [0x20];
/*0038*/     /*0x90315c4348004000*/     IADD.X R5, R3, c [0x0] [0x24];
/*0040*/     /*0xa0209c03200d8000*/     IMAD.U32.U32 R2.CC, R2, R6, c [0x0] [0x28];
/*0048*/     /*0x00401c8584000000*/     LD.E R0, [R4];
/*0050*/     /*0xb030dc4348004000*/     IADD.X R3, R3, c [0x0] [0x2c];
/*0058*/     /*0x03f01c003d80cfe0*/     FCMP.LEU R0, RZ, 0x3f800, R0;
/*0060*/     /*0x00201c8594000000*/     ST.E [R2], R0;
/*0068*/     /*0x00001de780000000*/     EXIT;
        ....................................
心欲静而疯不止 2024-12-07 02:35:58


In general you need to avoid branches in CUDA code, otherwise you may get warp divergence which can result in a big performance hit. if/else clauses will normally result in branches based on a test of an expression. One way of eliminating branches is to use an expression which can be implemented without branches if the compiler is smart enough - that way all the threads in a warp follow the same code path.

寄风 2024-12-07 02:35:58


In both cases the compiler is going to try to do the same thing, it will aim to use predicated execution. You can find more information in the CUDA C Programming Guide (available via the website) and also on Wikipedia. Essentially for short branches such as this the hardware is able to emit instructions for both sides of the branch and use a predicate to indicate which threads should actually execute the instructions.

In other words, there would be minimal performance difference. With older compilers the ternary operator sometimes helped, but nowadays they are equivalent.

且行且努力 2024-12-07 02:35:58


I don't know about CUDA, but in C++ and C99 the former lets you initialize a const variable.

int const a = (b>0) ? 1 : 0;

Whereas with the latter, you cannot make your a variable const as you have to declare it before the if.

Note that it could be written even shorter:

int const a = (b>0);

And you could even remove the parentheses... but IMHO that does not improve readability.

烟雨凡馨 2024-12-07 02:35:58


I find it easier to read. It's immediately obvious that the purpose of the whole statement is to set the value of a.

The intent is to assign a to one of two values, and the ternary conditional operator syntax lets you have only one a = in your statement.

I think a standard if/else all on one line is ugly (regardless of what it's used for).
