Why is a = (b>0) ? 1 : 0 better than the if-else version in CUDA?

Posted 2024-11-30 02:35:58


Could you tell me why

a = (b>0) ? 1 : 0;

is better than

if (b>0) a = 1; else a = 0;

in CUDA? Please give details. Many thanks.

Yik


6 Answers

飘落散花 2024-12-07 02:35:58


There was a time when the NVIDIA compiler used idiom testing to generate more efficient code for the ternary operator than for if/then/else constructs. These are the results of a small test to see whether this is still the case:

__global__ void branchTest0(float *a, float *b, float *d)
{
        unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
        float aval = a[tidx], bval = b[tidx];
        float z0 = (aval > bval) ? aval : bval;

        d[tidx] = z0;
}

__global__ void branchTest1(float *a, float *b, float *d)
{
        unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
        float aval = a[tidx], bval = b[tidx];
        float z0;

        if (aval > bval) {
            z0 = aval;
        } else {
            z0 = bval;
        }
        d[tidx] = z0;
}

Compiling these two kernels for compute capability 2.0 with the CUDA 4.0 release compiler, the comparison section produces this:

branchTest0:
max.f32         %f3, %f1, %f2;

and

branchTest1:
setp.gt.f32     %p1, %f1, %f2;
selp.f32        %f3, %f1, %f2, %p1;

The ternary operator gets compiled into a single floating point maximum instruction, whereas the if/then/else gets compiled into two instructions, a compare followed by a select. Both codes are conditionally executed - neither produces branching. The machine code emitted by the assembler for these is also different and closely replicates the PTX:

branchTest0:
    /*0070*/     /*0x00201c00081e0000*/     FMNMX R0, R2, R0, !pt;

and

branchTest1:
    /*0070*/     /*0x0021dc00220e0000*/     FSETP.GT.AND P0, pt, R2, R0, pt;
    /*0078*/     /*0x00201c0420000000*/     SEL R0, R2, R0, P0;

So it would seem that, at least for Fermi GPUs with CUDA 4.0 and this sort of construct, the ternary operator does produce fewer instructions than an equivalent if/then/else. Whether there is a performance difference between them comes down to microbenchmarking data which I don't have.

够钟 2024-12-07 02:35:58


In general, I would recommend writing CUDA code in a natural style and letting the compiler worry about local branching. Besides predication, the GPU hardware also implements "select" type instructions. Using talonmies's framework and sticking in the original poster's code, I find that the same machine code is produced for both versions with the CUDA 4.0 compiler for sm_20. I used -keep to retain intermediate files, and the cuobjdump utility to produce the disassembly. Both the ternary operator and the if-statement are translated into an FCMP instruction, which is a "select" instruction.

The sample case examined by talonmies is actually a special case. The compiler recognizes some common source code idioms, such as the particular ternary expression frequently used to express max() and min() operations, and generates code accordingly. The equivalent if-statement is not recognized as an idiom.

__global__ void branchTest0(float *bp, float *d) 
{         
    unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
    float b = bp[tidx];
    float a = (b>0)?1:0;
    d[tidx] = a;
} 

__global__ void branchTest1(float *bp, float *d)
{
    unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
    float b = bp[tidx];
    float a;
    if (b>0)a=1; else a =0;
    d[tidx] = a;
}

code for sm_20
        Function : _Z11branchTest1PfS_
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x84009c042c000000*/     S2R R2, SR_Tid_X;
/*0010*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
/*0018*/     /*0x10019de218000000*/     MOV32I R6, 0x4;
/*0020*/     /*0x20009ca320044000*/     IMAD R2, R0, c [0x0] [0x8], R2;
/*0028*/     /*0x1020dc435000c000*/     IMUL.U32.U32.HI R3, R2, 0x4;
/*0030*/     /*0x80211c03200d8000*/     IMAD.U32.U32 R4.CC, R2, R6, c [0x0] [0x20];
/*0038*/     /*0x90315c4348004000*/     IADD.X R5, R3, c [0x0] [0x24];
/*0040*/     /*0xa0209c03200d8000*/     IMAD.U32.U32 R2.CC, R2, R6, c [0x0] [0x28];
/*0048*/     /*0x00401c8584000000*/     LD.E R0, [R4];
/*0050*/     /*0xb030dc4348004000*/     IADD.X R3, R3, c [0x0] [0x2c];
/*0058*/     /*0x03f01c003d80cfe0*/     FCMP.LEU R0, RZ, 0x3f800, R0;
/*0060*/     /*0x00201c8594000000*/     ST.E [R2], R0;
/*0068*/     /*0x00001de780000000*/     EXIT;
        ....................................


        Function : _Z11branchTest0PfS_
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x84009c042c000000*/     S2R R2, SR_Tid_X;
/*0010*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
/*0018*/     /*0x10019de218000000*/     MOV32I R6, 0x4;
/*0020*/     /*0x20009ca320044000*/     IMAD R2, R0, c [0x0] [0x8], R2;
/*0028*/     /*0x1020dc435000c000*/     IMUL.U32.U32.HI R3, R2, 0x4;
/*0030*/     /*0x80211c03200d8000*/     IMAD.U32.U32 R4.CC, R2, R6, c [0x0] [0x20];
/*0038*/     /*0x90315c4348004000*/     IADD.X R5, R3, c [0x0] [0x24];
/*0040*/     /*0xa0209c03200d8000*/     IMAD.U32.U32 R2.CC, R2, R6, c [0x0] [0x28];
/*0048*/     /*0x00401c8584000000*/     LD.E R0, [R4];
/*0050*/     /*0xb030dc4348004000*/     IADD.X R3, R3, c [0x0] [0x2c];
/*0058*/     /*0x03f01c003d80cfe0*/     FCMP.LEU R0, RZ, 0x3f800, R0;
/*0060*/     /*0x00201c8594000000*/     ST.E [R2], R0;
/*0068*/     /*0x00001de780000000*/     EXIT;
        ....................................
心欲静而疯不止 2024-12-07 02:35:58


In general you need to avoid branches in CUDA code, otherwise you may get warp divergence which can result in a big performance hit. if/else clauses will normally result in branches based on a test of an expression. One way of eliminating branches is to use an expression which can be implemented without branches if the compiler is smart enough - that way all the threads in a warp follow the same code path.

寄风 2024-12-07 02:35:58


In both cases the compiler is going to try to do the same thing, it will aim to use predicated execution. You can find more information in the CUDA C Programming Guide (available via the website) and also on Wikipedia. Essentially for short branches such as this the hardware is able to emit instructions for both sides of the branch and use a predicate to indicate which threads should actually execute the instructions.

In other words, there would be minimal performance difference. With older compilers the ternary operator sometimes helped, but nowadays they are equivalent.

且行且努力 2024-12-07 02:35:58


I don't know about CUDA, but in C++ and C99 the former lets you initialize a const variable.

int const a = (b>0) ? 1 : 0;

Whereas with the latter, you cannot make your a variable const as you have to declare it before the if.

Note that it could be written even shorter:

int const a = (b>0);

And you could even remove the parentheses... but IMHO that does not improve readability.

烟雨凡馨 2024-12-07 02:35:58


I find it easier to read. It's immediately obvious that the purpose of the whole statement is to set the value of a.

The intent is to assign a to one of two values, and the ternary conditional operator syntax lets you have only one a = in your statement.

I think a standard if/else all on one line is ugly (regardless of what it's used for).
