Why is replacing if-else with bitwise operations slower in CUDA?
I replace
if((nMark >> tempOffset) & 1){nDuplicate++;}
else{nMark = (nMark | (1 << tempOffset));}
with
nDuplicate += ((nMark >> tempOffset) & 1);
nMark = (nMark | (1 << tempOffset));
This replacement turns out to be 5 ms slower on a GT 520 graphics card.
Could you tell me why? Or do you have any idea that could help me improve it?
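For context, a minimal sketch of how the two variants might sit inside a kernel; the kernel name, signature, data layout, and surrounding loop are my assumptions for illustration only (the question only shows the two snippets above):

__global__ void countDuplicates(const int *values, int *duplicates, int nPerThread)
{
    // Hypothetical harness: each thread scans its own chunk of small values
    // (assumed to lie in [0, 32)) and counts repeats with a 32-bit mask.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int *myValues = values + tid * nPerThread;

    int nMark = 0;        // bit i set == value i already seen
    int nDuplicate = 0;

    for (int i = 0; i < nPerThread; ++i) {
        int tempOffset = myValues[i];
#ifdef USE_BRANCHLESS
        // version 2: branchless rewrite
        nDuplicate += ((nMark >> tempOffset) & 1);
        nMark = (nMark | (1 << tempOffset));
#else
        // version 1: original if/else
        if ((nMark >> tempOffset) & 1) { nDuplicate++; }
        else { nMark = (nMark | (1 << tempOffset)); }
#endif
    }
    duplicates[tid] = nDuplicate;
}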
4 Answers
The native instruction set for the GPU deals with small conditions very efficiently via predication. Additionally, the ISET instruction converts a condition code register into an integer with the value 0 or 1, which naturally fits with your conditional increment.
My guess is that the key difference between the first and second formulations is that you've effectively hidden the fact that it's an if/else.
To tell for sure, you can use cuobjdump to look at the microcode generated for the two cases: specify --keep to nvcc and use cuobjdump on the .cubin file to see the disassembled microcode.
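A minimal sketch of that workflow, assuming the source file is called kernel.cu and targeting sm_21 for the GT 520 (the exact name of the kept .cubin file varies with the toolkit version, hence the wildcard):

nvcc -arch=sm_21 --keep -c kernel.cu     # keep intermediate files, including the .cubin
cuobjdump -sass kernel*.cubin            # disassemble the kept cubin into SASS microcode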
Shot in the dark, but in the latter implementation you're always incrementing/re-assigning to the nDuplicate variable, whereas previously you weren't incrementing/assigning to it when the test in the if statement was false. I'm guessing the overhead comes from that, but you don't describe your test data set, so I don't know if that was already the case.
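To make that concrete, here are the two snippets from the question again, annotated with where the writes happen (the comments are mine):

// version 1: exactly one write per iteration, selected by the branch
if ((nMark >> tempOffset) & 1) { nDuplicate++; }   // writes nDuplicate only
else { nMark = (nMark | (1 << tempOffset)); }      // writes nMark only

// version 2: both writes happen every iteration, even when they change nothing
nDuplicate += ((nMark >> tempOffset) & 1);         // adds 0 when the bit is clear
nMark = (nMark | (1 << tempOffset));               // may re-set an already-set bit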
Does your program exhibit significant branch divergence? If you're running e.g. 100 warps and only 5 have divergent behavior, and they run in 5 SMs, you would only see 21 time cycles (expecting 20)... a 5% increase that could easily be defeated by doing 2x the work in each thread to avoid rare divergence.
Barring that, the 520 is a fairly modern graphics card, and might incorporate modern SIMT scheduling techniques, e.g. Dynamic Warp Formation and Thread Block Compaction, to hide SIMT stalls. Maybe look into architectural features (specs) or write a simple benchmark to generate n-way branch divergence and measure the slowdown (a sketch of such a benchmark follows below)?
Barring that, check where your variables live. Does making them shared affect performance/results? Since the second version always accesses all the variables, while the first can avoid accessing nDimension, slow (uncoalesced global?) memory accesses could explain it.
Just some things to think about.
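Picking up the benchmark suggestion above, a minimal sketch of an n-way divergence micro-benchmark (the kernel and all names are my own illustration, not code from the question): each lane of a warp picks one of NWAYS paths, so timing the kernel for NWAYS = 1, 2, 4, ... shows what the hardware charges for divergence.

#define NWAYS 4   // vary from 1 (no divergence) to 32 (fully divergent warp); extend the switch for larger values

__global__ void divergenceBench(int *out, int iterations)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = tid & 31;                 // lane index within the warp
    int acc  = 0;

    for (int i = 0; i < iterations; ++i) {
        switch (lane % NWAYS) {          // NWAYS-way divergence inside every warp
            case 0:  acc += i; break;
            case 1:  acc -= i; break;
            case 2:  acc ^= i; break;
            default: acc |= i; break;
        }
    }
    out[tid] = acc;                      // store the result so the loop is not optimized away
}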
For low-level optimization, it is often helpful to look at the low-level assembly (SASS) of the kernel directly. You can do this with the cuobjdump tool distributed as part of the CUDA Toolkit. Basic usage is to compile with -keep in nvcc and then run cuobjdump on the resulting .cubin file. Then you can see the exact sequence of instructions and compare them. I'm not sure why version 1 would be faster than version 2 of the code, but the SASS listings might give you a clue.
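As a sketch of such a comparison, assuming the two variants live in kernel_if_else.cu and kernel_branchless.cu (my file names) and using nvcc's -cubin output rather than fishing the kept file out of -keep, you could dump and diff both listings like this:

nvcc -arch=sm_21 -cubin -o v1.cubin kernel_if_else.cu     # original if/else version
nvcc -arch=sm_21 -cubin -o v2.cubin kernel_branchless.cu  # branchless rewrite
cuobjdump -sass v1.cubin > v1.sass
cuobjdump -sass v2.cubin > v2.sass
diff v1.sass v2.sass                                      # compare the instruction sequences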