CUDA multiplication
The serial code snippet looks like this:

int i, j;
for (j = 0; j < ny; j++)
{
    for (i = 0; i < nx; i++)
    {
        x[i + j*nx] *= y[i];
    }
}
I converted this to CUDA using this kernel:

int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i, j;
for (tid = 0; tid < nx*ny; tid++)
{
    j = tid / nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
However, the GPU kernel does not give any speedup. Any suggestions for a better solution? Thanks in advance.
Comments (4)
If this is the serial code:
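int i, j;
for (j = 0; j < ny; j++)
{
    for (i = 0; i < nx; i++)
    {
        x[i + j*nx] *= y[i];
    }
}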
then you should be doing this:
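A minimal sketch along those lines, keeping the question's indexing but letting each thread handle exactly one element (the bounds check is an added precaution, not from the original):

int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < nx * ny)
{
    int j = tid / nx;
    int i = tid - j*nx;   // column index, equivalent to tid % nx
    x[tid] *= y[i];       // each thread performs exactly one multiply
}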
What you're doing is fairly bizarre: you're instructing each thread of the CUDA kernel to iterate over all values of tid between 0 and nx*ny, and compute the same function as your CPU version! Moreover, instead of just iterating over the indices, you're actually doing the loop less efficiently than you did for the CPU version; in other words, each thread does the same work as the single CPU thread, only less efficiently. It's no wonder that this is slower; it should be much, much slower. Your CUDA kernel is:
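int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i, j;
for (tid = 0; tid < nx*ny; tid++)
{
    j = tid / nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}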
This does nx*ny iterations, same as your host code, for each thread; you lose all benefit of the parallelism, since each thread is doing the same thing; you would get the same performance using one thread on the GPU, and the same result!
If this is the verbatim code from your CUDA source file, you need to change it and redo the comparison; if this is code you have written to help explain what your code is doing for a lay non-CUDA audience, then you need to present your actual CUDA code so that we can see what's going on... as it is, the performance analysis I have done - the trivial one - is all you can expect.
Given your comment to this answer, implying you are intending to launch one thread per computation, the correct CUDA implementation would just be:
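A minimal sketch of that one-thread-per-element version (the kernel name, the device pointers d_x and d_y, and the 256-thread block size are illustrative):

__global__ void multiply(float *x, const float *y, int nx, int ny)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < nx * ny)
        x[tid] *= y[tid % nx];   // one multiply per thread
}

// launched with one thread per element of x:
int n = nx * ny;
multiply<<<(n + 255) / 256, 256>>>(d_x, d_y, nx, ny);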
If you were intending for each thread to perform more than one computation per kernel launch, then you would size the grid to "fill" each of the SMs on the target GPU, rather than using the same number of threads as the input size, and then do something like:
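A sketch of that variant, using a grid-stride loop so a fixed-size grid covers the whole input (the kernel name and the grid-sizing factors are illustrative):

__global__ void multiply_strided(float *x, const float *y, int nx, int n)
{
    int stride = blockDim.x * gridDim.x;
    // each thread processes elements tid, tid + stride, tid + 2*stride, ...
    for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += stride)
        x[tid] *= y[tid % nx];
}

// e.g. a few blocks per SM rather than one thread per element:
int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
multiply_strided<<<numSMs * 8, 256>>>(d_x, d_y, nx, nx * ny);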
That would get you at least coalesced reads and writes to x, and remove the enormous number of redundant calculations in your posted version. There are a number of further optimizations that could be made, but that would require more information about the problem than has been supplied in the question and subsequent comments. Your indexing scheme contains an integer division and then an integer multiply-add per calculation. That is a lot of overhead for a single FLOP per input value. However, having said all of that, if the problem size I quoted is the actual problem size you are interested in, the GPU will never be faster than even a modest host CPU. You would require problems many orders of magnitude larger to realize a useful speedup using the GPU for this sort of low-arithmetic-intensity operation.
How big is the block? It may be that the time needed to copy a small amount of data to the GPU and set up the environment is much longer than the calculation time.

Remember also that CUDA does a JIT compile on the first run, so to get accurate benchmarking you need to run it many times.
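A minimal timing sketch along those lines (the kernel name and launch configuration are placeholders): do one warm-up launch to absorb the one-time JIT and setup cost, then average over many timed runs.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

multiply<<<blocks, threads>>>(d_x, d_y, nx, ny);   // warm-up: absorbs JIT/setup cost
cudaDeviceSynchronize();

cudaEventRecord(start);
for (int run = 0; run < 100; ++run)
    multiply<<<blocks, threads>>>(d_x, d_y, nx, ny);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("average kernel time: %.3f ms\n", ms / 100.0f);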
Try this using shared memory. One of the best implementations around:
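A minimal shared-memory sketch for this operation, assuming each block owns a TILE-wide strip of columns and stages the matching slice of y once, reusing it for every row (the kernel name and TILE width are illustrative; for this access pattern a per-thread register would work just as well, but the staging shows the shared-memory idiom):

#define TILE 256

__global__ void multiply_shared(float *x, const float *y, int nx, int ny)
{
    __shared__ float ys[TILE];
    int col = blockIdx.x * TILE + threadIdx.x;

    // stage this block's slice of y into shared memory once
    if (col < nx)
        ys[threadIdx.x] = y[col];
    __syncthreads();

    // reuse the staged values across all ny rows; consecutive threads
    // touch consecutive addresses, so the accesses to x are coalesced
    if (col < nx)
        for (int row = 0; row < ny; ++row)
            x[col + row * nx] *= ys[threadIdx.x];
}

// one block per TILE columns:
multiply_shared<<<(nx + TILE - 1) / TILE, TILE>>>(d_x, d_y, nx, ny);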