CUDA kernel function takes longer than the equivalent host function
I'm following along with http://code.google.com/p/stanford-cs193g-sp2010/ and the video lectures posted online. While doing one of the posted problem sets (the first one), I've encountered something slightly counterintuitive, at least with respect to the way the question is asked. The question asks me to derive a timing model for execution time on the CPU and GPU, assuming linear scaling, based on timings from a sample application run on my own machine.
- Plug the timing numbers printed by the code on the computer you're working on into that equation and report what the break-even point (when the CPU version is as fast as the GPU version) will be.
The issue I'm having is that my kernel is taking a lot longer than the host version of the equivalent function (I'll post both below), such that there is no break even point. The numbers I'm getting are as follows.
done with copy to gpu kernel
copy to gpu took 26.30630 ms
done with gpu shift cypher kernel
gpu shift cypher took 7.33203 ms
done with copy from gpu kernel
copy from gpu took 28.54141 ms
host shift cypher took 0.00186 ms
Worked! CUDA and reference output match.
Do you think there is something wrong with the way I'm doing things? Here are the kernel and host functions.
// This kernel implements a per element shift
__global__ void shift_cypher(unsigned int *input_array, unsigned int *output_array,
                             unsigned int shift_amount, unsigned int alphabet_max,
                             unsigned int array_length)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    output_array[gid] = (input_array[gid] + shift_amount) % (alphabet_max + 1);
}

// Reference CPU implementation of the same per-element shift
void host_shift_cypher(unsigned int *input_array, unsigned int *output_array,
                       unsigned int shift_amount, unsigned int alphabet_max,
                       unsigned int array_length)
{
    for (unsigned int i = 0; i < array_length; i++)
    {
        int element = input_array[i];
        int shifted = element + shift_amount;
        if (shifted > alphabet_max)
        {
            shifted = shifted % (alphabet_max + 1);
        }
        output_array[i] = shifted;
    }
}
The sample application runs with 16MB of integer elements, with a block size of 512. Here is the full source for the file in question http://pastebin.com/htYH0bA2
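For reference, a minimal sketch of what the host-side setup for this configuration might look like: 16 MB of unsigned ints is 4,194,304 elements, so with a block size of 512 the grid is exactly 8,192 blocks. The buffer and variable names below are illustrative, not taken from the pastebin source.

// Illustrative host-side setup for the 16 MB / block-size-512 configuration
// (names like h_input/d_input are hypothetical; the actual harness is in the pastebin link).
const unsigned int array_length = (16 * 1024 * 1024) / sizeof(unsigned int); // 4,194,304 elements
const unsigned int block_size   = 512;
const unsigned int num_blocks   = array_length / block_size;                 // 8,192 (divides evenly)
const size_t       num_bytes    = array_length * sizeof(unsigned int);

unsigned int *d_input = 0, *d_output = 0;
cudaMalloc((void **)&d_input,  num_bytes);
cudaMalloc((void **)&d_output, num_bytes);

cudaMemcpy(d_input, h_input, num_bytes, cudaMemcpyHostToDevice);             // "copy to gpu"
shift_cypher<<<num_blocks, block_size>>>(d_input, d_output,
                                         shift_amount, alphabet_max, array_length);
cudaMemcpy(h_output, d_output, num_bytes, cudaMemcpyDeviceToHost);           // "copy from gpu"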
2 Answers
This looks very odd. Whatever you do with 16 MB on a CPU, it should take more than a fraction of a millisecond.
By looking at the pastebin code, it seems you time everything with CUDA events. Although I have not used them myself, my guess is that they measure the actual time GPU kernels spend executing, which, in the case of just calling host code, will be next to nothing. Is this really how the Stanford course measures host code execution?
You could prove me wrong by just checking this result with any kind of C timer.
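One way to sanity-check the host number (assuming a C++11 compiler is available) is to time host_shift_cypher with a plain CPU clock. For scale, touching 16 MB of input plus 16 MB of output in 0.00186 ms would imply memory traffic on the order of 10 TB/s, far beyond what any CPU can sustain. A minimal sketch, with illustrative shift/alphabet values:

// Sketch: timing the host function with std::chrono instead of CUDA events.
#include <chrono>
#include <cstdio>
#include <vector>

// Defined as in the question.
void host_shift_cypher(unsigned int *input_array, unsigned int *output_array,
                       unsigned int shift_amount, unsigned int alphabet_max,
                       unsigned int array_length);

int main()
{
    const unsigned int array_length = (16 * 1024 * 1024) / sizeof(unsigned int);
    std::vector<unsigned int> input(array_length, 1), output(array_length);

    auto start = std::chrono::high_resolution_clock::now();
    host_shift_cypher(input.data(), output.data(), 3, 25, array_length);
    auto stop = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    std::printf("host shift cypher took %.5f ms\n", ms);
    return 0;
}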
This was a problem with the timer, as w.m pointed out. The issue was, I believe, that the event-recording functions in the timer handed off control to the CPU-based host function before recording the event. It's sort of confusing, because you would think the start event would be recorded within the time the host code is executing, but it seems it was doing something more like recording the start and stop events simultaneously, both after the host code finished executing. Adding a
cudaThreadSynchronize();
to the start timer seems to fix the problem (ensuring the event gets recorded before continuing with the host code). This could be a Windows-only discrepancy, or something specific to my CUDA version or hardware; I'm not sure. In any case, my new, much more normal results are as follows.
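For reference, a sketch of the timing pattern described above: synchronize before recording the start event, and synchronize on the stop event before reading the elapsed time. The variable names are illustrative and assume the question's buffers and parameters are in scope; the problem set's actual timer wrapper may be structured differently.

// CUDA event timing with an explicit sync before recording the start event.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaThreadSynchronize();      // flush pending GPU work first (cudaDeviceSynchronize() on newer toolkits)
cudaEventRecord(start, 0);

host_shift_cypher(input_array, host_output_array, shift_amount, alphabet_max, array_length);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the stop event has actually been recorded

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("host shift cypher took %.5f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);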