Strange CUDA behavior in a vector multiplication program
I'm having some trouble with a very basic CUDA program. I have a program that multiplies two vectors on the Host and on the Device and then compares them. This works without a problem. Where it goes wrong is when I try to test different numbers of threads and blocks for learning purposes. I have the following kernel:
__global__ void multiplyVectorsCUDA(float *a, float *b, float *c, int N){
    int idx = threadIdx.x;
    if (idx < N)
        c[idx] = a[idx]*b[idx];
}
which I call like:
multiplyVectorsCUDA <<<nBlocks, nThreads>>> (vector_a_d,vector_b_d,vector_c_d,N);
For the moment I've fixed nBlocks to 1, so I only vary the vector size N and the number of threads nThreads. From what I understand, there will be one thread for each multiplication, so N and nThreads should be equal.
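For reference, my understanding of the usual pattern when nBlocks can be greater than 1 is to build a global index from both the block and thread indices and round the block count up so all N elements are covered. A rough sketch (multiplyVectorsGlobalIdx is just an illustrative name, not code I'm actually running):

__global__ void multiplyVectorsGlobalIdx(float *a, float *b, float *c, int N){
    // Global element index built from block and thread indices,
    // so more than one block can contribute.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        c[idx] = a[idx]*b[idx];
}

// Round the block count up so nBlocks * nThreads covers all N elements:
// int nThreads = 16;
// int nBlocks  = (N + nThreads - 1) / nThreads;
// multiplyVectorsGlobalIdx<<<nBlocks, nThreads>>>(vector_a_d, vector_b_d, vector_c_d, N);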
The problem is the following:
- I first call the kernel with N=16 and nThreads<16, which doesn't work. (This is ok.)
- Then I call it with N=16 and nThreads=16, which works fine. (Again, works as expected.)
- But when I call it with N=16 and nThreads<16, it still works!
I don't understand why the last step doesn't fail like the first one. It only fails again if I restart my PC.
Has anyone run into something like this before or can explain this behavior?
Comments (2)
Wait, so are you calling all three in a row? I don't know the rest of your code, but are you sure you're clearing out the graphics memory you alloced between each run? If not, that could explain why it doesn't work the first time but does the third time when you're passing the same values, and why it only fails again after rebooting (rebooting clears all the memory alloced).
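Something like the following, run before each launch, would rule that out. It's just a rough sketch reusing the names from your post, and the error check at the end is my own addition:

// Host-side fragment: wipe the result buffer so leftovers from an earlier
// (successful) launch can't make an under-sized launch look correct.
cudaMemset(vector_c_d, 0, N * sizeof(float));

multiplyVectorsCUDA<<<nBlocks, nThreads>>>(vector_a_d, vector_b_d, vector_c_d, N);

// Check whether the launch itself was rejected (e.g. a bad configuration).
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));

// Make sure the kernel has finished before copying the result back.
cudaDeviceSynchronize();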
Don't know if it's ok to answer my own question, but I realized I had a bug in my code when comparing the host and device vectors (that part of the code wasn't posted). Sorry for the inconvenience. Could someone please close this post, since it won't let me delete it?
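In case it helps anyone else, the comparison should have looked roughly like this. It's a reconstruction for illustration (compareResults and the 1e-5 tolerance are made up for this sketch), not the buggy code I actually had:

#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Copies the device result back and compares it element-wise against the
// host result with a small floating-point tolerance.
bool compareResults(const float *host_c, const float *vector_c_d, int N){
    float *device_c = (float *)malloc(N * sizeof(float));
    cudaMemcpy(device_c, vector_c_d, N * sizeof(float), cudaMemcpyDeviceToHost);

    bool match = true;
    for (int i = 0; i < N; ++i) {
        if (fabsf(host_c[i] - device_c[i]) > 1e-5f) {
            printf("Mismatch at %d: host %f vs device %f\n", i, host_c[i], device_c[i]);
            match = false;
        }
    }
    free(device_c);
    return match;
}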