Shared memory problem when debugging

Posted on 2024-11-15 22:48:07

I am trying to use Nsight to debug the following code:

__device__ void change(int shared[])
{
    if(threadIdx.x<10)
        shared[threadIdx.x]=threadIdx.x;
}
__global__ void MyK()
{
    int shared[10]; 
    change(shared);
    __syncthreads();
}

I call my kernel from the main method like this:

cudaSetDevice(1);
MyK<<<1,20>>>();

When I put a breakpoint before change(shared), I can see that the shared array has been created and its values are set to 0. However, if I put the breakpoint after __syncthreads(), the debugger shows the following error:

cannot resolve name shared

Can't I pass my shared array to another device function?

Comments (3)

凶凌 2024-11-22 22:48:07

You see "cannot resolve name shared" in the memory watch window because the compiler optimizes the shared array away: no part of your kernel uses it after change(shared). As @user586831 mentioned earlier, try returning the value from your device function so that it is actually used.

On another note, I am not sure whether you really meant a __shared__ array or are just referring to the array by its name, shared. Either way, the code above does not use shared memory: int shared[10] is an ordinary local integer array. You need the __shared__ qualifier to declare shared memory. The extern form is meant for arrays declared without a size, whose size is supplied at launch time, so for a fixed-size array drop extern, e.g.

__shared__ int shared[10];
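
Both points can be sketched as follows. This is only an illustration under assumptions: the output buffer `out` and the host helper `run_MyK` are hypothetical names that do not appear in the original code, and the default device is used instead of `cudaSetDevice(1)`. The array is declared with `__shared__`, and its contents are copied to global memory after the barrier so the compiler has no reason to treat it as dead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same helper as in the question: the first 10 threads fill the array.
__device__ void change(int shared[])
{
    if (threadIdx.x < 10)
        shared[threadIdx.x] = threadIdx.x;
}

// `out` is a hypothetical output buffer added for illustration.
__global__ void MyK(int *out)
{
    __shared__ int shared[10];   // one copy per block, in shared memory
    change(shared);
    __syncthreads();             // reached by every thread in the block
    if (threadIdx.x < 10)
        out[threadIdx.x] = shared[threadIdx.x];  // keeps `shared` in use
}

// Hypothetical host helper: launches the kernel and copies 10 values back.
bool run_MyK(int host_out[10])
{
    int *dev_out = nullptr;
    if (cudaMalloc(&dev_out, 10 * sizeof(int)) != cudaSuccess)
        return false;
    MyK<<<1, 20>>>(dev_out);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(host_out, dev_out, 10 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return err == cudaSuccess;
}

int main()
{
    int host_out[10];
    if (!run_MyK(host_out)) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < 10; ++i)
        printf("shared[%d] = %d\n", i, host_out[i]);
    return 0;
}
```

With the array kept in use like this, a breakpoint placed after `__syncthreads()` has a better chance of still resolving `shared`, although a debug build may also be needed.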

残花月 2024-11-22 22:48:07

Calling __syncthreads() from some but not all threads in a block can cause a deadlock, so make sure it is not reached only by the threads with threadIdx.x < 10.
As previously mentioned, you are not using shared memory here.
The compiler is clever: if you do not use the value afterwards, its memory location can become invalid.
Try returning the value from your device function. That should work fine, especially if you move or remove __syncthreads().
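
The points above can be sketched as follows, under stated assumptions: `change_and_return`, `MyK_return`, `result`, and `run_MyK_return` are hypothetical names introduced for illustration. The device function returns the value it wrote, `__syncthreads()` sits outside the `threadIdx.x < 10` branch so all 20 threads reach it, and threads 10 to 19 read elements written by threads 0 to 9, which is the case where the barrier actually matters.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Variant of change() that also returns the value written
// (or -1 for threads that write nothing).
__device__ int change_and_return(int buf[])
{
    if (threadIdx.x < 10) {
        buf[threadIdx.x] = threadIdx.x;
        return buf[threadIdx.x];
    }
    return -1;
}

// `result` is a hypothetical 20-element output buffer.
__global__ void MyK_return(int *result)
{
    __shared__ int buf[10];
    int v = change_and_return(buf);

    // The barrier is outside any divergent branch: every thread calls it.
    // Avoid: if (threadIdx.x < 10) { ...; __syncthreads(); }
    __syncthreads();

    // Threads 0..9 store their own value; threads 10..19 read a value
    // written by another thread, which is safe only after the barrier.
    result[threadIdx.x] = (threadIdx.x < 10) ? v : buf[threadIdx.x - 10];
}

// Hypothetical host helper: launches 1 block of 20 threads.
bool run_MyK_return(int host_out[20])
{
    int *dev_out = nullptr;
    if (cudaMalloc(&dev_out, 20 * sizeof(int)) != cudaSuccess)
        return false;
    MyK_return<<<1, 20>>>(dev_out);
    cudaError_t err = cudaMemcpy(host_out, dev_out, 20 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return err == cudaSuccess;
}

int main()
{
    int host_out[20];
    if (!run_MyK_return(host_out)) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < 20; ++i)
        printf("result[%d] = %d\n", i, host_out[i]);
    return 0;
}
```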

仅此而已 2024-11-22 22:48:07

Is that the actual code, or did you omit __shared__ from the buffer declaration?

Also keep in mind that __device__ functions get inlined by the compiler, so the debugger can stop only at certain points in the whole process.
Try launching the kernel with a thread count that is a multiple of 16 or 32; otherwise you are not running a full SP, and that might trick the debugger.
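
One way to act on this advice is sketched below; treat it as an assumption-laden illustration rather than a confirmed fix. `__noinline__` asks the compiler not to inline the device function, the block is launched with 32 threads, and the build command in the first comment requests device debug information. The names `MyK32`, `out`, and `run_MyK32` are hypothetical.

```cuda
// Possible debug build (assumption): nvcc -G -g example.cu -o example
#include <cstdio>
#include <cuda_runtime.h>

// __noinline__ is a hint that keeps change() as a separate function,
// which can make stepping into it in the debugger easier.
__device__ __noinline__ void change(int shared[])
{
    if (threadIdx.x < 10)
        shared[threadIdx.x] = threadIdx.x;
}

// `out` is a hypothetical output buffer added for illustration.
__global__ void MyK32(int *out)
{
    __shared__ int shared[10];
    change(shared);
    __syncthreads();
    if (threadIdx.x < 10)
        out[threadIdx.x] = shared[threadIdx.x];
}

// Hypothetical host helper: launches 1 block of 32 threads.
bool run_MyK32(int host_out[10])
{
    int *dev_out = nullptr;
    if (cudaMalloc(&dev_out, 10 * sizeof(int)) != cudaSuccess)
        return false;
    MyK32<<<1, 32>>>(dev_out);
    bool ok = (cudaGetLastError() == cudaSuccess);  // launch errors
    ok = ok && (cudaMemcpy(host_out, dev_out, 10 * sizeof(int),
                           cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(dev_out);
    return ok;
}

int main()
{
    int host_out[10];
    if (!run_MyK32(host_out)) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < 10; ++i)
        printf("shared[%d] = %d\n", i, host_out[i]);
    return 0;
}
```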
