Segmentation fault in cudaMemcpy2D
I have a 2D array dev_histogram stored on the GPU and a 2D array histogram stored on the CPU. I want to copy the contents of dev_histogram into histogram. Below are the relevant parts of my program. I can post the full code as well.
int *dev_histogram; // Array for histogram, GPU
int histogram[SIZE_THETA][SIZE_RHO]; // Array for histogram, CPU
size_t pitch;
histogramSize = sizeof(int) * SIZE_THETA * SIZE_RHO;
cudaMallocPitch((void**)&dev_histogram, &pitch, SIZE_THETA * sizeof(int), SIZE_RHO);
houghTransformation<<<width, height>>>(dev_edges, dev_histogram, pitch, n_pixels, width, height);
// Here I get a Segmentation fault:
cudaMemcpy2D(histogram, pitch, dev_histogram, SIZE_THETA * sizeof(int), SIZE_THETA * sizeof(int), SIZE_RHO * sizeof(int), cudaMemcpyDeviceToHost);
Could you please help me understand how to copy my matrix back? Mostly, I am confused about what to pass as the pitch for my source.
Comments (3)
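Specify SIZE_RHO as the height, not SIZE_RHO * sizeof(int). The height argument of cudaMemcpy2D is a number of rows; only the width and the two pitches are given in bytes.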
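To see why a height given in bytes overruns the host array, here is a minimal host-only sketch of the arithmetic. The helper bytesSpanned and the sizes kSizeTheta and kSizeRho are hypothetical, chosen only for illustration; the question does not state its actual values.

```cpp
#include <cassert>
#include <cstddef>

// Bytes a pitched 2D copy touches in one buffer: (height - 1) full strides
// plus a final row of `width` bytes. `height` is a row count, not bytes.
std::size_t bytesSpanned(std::size_t pitch, std::size_t width, std::size_t height) {
    assert(width <= pitch);  // a row can never be wider than its stride
    if (height == 0) return 0;
    return (height - 1) * pitch + width;
}

// Hypothetical sizes for illustration only (not taken from the question).
constexpr std::size_t kSizeTheta = 180;
constexpr std::size_t kSizeRho = 100;
constexpr std::size_t kRowBytes = kSizeTheta * sizeof(int);  // bytes per packed row
constexpr std::size_t kHostBytes = kRowBytes * kSizeRho;     // size of the host array
```

With a row count of kSizeRho the copy fits the host array exactly, while a "height" of kSizeRho * sizeof(int) spans sizeof(int) times as many bytes and writes past its end.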
In the CUDA Toolkit reference manual you can see that the pitch returned by cudaMallocPitch is the allocated width, in bytes, of each row of the 2D array. Your dev_histogram has an actual row width equal to pitch and a height equal to the height you specified. Each row has pitch bytes allocated, but only the requested width (here SIZE_THETA * sizeof(int) bytes) holds valid data.
In the same document the prototype for cudaMemcpy2D is
cudaError_t cudaMemcpy2D(void *dst, size_t dpitch, const void *src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind)
Here dst is your array on the host, dpitch is the row width in bytes of the destination array (histogram), and spitch is the row width in bytes of the source array (dev_histogram), which is the pitch returned by cudaMallocPitch. width is the number of bytes to copy per row and height is the number of rows.
You must call it like this then:
cudaMemcpy2D(histogram, SIZE_THETA * sizeof(int), dev_histogram, pitch, SIZE_THETA * sizeof(int), SIZE_RHO, cudaMemcpyDeviceToHost);
Edit: after ArchaeaSoftware's answer I noticed that the height really is the number of rows; a height in bytes doesn't make sense. I updated the answer because you still need to change the pitches.
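The roles of dpitch, spitch, width and height can be illustrated with a host-only stand-in for a pitched copy. This is a sketch under assumptions, not the CUDA implementation: copy2D and unpadExample are hypothetical helpers that only mimic the row-by-row behaviour described above, using plain memcpy on host memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only stand-in for the row-by-row behaviour of a pitched 2D copy.
// dpitch, spitch and width are in bytes; height is a number of rows.
void copy2D(void* dst, std::size_t dpitch, const void* src, std::size_t spitch,
            std::size_t width, std::size_t height) {
    assert(width <= dpitch && width <= spitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        // Each row starts one stride after the previous one in its own buffer.
        std::memcpy(d + row * dpitch, s + row * spitch, width);
    }
}

// Builds a padded source of rows x cols ints with a stride of `spitch` bytes,
// copies it into a tightly packed vector, and returns that vector.
std::vector<int> unpadExample(std::size_t rows, std::size_t cols, std::size_t spitch) {
    std::vector<unsigned char> padded(rows * spitch, 0xFF);  // 0xFF marks padding
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            int v = static_cast<int>(r * cols + c);
            std::memcpy(padded.data() + r * spitch + c * sizeof(int), &v, sizeof(int));
        }
    }
    std::vector<int> packed(rows * cols);
    copy2D(packed.data(), cols * sizeof(int),  // destination: packed rows
           padded.data(), spitch,              // source: padded rows
           cols * sizeof(int), rows);          // width in bytes, height in rows
    return packed;
}
```

The padded source plays the part of dev_histogram and the packed vector plays the part of histogram: the source stride is the larger pitch, the destination stride is just the row width, and the padding bytes are never copied.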
Often when storing data in contiguous memory you want each section of memory to have a dimension that is a multiple of a storage unit, so that the data can be read efficiently. For example, rather than reading 4 individual bytes in a row you might read one 32-bit word. You do it for efficiency. Look up memory alignment.
For the same reason you want certain arrays to have a size of pitch*height, where pitch is the width rounded up to the nearest multiple of whatever storage unit you are using. If your array is 31*5 bytes, then you use a pitch of 32 but a width of 31. Eight 32-bit reads per row are expected to be faster than thirty-one 1-byte reads. You discard the extra "padding" byte.
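The rounding and the resulting addressing can be sketched as follows. The helpers roundUpPitch and elementOffset are hypothetical, and the alignment of 32 bytes only matches the 31*5 example above; the pitch that cudaMallocPitch actually returns is chosen by the runtime and should be used as reported.

```cpp
#include <cassert>
#include <cstddef>

// Rounds `width` up to the next multiple of `alignment` (both in bytes).
std::size_t roundUpPitch(std::size_t width, std::size_t alignment) {
    assert(alignment > 0);
    return (width + alignment - 1) / alignment * alignment;
}

// Byte offset of element (row, col) in a pitched buffer whose elements are
// `elemSize` bytes wide: rows are `pitch` bytes apart, not `width` bytes.
std::size_t elementOffset(std::size_t row, std::size_t col,
                          std::size_t pitch, std::size_t elemSize) {
    return row * pitch + col * elemSize;
}
```

For the 31*5 example, roundUpPitch(31, 32) gives a pitch of 32, so the buffer occupies 5 * 32 bytes and the element at row 2, column 3 lives at byte offset 2 * 32 + 3.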
You probably want to set pitch = width. The reason for your seg fault is that you haven't initialised it. Check that width and height are compatible with your GPU specifications for thread block size.