大型矩阵的 CUDA 矩阵乘法中断

发布于 2024-09-29 21:36:51 字数 3513 浏览 8 评论 0原文

我有以下矩阵乘法代码，使用 CUDA 3.2 和 VS 2008 实现。我在 Windows Server 2008 r2 enterprise 上运行。我正在运行 Nvidia GTX 480。以下代码可以在“宽度”（矩阵宽度）值达到大约 2500 左右的情况下正常工作。

int size = Width*Width*sizeof(float);
float* Md, *Nd, *Pd;
cudaError_t err = cudaSuccess;

//Allocate Device Memory for M, N and P
err = cudaMalloc((void**)&Md, size);
err = cudaMalloc((void**)&Nd, size);
err = cudaMalloc((void**)&Pd, size);

//Copy Matrix from Host Memory to Device Memory
err = cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
err = cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

//Setup the execution configuration
dim3 dimBlock(TileWidth, TileWidth, 1);
dim3 dimGrid(ceil((float)(Width)/TileWidth), ceil((float)(Width)/TileWidth), 1);

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

//Free Device Memory
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);

当我将“宽度”设置为 3000 或更大时，黑屏后出现以下错误： screenshot

我在网上查了一下，发现有些人有这个问题，因为看门狗在挂起超过一段时间后正在杀死内核5秒。我尝试在注册表中编辑“TdrDelay”，这延迟了黑屏和出现相同错误之前的时间。所以我得出结论这不是我的问题。

我调试了我的代码，发现这一行是罪魁祸首：

err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

这是我在调用矩阵乘法内核函数后用来从设备返回结果集的行。到目前为止，一切似乎都进展顺利。我相信我分配的内存是正确的，但无法弄清楚为什么会发生这种情况。我想也许我的卡上没有足够的内存来执行此操作，但是 cudaMalloc 不应该返回错误吗？（我在调试时确认没有）。

任何想法/帮助将不胜感激！...非常感谢大家！

内核代码：

//Matrix Multiplication Kernel - Multi-Block Implementation
__global__ void MatrixMultiplicationMultiBlock_Kernel (float* Md, float* Nd, float* Pd, int Width) 
{
int TileWidth = blockDim.x;

//Get row and column from block and thread ids
int Row = (TileWidth*blockIdx.y) + threadIdx.y;
int Column = (TileWidth*blockIdx.x) + threadIdx.x;

//Pvalue store the Pd element that is computed by the thread
float Pvalue = 0;

for (int i = 0; i < Width; ++i)
{
    float Mdelement = Md[Row * Width + i];
    float Ndelement = Nd[i * Width + Column];
    Pvalue += Mdelement * Ndelement;
}

//Write the matrix to device memory each thread writes one element
Pd[Row * Width + Column] = Pvalue;
}

我还有另一个使用共享内存的函数，它也给出了相同的错误：

调用：

            MatrixMultiplicationSharedMemory_Kernel<<<dimGrid, dimBlock, sizeof(float)*TileWidth*TileWidth*2>>>(Md, Nd, Pd, Width);

内核代码：

 //Matrix Multiplication Kernel - Shared Memory Implementation
 __global__ void MatrixMultiplicationSharedMemory_Kernel (float* Md, float* Nd, float* Pd, int Width) 
 {
int TileWidth = blockDim.x;

//Initialize shared memory
extern __shared__ float sharedArrays[];
float* Mds = (float*) &sharedArrays;
float* Nds = (float*) &Mds[TileWidth*TileWidth];

int tx = threadIdx.x;
int ty = threadIdx.y;

//Get row and column from block and thread ids
int Row = (TileWidth*blockIdx.y) + ty;
int Column = (TileWidth*blockIdx.x) + tx;
float Pvalue = 0;

//For each tile, load the element into shared memory
for( int i = 0; i < ceil((float)Width/TileWidth); ++i)
{
    Mds[ty*TileWidth+tx] = Md[Row*Width + (i*TileWidth + tx)];
    Nds[ty*TileWidth+tx] = Nd[(ty + (i * TileWidth))*Width + Column]; 

    __syncthreads();

    for( int j = 0; j < TileWidth; ++j)
    {
        Pvalue += Mds[ty*TileWidth+j] * Nds[j*TileWidth+tx];
    }

    __syncthreads();
}

//Write the matrix to device memory each thread writes one element
Pd[Row * Width + Column] = Pvalue;
}

原文

I have the following matrix multiplication code, implemented using CUDA 3.2 and VS 2008. I am running on Windows server 2008 r2 enterprise. I am running a Nvidia GTX 480. The following code works fine with values of "Width" (Matrix width) up to about 2500 or so.

int size = Width*Width*sizeof(float);
float* Md, *Nd, *Pd;
cudaError_t err = cudaSuccess;

//Allocate Device Memory for M, N and P
err = cudaMalloc((void**)&Md, size);
err = cudaMalloc((void**)&Nd, size);
err = cudaMalloc((void**)&Pd, size);

//Copy Matrix from Host Memory to Device Memory
err = cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
err = cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

//Setup the execution configuration
dim3 dimBlock(TileWidth, TileWidth, 1);
dim3 dimGrid(ceil((float)(Width)/TileWidth), ceil((float)(Width)/TileWidth), 1);

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

//Free Device Memory
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);

When I set the "Width" to 3000 or greater, I get the following error after a black screen:
screenshot

I looked online and I saw that some people has this issue because the watchdog was killing the kernel after it hangs for more than 5 seconds. I tried editing the "TdrDelay" in the registry and this delayed the time before the black screen and same error appeared. So I concluded this was not my issue.

I debugged into my code and found this line to be the culprit:

err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

This is what I use to return my result set from the device after my matrix multiplication kernel function is called. Everything up until this point seems to run fine. I believe I am allocating memory correctly and cannot figure out why this is happening. I thought maybe I didn't have enough memory on my card for this but then shouldn't cudaMalloc have returned an error? (I confirmed it didn't while debugging).

Any ideas/assistance would be greatly appreciated!... Thanks a lot guys!!

Kernel code:

//Matrix Multiplication Kernel - Multi-Block Implementation
__global__ void MatrixMultiplicationMultiBlock_Kernel (float* Md, float* Nd, float* Pd, int Width) 
{
int TileWidth = blockDim.x;

//Get row and column from block and thread ids
int Row = (TileWidth*blockIdx.y) + threadIdx.y;
int Column = (TileWidth*blockIdx.x) + threadIdx.x;

//Pvalue store the Pd element that is computed by the thread
float Pvalue = 0;

for (int i = 0; i < Width; ++i)
{
    float Mdelement = Md[Row * Width + i];
    float Ndelement = Nd[i * Width + Column];
    Pvalue += Mdelement * Ndelement;
}

//Write the matrix to device memory each thread writes one element
Pd[Row * Width + Column] = Pvalue;
}

I also have this other function that uses shared memory, and it also gives the same error:

Call:

            MatrixMultiplicationSharedMemory_Kernel<<<dimGrid, dimBlock, sizeof(float)*TileWidth*TileWidth*2>>>(Md, Nd, Pd, Width);

Kernel code:

 //Matrix Multiplication Kernel - Shared Memory Implementation
 __global__ void MatrixMultiplicationSharedMemory_Kernel (float* Md, float* Nd, float* Pd, int Width) 
 {
int TileWidth = blockDim.x;

//Initialize shared memory
extern __shared__ float sharedArrays[];
float* Mds = (float*) &sharedArrays;
float* Nds = (float*) &Mds[TileWidth*TileWidth];

int tx = threadIdx.x;
int ty = threadIdx.y;

//Get row and column from block and thread ids
int Row = (TileWidth*blockIdx.y) + ty;
int Column = (TileWidth*blockIdx.x) + tx;
float Pvalue = 0;

//For each tile, load the element into shared memory
for( int i = 0; i < ceil((float)Width/TileWidth); ++i)
{
    Mds[ty*TileWidth+tx] = Md[Row*Width + (i*TileWidth + tx)];
    Nds[ty*TileWidth+tx] = Nd[(ty + (i * TileWidth))*Width + Column]; 

    __syncthreads();

    for( int j = 0; j < TileWidth; ++j)
    {
        Pvalue += Mds[ty*TileWidth+j] * Nds[j*TileWidth+tx];
    }

    __syncthreads();
}

//Write the matrix to device memory each thread writes one element
Pd[Row * Width + Column] = Pvalue;
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

扭转时空 2024-10-06 21:36:51

控制 WDDM 超时

问题实际上是内核而不是 cudaMemcpy()。当您启动内核时，GPU 会关闭并与 CPU 异步执行工作，因此只有当您与 GPU 同步时，您才必须等待工作完成。 cudaMemcpy() 涉及隐式同步，因此这就是您看到问题的地方。

您可以通过在内核之后调用 cudaThreadSynchronize() 来仔细检查这一点，问题将出现在 cudaThreadSynchronize() 而不是 cudaMemcpy()< /代码>。

更改 TDR 超时后，您是否重新启动了计算机？不幸的是，需要重新启动 Windows 才能更改 TDR 设置。此 Microsoft 文档对可用的完整设置进行了相当详细的描述。

内核问题

在这种情况下，问题实际上并不是 WDDM 超时。内核中存在您需要解决的错误（例如，您应该能够在每次迭代中将 i 增加一个以上）并检查 matrixMul 示例SDK 中的可能有用。顺便说一句，我希望这是一个学习练习，因为实际上，使用 CUBLAS 执行矩阵乘法（为了性能）会更好。

代码中最关键的问题是您使用共享内存而没有实际分配任何内存。在你的内核中，你有：

//Initialize shared memory
extern __shared__ float sharedArrays[];

但是当你启动内核时，你没有指定为每个块分配多少共享内存：

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

<<<<>>>语法实际上需要四个参数，其中第三个和第四个是可选的。第四个是流索引，用于获得计算和数据传输之间的重叠（以及并发内核执行），但第三个参数指定每个块的共享内存量。在这种情况下，我假设您想要在共享内存中存储 TileWidth * TileWidth 浮点数，因此您将使用：

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock, dimBlock.x * dimBlock.x * sizeof(float)>>>(Md, Nd, Pd, Width);

主要问题

正如您提到的在您的评论中，实际的问题是您的矩阵宽度不是块宽度的倍数（和高度，因为它是正方形，这意味着超出末尾的线程将访问超出数组末尾的部分。代码应该处理非-multiple case 或者它应该确保宽度是块大小的倍数，

我应该早点建议这样做，但是运行 cuda-memcheck 来检查像这样的内存访问冲突通常很有用。。

Controlling the WDDM Timeout

The problem is actually the kernel not the cudaMemcpy(). When you launch the kernel the GPU goes off and does the work asynchronously with the CPU, so it's only when you synchronize with the GPU that you have to wait for the work to finish. cudaMemcpy() involves an implicit synchronization, hence that is where you see the problem.

You could double-check this by calling cudaThreadSynchronize() after the kernel and the problem will appear to be on the cudaThreadSynchronize() instead of the cudaMemcpy().

After changing the TDR timeout, did you restart your machine? Unfortunately Windows needs to be restarted to change the TDR settings. This Microsoft document has a fairly good description of the full settings available.

Kernel problems

In this case the problem is not actually the WDDM timeout. There are errors in the kernel which you would need to resolve (for example you should be able to incremement i by more than one on each iteration) and checking out the matrixMul sample in the SDK may be useful. Incidentally, I hope this is a learning exercise since in reality you would be better off (for performance) using CUBLAS to perform matrix multiplication.

The most critical problem in the code is that you are using shared memory without actually allocating any. In your kernel you have:

//Initialize shared memory
extern __shared__ float sharedArrays[];

But when you launch the kernel you do not specify how much shared memory to allocate for each block:

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

The <<<>>> syntax actually takes four arguments where the third and fourth are optional. The fourth is the stream index which is used to get overlap between compute and data transfer (and for concurrent kernel execution) but the third argument specifies the amount of shared memory per block. In this case I assume you want to store TileWidth * TileWidth floats in the shared memory, so you would use:

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock, dimBlock.x * dimBlock.x * sizeof(float)>>>(Md, Nd, Pd, Width);

The main problem

As you mention in your comment, the actual problem was that your matrix width was not a multiple of the block width (and height since it is square, meaning the threads beyond the end would access beyond the end of the array. The code should either handle the non-multiple case or it should ensure that the width is a multiple of the block size.

I should have suggested this earlier, but it is often useful to run cuda-memcheck to check for memeory access violations like this.

回复收藏 0 原文