CUDA performance doubts
Since I didn't get a response on the CUDA forum, I'll try it here:
After writing a few programs in CUDA, I have now started measuring their effective bandwidth. However, I am getting some strange results. For example, in the following code, which can sum all the elements of a vector (regardless of its size), the unrolled code and the "normal" code seem to give the same median bandwidth (around 3000 Gb/s).
I don't know whether I am doing something wrong (as far as I can tell, the program works fine), but from what I have read so far, the unrolled code should have a higher bandwidth.
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#include <math.h>
#define elements 1000
#define blocksize 16
__global__ void vecsumkernel(float *input, float *output, int nelements){
    __shared__ float psum[blocksize];
    int tid = threadIdx.x;
    if(tid + blockDim.x * blockIdx.x < nelements)
        psum[tid] = input[tid + blockDim.x * blockIdx.x];
    else
        psum[tid] = 0.0f;
    __syncthreads();

    //WITHOUT UNROLL
    int stride;
    for(stride = blockDim.x / 2; stride > 0; stride >>= 1){
        if(tid < stride)
            psum[tid] += psum[tid + stride];
        __syncthreads();
    }
    if(tid == 0)
        output[blockIdx.x] = psum[0];

    //WITH UNROLL
    /*
    if(blocksize >= 512 && tid < 256) psum[tid] += psum[tid + 256]; __syncthreads();
    if(blocksize >= 256 && tid < 128) psum[tid] += psum[tid + 128]; __syncthreads();
    if(blocksize >= 128 && tid < 64)  psum[tid] += psum[tid + 64];  __syncthreads();
    if (tid < 32) {
        if (blocksize >= 64) psum[tid] += psum[tid + 32];
        if (blocksize >= 32) psum[tid] += psum[tid + 16];
        if (blocksize >= 16) psum[tid] += psum[tid + 8];
        if (blocksize >= 8)  psum[tid] += psum[tid + 4];
        if (blocksize >= 4)  psum[tid] += psum[tid + 2];
        if (blocksize >= 2)  psum[tid] += psum[tid + 1];
    }
    */
    if(tid == 0)
        output[blockIdx.x] = psum[0];
}
void vecsumv2(float *input, float *output, int nelements){
    dim3 dimBlock(blocksize, 1, 1);
    int i;
    for(i = ((int)ceil((double)(nelements)/(double)blocksize))*blocksize; i > 1; i = (int)ceil((double)i/(double)blocksize)){
        dim3 dimGrid((int)ceil((double)i/(double)blocksize), 1, 1);
        printf("\ni=%d\ndimgrid=%u\n ", i, dimGrid.x);
        vecsumkernel<<<dimGrid, dimBlock>>>(i == ((int)ceil((double)(nelements)/(double)blocksize))*blocksize ? input : output, output, i == ((int)ceil((double)(nelements)/(double)blocksize))*blocksize ? elements : i);
    }
}
void printVec(float *vec, int dim){
    printf("\n{");
    for(int i = 0; i < dim; i++)
        printf("%f ", vec[i]);
    printf("}\n");
}
int main(){
    cudaEvent_t evstart, evstop;
    cudaEventCreate(&evstart);
    cudaEventCreate(&evstop);
    float *input = (float*)malloc(sizeof(float)*(elements));
    for(int i = 0; i < elements; i++)
        input[i] = (float) i;
    float *output = (float*)malloc(sizeof(float)*elements);
    float *input_d, *output_d;
    cudaMalloc((void**)&input_d, elements*sizeof(float));
    cudaMalloc((void**)&output_d, elements*sizeof(float));
    cudaMemcpy(input_d, input, elements*sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(evstart, 0);
    vecsumv2(input_d, output_d, elements);
    cudaEventRecord(evstop, 0);
    cudaEventSynchronize(evstop);
    float time;
    cudaEventElapsedTime(&time, evstart, evstop);
    printf("\ntime spent:%f\n", time);
    float Bandwidth = ((1000*4*2)/10^9)/time;
    printf("\n Bandwidth:%f Gb/s\n", Bandwidth);
    cudaMemcpy(output, output_d, elements*sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(input_d);
    cudaFree(output_d);
    printf("vector sum");
    printVec(output, 4);
}
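For reference on the Bandwidth line above: the CUDA Best Practices Guide computes effective bandwidth as (bytes read + bytes written) / 1e9, divided by the elapsed time in seconds, and cudaEventElapsedTime reports milliseconds. A minimal sketch of that calculation, written as a hypothetical helper (the function name and the byte counts in the usage note are illustrative assumptions, not a claim about how much data this particular kernel actually moves):

// Effective bandwidth in GB/s from a CUDA event pair:
// BW_effective = (bytes_read + bytes_written) / 1e9 / seconds.
static float effectiveBandwidthGBs(cudaEvent_t start, cudaEvent_t stop,
                                   double bytesRead, double bytesWritten){
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // milliseconds, not seconds
    return (float)((bytesRead + bytesWritten) / 1e9 / (ms / 1000.0));
}
// e.g. effectiveBandwidthGBs(evstart, evstop,
//                            elements * sizeof(float),   // assumed bytes read
//                            elements * sizeof(float));  // assumed bytes written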
Comments (4)
Your unrolled code has a lot of branching in it. I count ten additional branches. Typically branching within a warp on a GPU is expensive because all threads in the warp end up waiting on the branch (divergence).
See here for more info on warp divergence:
http://forums.nvidia.com/index.php?showtopic=74842
Have you tried using a profiler to see what's going on?
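One common way to avoid those runtime branches is to make the block size a compile-time constant, for example via a template parameter, so every blocksize >= N test is resolved by the compiler instead of being executed by the warp. A rough sketch under that assumption (the kernel name is made up, the block size must be a power of two, and this is not a drop-in replacement for the code in the question):

// BLOCK is known at compile time, so each 'if (BLOCK >= N)' below is folded
// away by the compiler and never becomes a runtime branch.
template <unsigned int BLOCK>
__global__ void vecsumkernel_t(const float *input, float *output, int nelements){
    __shared__ float psum[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * BLOCK + tid;
    psum[tid] = (i < nelements) ? input[i] : 0.0f;
    __syncthreads();
    if (BLOCK >= 512) { if (tid < 256) psum[tid] += psum[tid + 256]; __syncthreads(); }
    if (BLOCK >= 256) { if (tid < 128) psum[tid] += psum[tid + 128]; __syncthreads(); }
    if (BLOCK >= 128) { if (tid <  64) psum[tid] += psum[tid +  64]; __syncthreads(); }
    if (BLOCK >=  64) { if (tid <  32) psum[tid] += psum[tid +  32]; __syncthreads(); }
    if (BLOCK >=  32) { if (tid <  16) psum[tid] += psum[tid +  16]; __syncthreads(); }
    if (BLOCK >=  16) { if (tid <   8) psum[tid] += psum[tid +   8]; __syncthreads(); }
    if (BLOCK >=   8) { if (tid <   4) psum[tid] += psum[tid +   4]; __syncthreads(); }
    if (BLOCK >=   4) { if (tid <   2) psum[tid] += psum[tid +   2]; __syncthreads(); }
    if (BLOCK >=   2) { if (tid <   1) psum[tid] += psum[tid +   1]; __syncthreads(); }
    if (tid == 0) output[blockIdx.x] = psum[0];
}
// launched, for instance, as vecsumkernel_t<blocksize><<<dimGrid, dimBlock>>>(input_d, output_d, n);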
3000 Gb/s does not make sense. The maximum bus speed of PCIe is 8 GB/s in each direction.
Take a look at the paper Parallel Prefix Sum to gain insight into how to speed up your implementation.
Also consider that the Thrust library already has this implemented in its reductions module.
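On the Thrust suggestion: a device-wide sum really is a one-liner with thrust::reduce. A minimal, self-contained sketch (the 1000-element 0..999 fill simply mirrors the question's input; the expected sum is 499500):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main(){
    thrust::device_vector<float> d_vec(1000);        // data lives on the GPU
    thrust::sequence(d_vec.begin(), d_vec.end());    // fill with 0, 1, 2, ...
    float sum = thrust::reduce(d_vec.begin(), d_vec.end(),
                               0.0f, thrust::plus<float>());
    printf("sum = %f\n", sum);                       // 499500.000000
    return 0;
}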
Your not-unrolled code is invalid. For stride < 32, some threads of the same warp enter the for-loop, while the others do not. Therefore, some (but not all) threads of the warp hit the __syncthreads(). The CUDA specification says that when that happens, the behaviour is undefined. It can happen that the warp gets out of sync and some threads already begin loading the next chunk of data, halting on the next instance of __syncthreads() while the previous threads are still stuck in your previous loop. I am not sure, though, whether that is what you are going to face in this particular case.
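For completeness, the pattern the commented-out unrolled branch is aiming at is the "warp-synchronous" tail from NVIDIA's reduction material: once only the first 32 threads are doing work, the __syncthreads() calls are dropped and the shared array is accessed through a volatile pointer so the intermediate stores are not optimized away. A sketch of just that tail, valid on the GPU generations this thread dates from (newer architectures need explicit __syncwarp() between the steps), and assuming a block of at least 64 threads:

// Warp-level tail of a shared-memory reduction: only the first warp is active,
// so no block-wide barrier is used; 'volatile' keeps the dependent loads and
// stores from being cached in registers.
__device__ void warpReduce(volatile float *psum, unsigned int tid){
    psum[tid] += psum[tid + 32];
    psum[tid] += psum[tid + 16];
    psum[tid] += psum[tid +  8];
    psum[tid] += psum[tid +  4];
    psum[tid] += psum[tid +  2];
    psum[tid] += psum[tid +  1];
}
// called from the kernel as: if (tid < 32) warpReduce(psum, tid);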
I see you're doing a reduction sum in your kernel. Here's a good presentation by NVIDIA on optimizing reductions on GPUs. You'll notice that the same code that was giving a throughput of 2 GB/s is optimized to 63 GB/s in that guide.
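One of the larger single wins in that presentation is the "first add during global load" step: each thread sums two global elements while filling shared memory, which halves the number of blocks. A rough sketch of that load phase, meant to illustrate the idea rather than drop into vecsumkernel as-is (it assumes the grid is sized for nelements / (2 * blockDim.x) blocks):

// Each thread reads two elements and writes their sum into shared memory.
unsigned int tid = threadIdx.x;
unsigned int idx = blockIdx.x * (blockDim.x * 2) + tid;
float v = 0.0f;
if (idx < nelements)              v  = input[idx];
if (idx + blockDim.x < nelements) v += input[idx + blockDim.x];
psum[tid] = v;
__syncthreads();
// ...the in-block reduction over psum[] then proceeds as before.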