CUDA - what is this loop doing?
Hey,
I've seen this example kernel on a website:

__global__ void loop1( int N, float alpha, float* x, float* y ) {
    int i;
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;
    for(i=i0; i<N; i+=blockDim.x*gridDim.x) {
        y[i] = alpha*x[i] + y[i];
    }
}

It computes the equivalent of this loop in C:

for(i=0; i<N; i++) {
    y[i] = alpha*x[i] + y[i];
}

Surely the for loop inside the kernel isn't necessary? You could just do y[i0] = alpha*x[i0] + y[i0]
and remove the for loop altogether.
I'm just curious as to why it's there and what its purpose is. This is assuming a kernel call such as loop1<<<64,256>>>,
so presumably gridDim.x = 64.
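
In other words, something like this simplified kernel is what I have in mind; just a sketch, assuming the launch supplies at least N threads in total (the kernel name and the bounds check are my own additions, not from the website):

__global__ void loop1_noloop( int N, float alpha, float* x, float* y ) {
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;
    // One element per thread; the guard makes surplus threads do nothing
    // when N is not an exact multiple of blockDim.x*gridDim.x.
    if (i0 < N) {
        y[i0] = alpha*x[i0] + y[i0];
    }
}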
Replies (3)
You need the for loop in the kernel if your vector has more entries than you have started threads. If possible, it is of course more efficient to start enough threads.
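
For example, a host-side sketch of "start enough threads", assuming a one-element-per-thread kernel with an i0 < N guard such as loop1_noloop above (the helper name and the block size of 256 are illustrative):

// Round the block count up so that blocksPerGrid*threadsPerBlock >= N;
// the kernel's i0 < N guard then discards the few surplus threads.
void launch_saxpy( int N, float alpha, float* d_x, float* d_y ) {
    int threadsPerBlock = 256;
    int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock;
    loop1_noloop<<<blocksPerGrid, threadsPerBlock>>>(N, alpha, d_x, d_y);
}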
Interesting kernel. The loop inside the kernel is necessary because N is greater than the total number of threads, which is 16 384 (blockDim.x*gridDim.x), but I think it's not good practice to do it (the whole point of CUDA is to use the SIMT concept). According to the CUDA Programming Guide you can have at most 65535 thread blocks with one kernel. Furthermore, starting from Compute Capability 2.x (Fermi) you can have at most 1024 threads per block (512 before Fermi). Also, you can (if possible) separate the code into multiple (sequential) kernels.
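
One way to read the "multiple (sequential) kernels" suggestion; a sketch under the assumption that each launch covers a contiguous chunk of the vector (the offset parameter, chunk size, and kernel names are illustrative, not from the answer):

__global__ void saxpy_chunk( int N, int offset, float alpha, float* x, float* y ) {
    int i = offset + blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) {
        y[i] = alpha*x[i] + y[i];
    }
}

void saxpy_in_chunks( int N, float alpha, float* d_x, float* d_y ) {
    int threadsPerBlock = 256;
    int maxBlocks       = 65535;                       // 1D grid limit cited above
    int chunk           = maxBlocks * threadsPerBlock; // elements per launch
    // Launches issued on the same stream execute one after another,
    // so the chunks are processed sequentially.
    for (int offset = 0; offset < N; offset += chunk) {
        int elems  = (N - offset < chunk) ? (N - offset) : chunk;
        int blocks = (elems + threadsPerBlock - 1) / threadsPerBlock;
        saxpy_chunk<<<blocks, threadsPerBlock>>>(N, offset, alpha, d_x, d_y);
    }
}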
Much as we would like to believe that CUDA GPUs have infinite execution resources, they do not, and authors of highly optimized code are finding that unrolled for loops, often with fixed numbers of blocks, give the best performance. Makes for painful coding, but optimized CPU code is also pretty painful.
By the way, a commenter mentioned that this code would have coalescing problems, and I don't see why. If the base addresses are correctly aligned (64B, since those are floats), all of the memory transactions by this code will be coalesced, provided the threads/block is also divisible by 64.
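
A sketch of what "unrolled for loops, often with fixed numbers of blocks" can look like for this kernel; the unroll factor of 4 and the fixed <<<64,256>>> launch are assumptions for illustration, not a statement about what those authors actually ship:

__global__ void saxpy_unrolled( int N, float alpha, float* x, float* y ) {
    int stride = blockDim.x * gridDim.x;
    int i      = blockIdx.x*blockDim.x + threadIdx.x;
    // Main loop: four independent elements per iteration, giving the
    // hardware independent loads/stores to overlap.
    for (; i + 3*stride < N; i += 4*stride) {
        y[i]            = alpha*x[i]            + y[i];
        y[i +   stride] = alpha*x[i +   stride] + y[i +   stride];
        y[i + 2*stride] = alpha*x[i + 2*stride] + y[i + 2*stride];
        y[i + 3*stride] = alpha*x[i + 3*stride] + y[i + 3*stride];
    }
    // Tail loop: whatever is left once a full unrolled step no longer fits.
    for (; i < N; i += stride) {
        y[i] = alpha*x[i] + y[i];
    }
}

// Typical fixed launch, same shape as in the question:
// saxpy_unrolled<<<64,256>>>(N, alpha, d_x, d_y);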