A question about CUDA
I am doing research on GPU programming and want to learn more about CUDA. I've already read a lot about it (from Wikipedia, Nvidia, and other references), but I still have some questions:
Is the following description of the architecture accurate? A GPU has multiprocessors, every multiprocessor has streaming processors, and every streaming processor can run blocks of threads at the same time.
All references state that the minimum number of threads to create inside one block is 32... why is that?
I have an ATI Radeon video card, and I was able to compile a simple CUDA program without emulation mode! I thought I could only compile and run CUDA programs on supported Nvidia VGAs. Can someone please explain?
3 Answers
1 - This is true of NVIDIA GPUs.
2 - This is a constraint of the hardware design.
3 - Compilation is done on the CPU, so you can compile your program much like you would cross-compile for PPC on x86.
If you want to run GPU programs on an ATI card, I suggest you look at OpenCL or AMD Stream.
A CUDA thread is very lightweight and can be scheduled/stalled with very little penalty. This is unlike a CPU thread which has a lot of overhead to switch in and out of execution. As a result, CPUs are great for task parallelism and GPUs will excel at data parallelism.
In the CUDA architecture a (NVIDIA) GPU has "Streaming Multiprocessors" (SMs), each of which will execute a block of threads. Each SM has a set of Stream Processors (SPs), each of which will be executing instructions for one thread at any given moment (cycle).
Actually the minimum number of threads inside a block is one. If you have just one thread per block, your code will execute correctly. However, it is far more efficient to set up a block such that it has a multiple of 32 threads. This is due to the way the hardware schedules operations across a "warp" which is 32 threads.
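To make the block-size advice concrete, here is a minimal sketch (the kernel name `scale` and the array size are made up for illustration): the block size is chosen as a multiple of the 32-thread warp, the grid is rounded up to cover all elements, and an index guard handles the padded tail.

```cuda
#include <cstdio>

// Hypothetical kernel: each thread scales one element of an array.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard for the padded tail
        data[i] *= factor;
}

int main() {
    const int n = 1000;   // deliberately NOT a multiple of 32
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Pick a block size that is a multiple of the 32-thread warp,
    // then round the grid up so all n elements are covered.
    const int blockSize = 256;                        // 8 warps per block
    const int gridSize = (n + blockSize - 1) / blockSize;
    scale<<<gridSize, blockSize>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

The last block has only 1000 - 3*256 = 232 active threads, but because the block size is warp-aligned, no warp is split across useful and idle work except in that final block.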
You can cross compile your program. You could run it in emulation mode, i.e. the CPU is "emulating" a CUDA GPU, but to run on hardware you would need an NVIDIA GPU (CUDA enabled, anything recent, post 2006 or so, will do).
A high-end current-generation GPU has 240 cores (SPs). You could consider this as executing 240 threads at any given moment, but it is more useful to think of the GPU as executing thousands of threads simultaneously, since the state (context) for many threads is kept loaded at once.
I think it is important to recognise that there are differences between CPU threads and GPU threads. They do have the same name but a GPU thread is lightweight and typically operates on a small subset of the data. Maybe it will help to think of a (set of) CPU thread(s) doing the non-parallel work, then each CPU thread forks into thousands of GPU threads for the data parallel work, then they join back to the CPU thread. Clearly if you can get the CPU thread to do work at the same time as the GPU then that will be even better.
Remember that, unlike a CPU, a GPU is a throughput architecture which means that instead of caches to hide latency, the program should create many threads so that while some threads are waiting for data to return from memory other threads can be executing. I'd recommend watching the "Advanced C for CUDA" talk from the GPU Technology Conference for more information.
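The "create many threads to hide latency" advice above is often expressed as a grid-stride loop. This is a sketch, not the poster's code: far more threads are launched than there are SPs, and while some warps wait on memory, the scheduler runs others.

```cuda
__global__ void saxpy(int n, float a, const float *x, float *y) {
    // Grid-stride loop: each thread starts at its global index and then
    // steps by the total number of launched threads, so any grid size
    // covers any n. Oversubscribing the SPs gives the hardware spare
    // warps to execute while other warps wait on memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}
```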
Yes. Every GPU is an array of vector processors, or SIMD (Single-Instruction, Multiple-Data) processors. Within a single vector of threads -- which can be 32, 64, or some other number depending on the GPU -- each thread executes the same instruction of your kernel in lock step. This basic unit is sometimes called a "warp" or a "wavefront", or sometimes "a SIMD".
32 seems to be typical for NVidia chips, 64 for ATI. IIRC, the number for Intel's Larrabee chip is supposed to be even higher, if that chip is ever manufactured.
At the hardware level, threads are executed in these units, but the programming model lets you have an arbitrary number of threads. If your hardware implements a 32-wide wavefront and your program only requests 1 thread, 31/32 of that hardware unit will sit idle. So creating threads in multiples of 32 (or whatever the width is) is the most efficient way to do things (assuming you can program it so that all the threads do useful work).
What actually happens in the hardware is that there is at least one bit for each thread that indicates whether the thread is "alive" or not. The extra unused threads in a wavefront of 32 will actually be doing calculations, but will not be able to write any of the results to any memory location, so it's just as if they never executed.
When a GPU is rendering graphics for some game, each thread is computing a single pixel (or a sub-pixel if anti-aliasing is turned on), and each triangle being rendered can have an arbitrary number of pixels, right? If the GPU could only render triangles that contained an exact multiple of 32 pixels, it wouldn't work very well.
goger's answer says it all.
Although you didn't specifically ask, it's also very important for your GPU kernels to avoid branches. Since all 32 threads in a wavefront have to execute the same instruction at the same time, what happens when there's an
if .. then .. else
in the code, and some of the threads in the warp want to execute the "then" part while others want to execute the "else" part? The answer is that all 32 threads execute both parts! This will obviously take twice as long, so your kernel will run at half speed.
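A small sketch of that divergence cost, with two hypothetical kernels that compute the same result (clamp negatives to zero, then double). In the first, threads in one warp may take different paths, so the warp serializes both branches; the second makes the choice with a select-style intrinsic instead of control flow.

```cuda
// Divergent: threads in the same warp may take different paths, so the
// warp executes both the "then" and the "else" branch, one after the other.
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] > 0.0f)          // per-thread condition: warp may split
            out[i] = in[i] * 2.0f;
        else
            out[i] = 0.0f;
    }
}

// Branch-free alternative: every thread executes the same instructions;
// the "choice" is made with arithmetic instead of control flow.
__global__ void branchless(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaxf(in[i], 0.0f) * 2.0f;  // same result, no divergence
}
```

Note that the `i < n` bounds check is also a branch, but a cheap one: only the warp straddling the end of the array diverges on it.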