Question about CUDA 4.0 and the unified memory model
NVIDIA seems to be touting that CUDA 4.0 allows programmers to use a unified memory model between the CPU and GPU. This is not going to replace the need to manage memory manually on the GPU and CPU for best performance, but will it allow for simpler implementations that can be tested, proven correct, and only then optimised (by managing GPU and CPU memory by hand)? I'd like to hear comments or opinions :)
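For context, here is the manual pattern the question refers to: explicit allocation and explicit copies through the CUDA runtime API (a minimal sketch; error checking and the actual kernel launch are omitted, and the buffer size is arbitrary).

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        const size_t n = 1 << 20;
        float *h_buf = (float *)malloc(n * sizeof(float));
        float *d_buf = NULL;

        /* Explicitly allocate device memory and copy the input over. */
        cudaMalloc((void **)&d_buf, n * sizeof(float));
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);

        /* ... launch kernels that read/write d_buf here ... */

        /* Copy the result back and release both allocations. */
        cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_buf);
        free(h_buf);
        return 0;
    }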
Comments (3)
From what I read, the important difference is that if you have 2 or more GPUs, you will be able to transfer memory from GPU1 to GPU2 without touching host RAM. You will also be able to control 2 GPUs with only one thread on the host.
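A minimal sketch of what those two points look like with the CUDA 4.0 runtime API. It assumes two peer-capable GPUs in the system; error checking is omitted and the buffer size is arbitrary.

    #include <cuda_runtime.h>

    int main(void) {
        const size_t bytes = 1 << 20;
        float *d0 = NULL, *d1 = NULL;

        /* A single host thread drives both GPUs by switching the current device. */
        cudaSetDevice(0);
        cudaMalloc((void **)&d0, bytes);
        cudaSetDevice(1);
        cudaMalloc((void **)&d1, bytes);

        /* If the hardware allows it, let the current device (1) access device 0. */
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 1, 0);
        if (canAccess)
            cudaDeviceEnablePeerAccess(0, 0);

        /* Copy straight from GPU 0's memory to GPU 1's, without staging
           through host RAM when peer access is enabled. */
        cudaMemcpyPeer(d1, 1, d0, 0, bytes);

        cudaFree(d1);
        cudaSetDevice(0);
        cudaFree(d0);
        return 0;
    }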
Hmmm, that seems like big news! The thrust library, built by NVIDIA's own engineers, already gives you a taste of this: you can move data from RAM to the GPU's DRAM with a mere = sign (no need to call cudaMalloc, cudaMemcpy, and the like). So thrust makes CUDA-C feel more like 'just C'.
Maybe they'll integrate this into the CUDA API in the future. Note that behind the scenes the procedure will be the same (and will remain the same), just hidden from the programmer for convenience. (I don't like that.)
Edit: CUDA 4.0 has been announced, and thrust will be integrated with it.
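As an illustration of that '=' style, a minimal thrust sketch (thrust performs the cudaMalloc/cudaMemcpy work under the hood; the vector size and fill value are arbitrary):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>

    int main(void) {
        /* A vector living in host RAM, initialised to 42. */
        thrust::host_vector<int> h(1 << 20, 42);

        /* A plain '=' moves the data into the GPU's DRAM:
           no explicit cudaMalloc or cudaMemcpy calls. */
        thrust::device_vector<int> d = h;

        /* Copying back to the host works the same way. */
        h = d;
        return 0;
    }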
The "unified" memory only refers to address space. Host and device pointers are allocated from the same 64-bit address space, so any given pointer range is unique across the process. As a result, CUDA can infer from the pointer which device a pointer range "belongs to."
It's important not to confuse address spaces with the ability to read/write those pointer ranges. The CPU will not be able to dereference device memory pointers. I believe that on unified-address-capable platforms, all host allocations will be mapped by default, though, so the GPU(s) will be able to dereference host allocations.
Note: the default driver model on Windows Vista/Windows 7 does not support this feature.
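To make the address-space point concrete, a minimal sketch against the CUDA 4.0 runtime API (error checking omitted, buffer size arbitrary): with unified addressing the runtime can tell from the pointer values alone which side each allocation lives on, so cudaMemcpyDefault needs no explicit direction.

    #include <cuda_runtime.h>

    int main(void) {
        const size_t bytes = 1 << 20;
        float *h_buf = NULL, *d_buf = NULL;

        /* Page-locked host allocation; on a UVA-capable platform it shares
           the same 64-bit address space as device allocations. */
        cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault);
        cudaMalloc((void **)&d_buf, bytes);

        /* cudaMemcpyDefault: the runtime infers host-to-device here... */
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);
        /* ...and device-to-host here, purely from the pointer ranges. */
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDefault);

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }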