Instruction transfer between CPU and GPU

Posted on 2025-01-06 07:23:14

I'm looking for information on how the CPU moves program code to the GPU when working with GPGPU computation. The internet is full of manuals about data transfer, but not about instruction/program loading.

The question is: the program is handled by the CPU, which "configures" the GPU with the adequate flags on each computing unit to perform a given operation. After that, data is transferred and processed. How is that first operation done? How are instructions issued to the GPU? Are the instructions somehow packed to take advantage of the bus bandwidth? I may have ignored something fundamental, so any additional information is welcome.

Comments (1)

-柠檬树下少年和吉他 2025-01-13 07:23:14

There is indeed not much information about it, but you overestimate the effect.

The whole kernel code is loaded onto the GPU only once (at worst once per kernel invocation, but it looks like it is actually once per application run, see below), and is then executed completely on the GPU without any intervention from the CPU. So the whole kernel code is copied in one chunk some time before kernel invocation. To estimate code size: the .cubin size of all GPU code of our home-made MD package (52 kernels, some of which are > 150 lines of code) is only 91 KiB, so it's safe to assume that in pretty much all cases the code transfer time is negligible.

Here is the information I've found in the official docs:

In the CUDA Driver API, the code is loaded onto the device at the time you call the cuModuleLoad function:

The CUDA driver API does not attempt to lazily allocate the resources
needed by a module; if the memory for functions and data
(constant and global) needed by the module cannot be allocated,
cuModuleLoad() fails

Theoretically, you might have to unload the module and then load it again, if you have several modules which use too much constant (or statically allocated global) memory to be loaded simultaneously, but it's quite uncommon, and you usually call cuModuleLoad only once per application launch, right after context creation.
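
A minimal Driver API sketch of that sequence (untested; vecAdd.cubin and the vecAdd kernel name are made up for illustration, error handling mostly omitted) might look like this:

    #include <cuda.h>
    #include <stdio.h>

    int main(void)
    {
        CUdevice   dev;
        CUcontext  ctx;
        CUmodule   mod;
        CUfunction fn;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        /* The kernel machine code is pushed to the device here, in one go;
           the call either loads the whole module or fails outright. */
        if (cuModuleLoad(&mod, "vecAdd.cubin") != CUDA_SUCCESS) {
            fprintf(stderr, "module load failed\n");
            return 1;
        }
        cuModuleGetFunction(&fn, mod, "vecAdd");

        /* ... allocate buffers, set kernel arguments, cuLaunchKernel(fn, ...);
           none of this transfers the kernel code again ... */

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }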

The CUDA Runtime API does not provide any means of controlling module loading/unloading, but it looks like all the necessary code is loaded onto the device during its initialization.
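
For comparison, a Runtime API sketch (the kernel and sizes below are made up for illustration): there is no explicit load call in user code; the kernel image embedded in the executable is put on the device by the runtime itself around context initialization, so the launch does not transfer the code.

    #include <cstdio>

    __global__ void scale(float *x, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));           /* implicitly initializes the context */
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n); /* kernel code is already on the device */
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }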

The OpenCL specs are not as specific as the CUDA Driver API, but the code is most likely (some wild guessing involved) copied to the device at the clBuildProgram stage.
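
A hedged OpenCL host-side sketch of that guess (error checks omitted; the one-kernel source string is made up): the device binary would be produced, and most likely transferred, at the clBuildProgram call rather than at each launch.

    #include <CL/cl.h>

    static const char *src =
        "__kernel void scale(__global float *x, float s) {"
        "    x[get_global_id(0)] *= s;"
        "}";

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id   dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

        cl_context       ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q   = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL); /* compile + (likely) upload */
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        /* ... clSetKernelArg / clEnqueueNDRangeKernel as usual; the kernel
           code is not re-sent per launch ... */

        clReleaseKernel(k);
        clReleaseProgram(prog);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }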
