Why doesn't my kernel fail when I use slightly more than 64KB of constant cache? (OpenCL/CUDA)
I ran some tests on my kernel, which uses the constant cache. If I use 16,000 floats (16,000 × 4 bytes = 64,000 bytes, roughly 64 KB), everything runs smoothly. If I use 16,200 it still runs smoothly. If I use 16,400 floats, I get errors in my results (not from OpenCL). Could it just be that technically there is 64.x KB of constant cache available? Should I even trust my code if I am using exactly 16,000 floats? Usually I expect code to break when usage is pushed right up to the stated limit.
Comments (2)
You can and should query this using the OpenCL clGetDeviceInfo API, with the parameter CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE. The OpenCL 1.1 spec says that a conforming implementation has to provide at least 64K bytes, which is probably what your device is implementing.
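A minimal sketch of that query, assuming a single platform with at least one GPU device (error handling abbreviated):

    /* Query the constant buffer limit for the first GPU device.
       Assumes one platform with at least one GPU; error checks omitted. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        cl_ulong max_const = 0;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* The limit is reported in bytes as a cl_ulong. */
        clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                        sizeof(max_const), &max_const, NULL);

        printf("CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: %llu bytes (%llu floats)\n",
               (unsigned long long)max_const,
               (unsigned long long)(max_const / sizeof(float)));
        return 0;
    }

One more data point: if the reported limit is exactly 64 KiB (65,536 bytes), that's room for 16,384 floats, which would line up neatly with your observation that 16,200 works and 16,400 fails.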
If you exceed this limit, then OpenCL should either give you an error or transparently move your constant array into a global memory array for you.
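As a sanity check, you can at least see whether your implementation reports anything at the points where such an error could plausibly surface. A hedged sketch (ctx, queue, kernel, host_data, and n are assumed to exist from your setup code):

    /* Sketch: where an over-limit __constant buffer might be reported,
       if the implementation reports it at all. */
    cl_int err;
    cl_mem cbuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(cl_float), host_data, &err);
    if (err != CL_SUCCESS) { /* buffer creation itself can fail */ }

    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &cbuf);
    if (err != CL_SUCCESS) { /* some drivers flag the problem here */ }

    size_t gws = 1024;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL,
                                 0, NULL, NULL);
    if (err != CL_SUCCESS) { /* ...others only at launch, or never */ }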
If it's not returning an error but is giving you bad results, that's a bug in your OpenCL implementation. Not too surprising; none of them are very mature yet. You should definitely report the bug to the vendor (which I assume is NVidia, given your references to CUDA), after making sure you've got the latest version installed, of course.
I haven't even glanced at GPU specs to find out which machines do and don't have hard limits of 64KB of constant memory; I'll assume you've made sure that this is in fact the limit on your card.
I will add the observation, though, that GPUs and their CUDA/OpenCL/whatever runtimes generally aren't very aggressive about catching or flagging errors, and certainly don't make an effort to fail if invalid parameters are used. While I've never seen it explicitly stated, my understanding is that this is partly to avoid overhead, but mostly to be as forgiving as possible; in a game, it's better that a monster's arm looks funny for a few frames than that the entire game dies because someone made a single out-of-bounds access.
For those doing GPGPU programming, this is awkward: it's up to you to make sure all of your parameters and memory uses are valid, and if they aren't, the results can be weird. Sometimes it will work, and often it won't. But such is the way of things. I certainly wouldn't count on things failing reliably, and in some obvious and helpful way, if you go a bit over a given memory limit.
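In that spirit, a cheap defensive check against the queried limit is worth the few lines. A sketch building on the clGetDeviceInfo query shown earlier (n_floats and max_const are assumed from that code):

    /* Sketch: validate against the queried limit before committing to
       __constant memory; fall back to __global otherwise. */
    size_t needed = n_floats * sizeof(cl_float);
    if (needed > max_const) {
        fprintf(stderr, "constant data (%zu B) exceeds device limit (%llu B); "
                        "using a __global buffer instead\n",
                needed, (unsigned long long)max_const);
        /* bind the data to a kernel variant taking a __global pointer */
    }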