How does the OpenCL command queue work, and what can I ask of it?
I'm working on an algorithm that does pretty much the same operation a bunch of times. Since the operation consists of some linear algebra (BLAS), I thought I would try using the GPU for this.
I've written my kernel and started pushing kernels onto the command queue. Since I don't want to wait after each call, I figured I would try daisy-chaining my calls with events and just start pushing them onto the queue:
call kernel1(return event1)
call kernel2(wait for event 1, return event 2)
...
call kernel1000000(wait for event 999999)
Now my question is: does all of this get pushed to the graphics chip, or does the driver store the queue? Is there a bound on the number of events I can use, or on the length of the command queue? I've looked around but haven't been able to find this.
I'm using atMonitor to check the utilization of my GPU, and it's pretty hard to push it above 20%. Could this simply be because I'm not able to push the calls out there fast enough? My data is already stored on the GPU, and all I'm passing out there are the actual calls.
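For reference, the daisy-chained enqueue pattern described above might look roughly like this in OpenCL host code. This is a minimal sketch only: `queue`, `kernel`, and the function and parameter names are placeholders, not taken from the question, and error checking is omitted.

```c
/* Sketch of the daisy-chained enqueue pattern (placeholder names).
   Assumes `queue` and `kernel` have already been created. */
#include <CL/cl.h>
#include <stddef.h>

void enqueue_chain(cl_command_queue queue, cl_kernel kernel,
                   size_t global_size, int n_iters)
{
    cl_event prev = NULL;
    for (int i = 0; i < n_iters; ++i) {
        cl_event next;
        /* Each enqueue waits on the event from the previous iteration. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL,
                               prev ? 1 : 0,        /* num_events_in_wait_list */
                               prev ? &prev : NULL, /* event_wait_list */
                               &next);
        /* Events must be released on the host or they accumulate. */
        if (prev) clReleaseEvent(prev);
        prev = next;
    }
    if (prev) clReleaseEvent(prev);
    clFlush(queue); /* hand the batch to the driver without blocking */
}
```

Note that with a million chained enqueues, releasing each event as soon as it is no longer needed (as above) matters; holding all of them alive is itself a host-side resource cost.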
2 Answers
First, you shouldn't wait for an event from a previous kernel unless the next kernel has data dependencies on that previous kernel. Device utilization (normally) depends on there always being something ready-to-go in the queue. Only wait for an event when you need to wait for an event.
That's implementation-defined. Remember, OpenCL works on more than just GPUs! In terms of the CUDA-style device/host dichotomy, you should probably think of command queue operations as happening (for most implementations) on the "host."
Try queuing up multiple kernel calls without waits in between them. Also, make sure you are using an optimal work-group size. If you do both of those, you should be able to max out your device.
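Concretely, since the default OpenCL command queue is in-order (kernels on one queue execute in submission order unless `CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE` was requested), the loop above can simply drop the events. This is a sketch under the same placeholder names as before:

```c
/* Sketch: the same loop without event chaining. On an in-order
   queue the kernels still run in submission order, so events are
   only needed for cross-queue dependencies or host synchronization.
   Placeholder names; error checking omitted. */
#include <CL/cl.h>
#include <stddef.h>

void enqueue_batch(cl_command_queue queue, cl_kernel kernel,
                   size_t global_size, size_t local_size, int n_iters)
{
    for (int i = 0; i < n_iters; ++i) {
        /* No wait list, no output event: the driver can keep the
           device fed with back-to-back launches. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size,
                               0, NULL, NULL);
    }
    clFlush(queue);  /* submit; call clFinish(queue) only when you
                        actually need the results */
}
```

Passing an explicit `local_size` is where the "optimal work-group size" advice comes in; profiling a few candidate sizes on the target device is the usual way to pick it.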
Unfortunately I don't know the answers to all of your questions, and you've got me wondering about the same things now too, but I can say that I doubt the OpenCL queue will ever become full, since your GPU should finish executing the last queued command before at least 20 more commands are submitted. This is only true, though, if your GPU has a "watchdog," because that would stop ridiculously long kernels (I think 5 seconds or more) from executing.