Inactive invocations in Vulkan subgroups
I am reading the Vulkan subgroup tutorial, and it mentions that if the local workgroup size is less than the subgroup size, then we will always have inactive invocations.
This post clarifies that there is no direct relation between a SubgroupLocalInvocationId and a LocalInvocationId. If there is no relation between the subgroup and local workgroup IDs, how does the small size of the local workgroup guarantee inactive invocations?
My guess is as follows:
I am thinking that the invocations (threads) in a workgroup are divided into subgroups before executing on the GPU. Each subgroup would be an exact match for the basic unit of execution on the GPU (warp for an NVIDIA GPU). This means that if the workgroup size is smaller than the subgroup size then the system somehow tries to construct a minimal subgroup which can be executed on the GPU. This would require using some "inactive/dead" invocations just to meet the minimum subgroup size criteria leading to the aforementioned guaranteed inactive invocations. Is this understanding correct? (I deliberately tried to use basic words for simplicity, please let me know if any of the terminology is incorrect)
Thanks
Comments (1)
A compute dispatch defines the global workgroup through its parameters: the x, y and z arguments of the dispatch give the number of local workgroups in each dimension, so the global workgroup consists of x×y×z local workgroups.
Each local workgroup in turn has its own x×y×z invocations, a size that is defined by the shader.
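As a rough illustration of how the two sizes combine, here is a small Python sketch. The function name and the example sizes are made up for illustration and are not part of any Vulkan API. It assumes the dispatch parameters are workgroup counts (as with vkCmdDispatch) and that the shader declares the local workgroup size:

```python
def global_invocations(group_counts, local_size):
    """Return the total number of invocations launched by one dispatch.

    group_counts: (x, y, z) workgroup counts passed to the dispatch (assumed).
    local_size:   (x, y, z) local workgroup size declared in the shader (assumed).
    """
    gx, gy, gz = group_counts
    lx, ly, lz = local_size
    workgroups = gx * gy * gz      # local workgroups in the global workgroup
    per_workgroup = lx * ly * lz   # invocations in each local workgroup
    return workgroups * per_workgroup


# Hypothetical example: 4x2x1 workgroups, each with a local size of 8x8x1.
print(global_invocations((4, 2, 1), (8, 8, 1)))  # 512
```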
A local workgroup is partitioned into subgroups: its invocations are rearranged into subgroups. A subgroup has (1-dimensional) SubgroupSize invocations, not all of which need to be assigned a local workgroup invocation. A subgroup must not span multiple local workgroups; it can only use invocations from a single local workgroup.
Otherwise, how this partitioning is done seems largely unspecified, except that under very specific conditions you are guaranteed full subgroups, which means none of the SubgroupSize invocations in a subgroup will stay vacant. If those conditions are not fulfilled, the driver may keep some invocations in a subgroup inactive as it sees fit.
If the local workgroup has fewer invocations in total than SubgroupSize, then some invocations of the subgroup do need to stay inactive, as there are not enough local workgroup invocations to fill even one subgroup.
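The arithmetic behind that last point can be sketched in a few lines of Python. This is only a lower bound under the assumption that the driver packs invocations as tightly as possible; since the actual partitioning is largely unspecified, a real driver may leave more invocations inactive. The function name and the sizes are hypothetical:

```python
def min_inactive_invocations(local_size, subgroup_size):
    """Return (subgroup count, minimum inactive invocations) for one local workgroup.

    Assumes the tightest possible packing; a driver is free to do worse.
    """
    lx, ly, lz = local_size
    invocations = lx * ly * lz
    # Ceiling division: subgroups cannot span workgroups, so a partially
    # filled subgroup still occupies a whole subgroup.
    subgroups = (invocations + subgroup_size - 1) // subgroup_size
    inactive = subgroups * subgroup_size - invocations
    return subgroups, inactive


# Hypothetical sizes: a 16-invocation workgroup on hardware with SubgroupSize 32.
print(min_inactive_invocations((16, 1, 1), 32))  # (1, 16)
# A workgroup that is an exact multiple of the subgroup size can be fully packed.
print(min_inactive_invocations((64, 1, 1), 32))  # (2, 0)
```

With 16 invocations and a SubgroupSize of 32, at least 16 invocations of the single subgroup have nothing to run, which is the guaranteed-inactive case the question asks about.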