How can I reliably query the SIMD group size for a Metal compute shader? threadExecutionWidth doesn't always match
I'm trying to use the SIMD group reduction/prefix functions in a series of reasonably complex compute kernels in a Mac app. I need to allocate some threadgroup memory for coordinating between SIMD groups in the same threadgroup. This array should therefore have a capacity depending on [[simdgroups_per_threadgroup]], but that's not a compile-time value, so it can't be used as an array dimension.
Now, according to various WWDC session videos, threadExecutionWidth on the pipeline object should return the SIMD group size, with which I could then allocate an appropriate amount of memory using setThreadgroupMemoryLength:atIndex: on the compute encoder.
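Concretely, the approach looks roughly like this (a minimal, self-contained Swift sketch; the kernel name, buffer indices, and the trivial reduction body are placeholder assumptions, not my real code). The threadgroup array arrives as a threadgroup buffer argument precisely because its capacity isn't a compile-time constant:

    import Metal

    // Placeholder kernel: one float of threadgroup scratch per SIMD group.
    let source = """
    #include <metal_stdlib>
    using namespace metal;

    kernel void reduce_sketch(device float *out [[buffer(0)]],
                              threadgroup float *partials [[threadgroup(0)]],
                              uint lane [[thread_index_in_simdgroup]],
                              uint group [[simdgroup_index_in_threadgroup]],
                              uint num_groups [[simdgroups_per_threadgroup]])
    {
        // Each SIMD group writes one entry, so `partials` must hold at least
        // num_groups floats -- this is where the out-of-bounds access happens
        // if the CPU-side guess is too small.
        if (lane == 0) partials[group] = 1.0f;
        threadgroup_barrier(mem_flags::mem_threadgroup);
        if (group == 0 && lane == 0) {
            float sum = 0.0f;
            for (uint i = 0; i < num_groups; ++i) sum += partials[i];
            *out = sum;
        }
    }
    """

    let device = MTLCreateSystemDefaultDevice()!
    let library = try! device.makeLibrary(source: source, options: nil)
    let pipeline = try! device.makeComputePipelineState(
        function: library.makeFunction(name: "reduce_sketch")!)

    // The WWDC-suggested sizing: treat threadExecutionWidth as the SIMD group size.
    let tgSize = pipeline.maxTotalThreadsPerThreadgroup
    let expectedGroups =
        (tgSize + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth

    let out = device.makeBuffer(length: MemoryLayout<Float>.stride,
                                options: .storageModeShared)!
    let queue = device.makeCommandQueue()!
    let cb = queue.makeCommandBuffer()!
    let enc = cb.makeComputeCommandEncoder()!
    enc.setComputePipelineState(pipeline)
    enc.setBuffer(out, offset: 0, index: 0)
    // setThreadgroupMemoryLength requires a multiple of 16 bytes.
    enc.setThreadgroupMemoryLength(
        ((expectedGroups * MemoryLayout<Float>.stride + 15) / 16) * 16, index: 0)
    enc.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                             threadsPerThreadgroup: MTLSize(width: tgSize, height: 1, depth: 1))
    enc.endEncoding()
    cb.commit()
    cb.waitUntilCompleted()
    print(out.contents().load(as: Float.self)) // == number of SIMD groups, if the sizing was right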
This works consistently on some hardware (e.g. Apple M1, where threadExecutionWidth always seems to report 32), but I'm hitting configurations where threadExecutionWidth does not match the apparent SIMD group size, causing runtime errors due to out-of-bounds access (e.g. on Intel UHD Graphics 630, threadExecutionWidth = 16 for some complex kernels, although the SIMD group size seems to be 32).
So:
- Is there a reliable way to query SIMD group size for a compute kernel before it runs?
- Alternately, will the SIMD group size always be the same for all kernels on a device?
If the latter is at least true, I can presumably trust threadExecutionWidth for the most trivial of kernels? Or should I submit a trivial kernel to the GPU which returns [[threads_per_simdgroup]]?
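(If it helps to picture it, the probe-kernel option could be as small as the following sketch; names are hypothetical. The caveat is that this only reports the width this particular kernel was compiled to, and whether other kernels can differ is exactly the open question:)

    import Metal

    // Probe kernel: ask the GPU for [[threads_per_simdgroup]] directly.
    let probeSource = """
    #include <metal_stdlib>
    using namespace metal;

    kernel void probe_simd_width(device uint *result [[buffer(0)]],
                                 uint width [[threads_per_simdgroup]],
                                 uint tid [[thread_index_in_threadgroup]])
    {
        if (tid == 0) *result = width;
    }
    """

    let device = MTLCreateSystemDefaultDevice()!
    let library = try! device.makeLibrary(source: probeSource, options: nil)
    let pipeline = try! device.makeComputePipelineState(
        function: library.makeFunction(name: "probe_simd_width")!)
    let result = device.makeBuffer(length: MemoryLayout<UInt32>.stride,
                                   options: .storageModeShared)!
    let queue = device.makeCommandQueue()!
    let cb = queue.makeCommandBuffer()!
    let enc = cb.makeComputeCommandEncoder()!
    enc.setComputePipelineState(pipeline)
    enc.setBuffer(result, offset: 0, index: 0)
    enc.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                             threadsPerThreadgroup: MTLSize(width: pipeline.threadExecutionWidth,
                                                            height: 1, depth: 1))
    enc.endEncoding()
    cb.commit()
    cb.waitUntilCompleted()
    print("threads_per_simdgroup:", result.contents().load(as: UInt32.self))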
I suspect the problem might occur in kernels where Metal offers an "odd" (non-power-of-2) maximum threadgroup size, although in the case I'm encountering, the maximum threadgroup size is reported as 896, which is an integer multiple of 32, so it's not as if it's using the greatest common divisor of the max threadgroup size and the SIMD group size for threadExecutionWidth.
1 Answer
I never found a particularly satisfying solution to this, but I did at least find an effective one:

1. Start the kernel off with the threadgroup memory allocation implied by threadExecutionWidth.
2. In the kernel, compare that expectation against the actual value of [[simdgroups_per_threadgroup]]. If it matches, great, run the rest of the kernel.
3. If it doesn't match, write the actual value to a device argument memory buffer, then early-out of the compute kernel.
4. On the CPU, check via the state in device memory whether the kernel early-outed. If so, inspect the reported SIMD group size, adjust buffer allocations, then re-run the kernel with the new value.

For the truly paranoid, it may be wise to make the check in step 2 a lower or upper bound, or perhaps a range, rather than an equality check: e.g., the allocated memory is safe for SIMD group sizes up to or from N threads. That way, if changing the threadgroup buffer allocation should itself change [[simdgroups_per_threadgroup]] (?!), you don't end up bouncing backwards and forwards between values, making no progress.

Also pay attention to what you do in SIMD groups: not all GPU models support SIMD group reduction functions, even if they support SIMD permutations, so ship alternate versions of kernels for such older GPUs if necessary.
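Sketched in code, the protocol looks something like this (hypothetical names throughout, and a real kernel obviously does more than bail out; a sketch of the technique, not a drop-in implementation):

    import Metal

    let guardedSource = """
    #include <metal_stdlib>
    using namespace metal;

    kernel void guarded_kernel(constant uint &expected_groups [[buffer(0)]],
                               device uint *actual_groups [[buffer(1)]],
                               threadgroup float *partials [[threadgroup(0)]],
                               uint num_groups [[simdgroups_per_threadgroup]],
                               uint tid [[thread_index_in_threadgroup]])
    {
        // Step 2: a lower-bound check, per the paranoid variant above.
        if (num_groups > expected_groups) {
            // Step 3: report the real value, then early-out before touching
            // the under-sized scratch array.
            if (tid == 0) *actual_groups = num_groups;
            return;
        }
        // ... rest of the kernel; partials[] safely holds num_groups entries ...
    }
    """

    let device = MTLCreateSystemDefaultDevice()!
    let library = try! device.makeLibrary(source: guardedSource, options: nil)
    let pipeline = try! device.makeComputePipelineState(
        function: library.makeFunction(name: "guarded_kernel")!)
    let queue = device.makeCommandQueue()!
    let feedback = device.makeBuffer(length: MemoryLayout<UInt32>.stride,
                                     options: .storageModeShared)!

    // Step 1: initial guess derived from threadExecutionWidth.
    let tgSize = pipeline.maxTotalThreadsPerThreadgroup
    var expected = UInt32(
        (tgSize + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth)

    while true {
        feedback.contents().storeBytes(of: 0, as: UInt32.self)
        let cb = queue.makeCommandBuffer()!
        let enc = cb.makeComputeCommandEncoder()!
        enc.setComputePipelineState(pipeline)
        enc.setBytes(&expected, length: MemoryLayout<UInt32>.stride, index: 0)
        enc.setBuffer(feedback, offset: 0, index: 1)
        enc.setThreadgroupMemoryLength(
            ((Int(expected) * MemoryLayout<Float>.stride + 15) / 16) * 16, index: 0)
        enc.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                                 threadsPerThreadgroup: MTLSize(width: tgSize, height: 1, depth: 1))
        enc.endEncoding()
        cb.commit()
        cb.waitUntilCompleted()

        // Step 4: if the kernel early-outed, adopt the reported value and retry.
        let actual = feedback.contents().load(as: UInt32.self)
        if actual == 0 { break }   // guess was sufficient; results are valid
        expected = actual
    }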
Finally, I've found most GPUs report SIMD group sizes of 32 threads, but the Intel Iris Graphics 6100 in ~2015 MacBook Pros reports a [[threads_per_simdgroup]] (and threadExecutionWidth) value of 8. (And it doesn't support SIMD reduction functions, but does support SIMD permutation functions, including simd_ballot(),
which can be almost as effective as reductions for some algorithms.)
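For instance, a 0/1 count that would normally be a simd_sum() of ones can be done with simd_ballot() plus popcount. A hypothetical kernel-side helper (the uint truncation assumes SIMD groups no wider than 32 lanes, which covers the 8/16/32-wide GPUs discussed here):

    #include <metal_stdlib>
    using namespace metal;

    // Hypothetical helper: count how many lanes in this SIMD group pass a
    // predicate, without simd_sum(). simd_vote converts explicitly to a 64-bit
    // lane mask; truncating to uint is safe for SIMD groups of <= 32 threads.
    static inline uint lanes_voting(bool predicate)
    {
        simd_vote ballot = simd_ballot(predicate);
        return popcount(uint(static_cast<simd_vote::vote_t>(ballot)));
    }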