How can I reliably query the SIMD group size for a Metal compute shader? threadExecutionWidth doesn't always match

Posted 2025-02-10 19:47:47

I'm trying to use the SIMD group reduction/prefix functions in a series of reasonably complex compute kernels in a Mac app. I need to allocate some threadgroup memory for coordinating between SIMD groups in the same thread group. This array should therefore have a capacity depending on [[simdgroups_per_threadgroup]], but that's not a compile-time value, so it can't be used as an array dimension.

Now, according to various WWDC session videos, threadExecutionWidth on the pipeline object should return the SIMD group size, with which I could then allocate an appropriate amount of memory using setThreadgroupMemoryLength:atIndex: on the compute encoder.
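
To make the setup concrete, something along these lines is what's meant; the toy reduceKernel and the encodeReduction helper below are illustrative placeholders rather than the app's actual kernels:

```swift
import Metal

// Toy stand-in for one of the real kernels: each SIMD group reduces its values
// and parks the result in a threadgroup array with one slot per SIMD group.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void reduceKernel(device const float *input     [[buffer(0)]],
                         device float *groupTotals     [[buffer(1)]],
                         threadgroup float *simdTotals [[threadgroup(0)]],
                         uint gid        [[thread_position_in_grid]],
                         uint lane       [[thread_index_in_simdgroup]],
                         uint simdIndex  [[simdgroup_index_in_threadgroup]],
                         uint simdCount  [[simdgroups_per_threadgroup]],
                         uint localIndex [[thread_index_in_threadgroup]],
                         uint groupIndex [[threadgroup_position_in_grid]])
{
    float total = simd_sum(input[gid]);               // SIMD-scoped reduction
    if (lane == 0) { simdTotals[simdIndex] = total; } // one slot per SIMD group
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if (localIndex == 0) {
        float sum = 0.0f;
        for (uint i = 0; i < simdCount; ++i) { sum += simdTotals[i]; }
        groupTotals[groupIndex] = sum;
    }
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "reduceKernel")!)

// Host side: derive the number of SIMD groups per threadgroup from
// threadExecutionWidth, assuming it equals the runtime SIMD group size
// (which is exactly the assumption that turns out not to hold everywhere).
func encodeReduction(into encoder: MTLComputeCommandEncoder, threadsPerThreadgroup: Int) {
    let simdWidth = pipeline.threadExecutionWidth
    let simdGroups = (threadsPerThreadgroup + simdWidth - 1) / simdWidth
    encoder.setComputePipelineState(pipeline)
    encoder.setThreadgroupMemoryLength(simdGroups * MemoryLayout<Float>.stride, index: 0)
    // ...set buffers 0/1 and dispatchThreadgroups(...) as usual...
}
```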

This works consistently on some hardware (e.g. Apple M1, threadExecutionWidth always seems to report 32) but I'm hitting configurations where threadExecutionWidth does not match apparent SIMD group size, causing runtime errors due to out of bounds access. (e.g. on Intel UHD Graphics 630, threadExecutionWidth = 16 for some complex kernels, although SIMD group size seems to be 32)

So:

  1. Is there a reliable way to query SIMD group size for a compute kernel before it runs?
  2. Alternately, will the SIMD group size always be the same for all kernels on a device?

If the latter is at least true, I can presumably trust threadExecutionWidth for the most trivial of kernels? Or should I submit a trivial kernel to the GPU which returns [[threads_per_simdgroup]]?
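
For reference, such a probe kernel would only be a few lines. This is a hypothetical sketch (the kernel and buffer names are made up):

```swift
import Metal

// Hypothetical probe: a single-thread kernel that does nothing but report the
// runtime SIMD group size ([[threads_per_simdgroup]]) back to the host.
let probeSource = """
#include <metal_stdlib>
using namespace metal;

kernel void probeSimdWidth(device uint *result [[buffer(0)]],
                           uint simdWidth [[threads_per_simdgroup]])
{
    result[0] = simdWidth;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: probeSource, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "probeSimdWidth")!)
let result = device.makeBuffer(length: MemoryLayout<UInt32>.stride, options: .storageModeShared)!

let queue = device.makeCommandQueue()!
let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(result, offset: 0, index: 0)
encoder.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                             threadsPerThreadgroup: MTLSize(width: 1, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

print("Probe kernel reports a SIMD group size of \(result.contents().load(as: UInt32.self))")
```

Of course, whether a trivial kernel's SIMD width matches that of a more complex kernel is exactly what's in doubt, which is what the answer below works around.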

I suspect the problem might occur in kernels where Metal offers an "odd" (non-pow2) maximum threadgroup size, although in the case I'm encountering, the maximum threadgroup size is reported as 896, which is an integer multiple of 32, so it's not as if it's using the greatest common divisor of max threadgroup size and SIMD group size for threadExecutionWidth.

Comments (1)

柠檬色的秋千 2025-02-17 19:47:47

I never found a particularly satisfying solution to this, but I did at least find an effective one:

  1. Pass the expected SIMD group size as a kernel argument, and use it as the basis for allocating buffer sizes. Start it off as threadExecutionWidth.
  2. As the first part of the compute kernel, compare this to the actual value of threads_per_simdgroup. If it matches, great, run the rest of the kernel.
  3. If it doesn't match, return the actual SIMD size in a feedback/error reporting variable/field in a device argument memory buffer. Then early-out of the compute kernel.
  4. On the host side, check if the compute kernel exited early via the status in device memory. If so, inspect the reported SIMD group size, adjust buffer allocations, then re-run the kernel with the new value. (A sketch of this handshake follows below.)
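
Concretely, the handshake from steps 1-4 might look roughly like the following sketch. All names are hypothetical, the kernel body is a stand-in for the real work, and error handling is omitted:

```swift
import Metal

// Hypothetical sketch of steps 1-4. The kernel validates the SIMD width the
// host assumed, reports the real one and bails out on a mismatch, and the
// host then reallocates and retries.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void reduceKernel(device const float *input       [[buffer(0)]],
                         device float *groupTotals       [[buffer(1)]],
                         constant uint &expectedSimdSize [[buffer(2)]],
                         device uint *actualSimdSize     [[buffer(3)]],  // feedback slot, 0 = OK
                         threadgroup float *simdTotals   [[threadgroup(0)]],
                         uint gid       [[thread_position_in_grid]],
                         uint lane      [[thread_index_in_simdgroup]],
                         uint simdIndex [[simdgroup_index_in_threadgroup]],
                         uint simdSize  [[threads_per_simdgroup]])
{
    // Steps 2/3: if the host guessed wrong, report the real width and early-out.
    // (With several threadgroups in flight, a real kernel might use an atomic here.)
    if (simdSize != expectedSimdSize) {
        if (gid == 0) { actualSimdSize[0] = simdSize; }
        return;
    }
    float total = simd_sum(input[gid]);
    if (lane == 0) { simdTotals[simdIndex] = total; }
    // ...the rest of the real kernel would combine simdTotals across SIMD groups...
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "reduceKernel")!)
let queue = device.makeCommandQueue()!

let threadsPerGroup = min(256, pipeline.maxTotalThreadsPerThreadgroup)
let input = device.makeBuffer(length: threadsPerGroup * MemoryLayout<Float>.stride,
                              options: .storageModeShared)!
let groupTotals = device.makeBuffer(length: MemoryLayout<Float>.stride, options: .storageModeShared)!
let feedback = device.makeBuffer(length: MemoryLayout<UInt32>.stride, options: .storageModeShared)!

// Step 1: start with threadExecutionWidth as the guess, then loop until the kernel accepts it.
var expectedSimdSize = UInt32(pipeline.threadExecutionWidth)
while true {
    feedback.contents().storeBytes(of: 0, as: UInt32.self)

    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(input, offset: 0, index: 0)
    encoder.setBuffer(groupTotals, offset: 0, index: 1)
    encoder.setBytes(&expectedSimdSize, length: MemoryLayout<UInt32>.stride, index: 2)
    encoder.setBuffer(feedback, offset: 0, index: 3)

    // Size the threadgroup array from the current guess: one slot per SIMD group.
    let simdGroups = (threadsPerGroup + Int(expectedSimdSize) - 1) / Int(expectedSimdSize)
    encoder.setThreadgroupMemoryLength(simdGroups * MemoryLayout<Float>.stride, index: 0)

    encoder.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                                 threadsPerThreadgroup: MTLSize(width: threadsPerGroup, height: 1, depth: 1))
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // Step 4: if the kernel bailed out, adopt the reported width and retry.
    let reported = feedback.contents().load(as: UInt32.self)
    if reported == 0 { break }  // the guess was right, results are valid
    expectedSimdSize = reported
}
```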

For the truly paranoid, it may be wise to make the check in step 2 a lower or upper bound, or perhaps a range, rather than an equality check: e.g. the allocated memory is safe for SIMD group sizes up to (or from) N threads. That way, if changing the threadgroup buffer allocation should somehow change simdgroups_per_threadgroup, you don't end up bouncing back and forth between values, making no progress.

Also pay attention to what you do in SIMD groups: not all GPU models support SIMD group reduction functions, even if they support SIMD permutations, so ship alternate versions of kernels for such older GPUs if necessary.
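
One way to ship such alternate versions is to compile both variants into the library and choose the pipeline at creation time. The sketch below uses made-up function names, and deliberately leaves open how SIMD-reduction support is detected (the Metal feature set tables, an MTLGPUFamily check, or a small runtime probe):

```swift
import Metal

// Hypothetical: two variants of the same kernel compiled into the library, one
// built around simd_sum()/simd_prefix_* and one that only uses threadgroup
// memory and barriers. How `supportsSimdReduction` is determined is up to you.
func makeReductionPipeline(device: MTLDevice,
                           library: MTLLibrary,
                           supportsSimdReduction: Bool) throws -> MTLComputePipelineState {
    let name = supportsSimdReduction ? "reduceKernel_simd" : "reduceKernel_fallback"
    guard let function = library.makeFunction(name: name) else {
        fatalError("kernel \(name) is missing from the library")
    }
    return try device.makeComputePipelineState(function: function)
}
```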

Finally, I've found most GPUs to report SIMD group sizes of 32 threads, but the Intel Iris Graphics 6100 in ~2015 MacBook Pros reports a threads_per_simdgroup (and threadExecutionWidth) value of 8. (And it doesn't support SIMD reduction functions, but does support SIMD permutation functions, including simd_ballot(), which can be almost as effective as reductions for some algorithms.)
