Compute Prof's fields for incoherent and coherent gst/gld? (CUDA/OpenCL)

Posted on 2024-09-26 04:20:50

I am using Compute Prof 3.2 and a GeForce GTX 280, which I believe has compute capability 1.3.

This file seems to show that I should be able to see these fields, since I am using a 1.x compute device. However, I don't see them, and the User Guide for the 3.2 toolkit says I can't see them, although it calls them gst_uncoalesced and gst_coalesced.

To sum up, I am confused about how to tell from the profiler whether I am making non-coalesced reads from global memory. It doesn't look like Fermi cards will report this either, but I am not worried about them for now. If anybody can elaborate on the situation, I would appreciate it.

Also, I've been told to look at the assembly of my kernels to figure this out, so any explanation of how to do that is appreciated too. I am only just starting to work that out as well :)

Comments (1)

拔了角的鹿 2024-10-03 04:20:50

I had similar problems with the profiling output. On an 8600 (compute capability 1.0) it showed both coalesced and uncoalesced reads/writes, but on the GTX 280 it showed only coalesced ones. I assumed that was due to the better coalescing on the GTX 280 making the distinction less clear-cut (is a memory read uncoalesced if all but one of the words it fetches are not needed?). However, you can just look at the summary table. There you will find a load efficiency and a store efficiency for each kernel. If all accesses are coalesced, that efficiency should be 1; otherwise it is less than 1 (0.5 meaning that only half of the loaded bytes are used).

Of course, that doesn't help much in figuring out where exactly the uncoalesced accesses are inside your kernel, so the best way is still to know how coalescing works (the addresses of each half-warp are gathered into 32-, 64- and 128-byte accesses; values inside such an area are transferred anyway, even if they are not accessed) and to analyse your access patterns.
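To make the access-pattern analysis a bit more concrete, here is a rough host-side sketch in Python. It is not anything the profiler exposes. It models how I understand compute capability 1.2/1.3 devices to group the addresses of one half-warp into 32-, 64- and 128-byte transactions, and then computes an efficiency figure. The function names `halfwarp_transactions` and `load_efficiency` are made up for this illustration, the model assumes word-aligned addresses, and the "efficiency" here is simply used bytes divided by transferred bytes, which may not match the profiler's exact formula.

```python
def halfwarp_transactions(addresses, word_size=4):
    """Simplified model: list of (start, size_bytes) transactions for one half-warp.

    `addresses` holds one byte address per thread, ordered by thread id,
    and each address is assumed to be aligned to `word_size`.
    """
    # Assumed segment sizes: 32 B for 1-byte words, 64 B for 2-byte words,
    # 128 B for 4-, 8- and 16-byte words.
    seg_size = {1: 32, 2: 64}.get(word_size, 128)
    pending = list(addresses)
    transactions = []
    while pending:
        # Segment containing the address of the lowest-numbered pending thread.
        base = (pending[0] // seg_size) * seg_size
        served = [a for a in pending if base <= a < base + seg_size]
        pending = [a for a in pending if not (base <= a < base + seg_size)]
        start, size = base, seg_size
        lo, hi = min(served), max(served) + word_size  # byte range actually used
        # Shrink the transaction while only its lower or upper half is used.
        while size > 32:
            half = size // 2
            if hi <= start + half:
                size = half
            elif lo >= start + half:
                start, size = start + half, half
            else:
                break
        transactions.append((start, size))
    return transactions


def load_efficiency(addresses, word_size=4):
    """Fraction of transferred bytes that at least one thread actually uses."""
    used = len(set(addresses)) * word_size
    transferred = sum(size for _, size in halfwarp_transactions(addresses, word_size))
    return used / transferred


contiguous = [4 * tid for tid in range(16)]   # thread tid reads float tid
stride_two = [8 * tid for tid in range(16)]   # thread tid reads float 2 * tid
scattered = [128 * tid for tid in range(16)]  # thread tid reads float 32 * tid

print(load_efficiency(contiguous))  # 1.0
print(load_efficiency(stride_two))  # 0.5
print(load_efficiency(scattered))   # 0.125
```

In this model the stride-two pattern gives the 0.5 case from above: one 128-byte transaction is issued, but only 64 of its bytes are used. Plugging in the index expression from your own kernel (for `threadIdx.x` from 0 to 15) should give a rough idea of which loads and stores pull the efficiency down.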
