All blocks read the same global memory location. What is the fastest approach?

Posted on 2025-01-31 05:21:46

I am writing an algorithm in which all blocks read the same address. For example, we have a list = [1, 2, 3, 4], and all blocks read it and store it in their own shared memory... My tests show that the more blocks read it, the slower it gets... I guess no broadcast happens here? Any idea how I can make it faster? Thank you!!!

I learned from a previous post that the read can be broadcast within one warp, but it seems that cannot happen across different warps... (Actually, in my case, the threads within one warp are not reading the same location...)
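
For reference, a minimal sketch of the access pattern described above (all names are illustrative, not from the original post): every block copies the same global list into its own shared memory before using it.

    __global__ void loadSharedCopy(const float* __restrict__ list,
                                   float* __restrict__ out, int n)
    {
        extern __shared__ float sList[];   // sized at launch: n * sizeof(float)

        // Every block reads the identical global addresses list[0..n-1].
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            sList[i] = list[i];
        __syncthreads();

        // Placeholder use of the staged copy.
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        out[gid] = sList[gid % n];
    }

    // Launch (illustrative):
    //   loadSharedCopy<<<blocks, threads, n * sizeof(float)>>>(dList, dOut, n);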

Comments (1)

只怪假的太真实 2025-02-07 05:21:46

Once a list element has been accessed by the first warp of an SM unit, the second warp in the same SM gets it from the cache, and it is broadcast to all SIMT lanes. But a warp on another SM may not have it in its L1 cache, so it fetches from L2 into L1 first.

It is similar with __constant__ memory, but that requires the same address to be accessed by all threads. Its latency is closer to register access. __constant__ memory behaves like the instruction cache: you get more performance when all threads do the same thing.

For example, if you have a Gaussian filter that iterates over the same coefficient list on all threads, it is better to use constant memory. Shared memory does not have much of an advantage here, because the filter array is not scanned randomly. Shared memory is better when the filter array contents differ per block, or when random access is needed.
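
A minimal sketch of that approach, assuming a small 1D filter (COEFFS and KSIZE are hypothetical names; the coefficients would be uploaded once from the host with cudaMemcpyToSymbol): every thread reads COEFFS[k] at the same k in the same loop iteration, which is exactly the same-address pattern the constant cache broadcasts.

    #define KSIZE 9
    __constant__ float COEFFS[KSIZE];

    __global__ void gauss1d(const float* __restrict__ in,
                            float* __restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = 0.0f;
        // All threads read COEFFS[k] at the same k in each iteration,
        // so the constant cache broadcasts a single value to the warp.
        for (int k = 0; k < KSIZE; ++k) {
            int j = min(max(i + k - KSIZE / 2, 0), n - 1);  // clamp at edges
            acc += COEFFS[k] * in[j];
        }
        out[i] = acc;
    }

    // Host side (illustrative):
    //   float h[KSIZE] = { /* Gaussian coefficients */ };
    //   cudaMemcpyToSymbol(COEFFS, h, sizeof(h));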

You can also combine constant memory and shared memory: fetch half of the list from constant memory and the other half from shared memory. This should let 1024 threads hide the latency of one memory type behind the other.
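
A rough sketch of that split, with illustrative names (cHalf uploaded once via cudaMemcpyToSymbol, gHalf holding the second half in global memory): each loop iteration issues one constant-cache fetch and one shared-memory fetch, giving the scheduler two independent paths whose latencies can overlap.

    #define LIST_N 12000
    #define HALF (LIST_N / 2)

    // First half of the list lives in constant memory, uploaded with
    // cudaMemcpyToSymbol(cHalf, hostPtr, HALF * sizeof(float)).
    __constant__ float cHalf[HALF];

    __global__ void splitFetch(const float* __restrict__ gHalf,  // second half
                               const float* __restrict__ in,
                               float* __restrict__ out)
    {
        // Second half is staged into shared memory by every block.
        // 6000 floats = 24 KB per block, which will limit occupancy.
        __shared__ float sHalf[HALF];
        for (int i = threadIdx.x; i < HALF; i += blockDim.x)
            sHalf[i] = gHalf[i];
        __syncthreads();

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        float x = in[gid];
        float acc = 0.0f;
        // One constant fetch and one shared fetch per iteration.
        for (int k = 0; k < HALF; ++k)
            acc += cHalf[k] * x + sHalf[k] * x;
        out[gid] = acc;
    }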

If the list is small enough, you can use registers directly (the indices have to be known at compile time). But this increases register pressure and may decrease occupancy, so be careful with it.
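
A small sketch of the register variant (hypothetical kernel): with a const, compile-time-sized array and a fully unrolled loop, every index is a compile-time constant, so the compiler can keep the values in registers instead of spilling to local memory.

    __global__ void regList(const float* __restrict__ in,
                            float* __restrict__ out, int n)
    {
        // Small compile-time list; #pragma unroll makes every index constant.
        const float list[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = 0.0f;
        #pragma unroll
        for (int k = 0; k < 4; ++k)
            acc += list[k] * in[i];
        out[i] = acc;
    }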

Some old CUDA architectures (in the case of an FMA operation) required one operand to be fetched from constant memory and the other from a register to achieve better performance in compute-bottlenecked algorithms.

In a test applying a filter of 12000 floats to all threads' inputs, the shared-memory version with 128 threads per block completed the work in 330 milliseconds, while the constant-memory version completed in 260 milliseconds. L1 access performance was the real bottleneck in both versions, so the real constant-memory performance is even better, as long as all threads use a similar index.
