How to implement radix sort on multiple GPUs?

Posted 2024-10-02 05:46:56


How to implement radix sort on multi-GPU – the same way as on a single GPU, i.e. by splitting the data, building histograms on the separate GPUs, and then merging the data back (like a bunch of cards)?
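For concreteness, here is a minimal sketch of the per-digit histogram step each GPU would run on its own shard under that scheme, assuming 4-bit digits; the kernel name is illustrative, and the cross-GPU histogram merge and the scatter passes are omitted:

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Count the 4-bit digit at `shift` for every key in this GPU's shard.
// Launch with at least 16 threads per block; `hist` is a 16-bin global
// histogram for this shard, to be combined with the other GPUs' histograms
// before the scatter pass for this digit.
__global__ void digitHistogram(const uint32_t* keys, size_t n,
                               int shift, unsigned int* hist)
{
    __shared__ unsigned int local[16];
    if (threadIdx.x < 16) local[threadIdx.x] = 0;
    __syncthreads();

    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x) {
        unsigned int digit = (keys[i] >> shift) & 0xFu;
        atomicAdd(&local[digit], 1u);   // per-block counts in shared memory
    }
    __syncthreads();

    if (threadIdx.x < 16) atomicAdd(&hist[threadIdx.x], local[threadIdx.x]);
}
```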

Comments (2)

软糖 2024-10-09 05:46:56

That method would work, but I don't think it would be the fastest approach. Specifically, merging histograms for every K bits (K=4 is currently best) would require the keys to be exchanged between GPUs 32/K = 8 times to sort 32-bit integers. Since the memory bandwidth between GPUs (~5GB/s) is much lower than the memory bandwidth on a GPU (~150GB/s), this will kill performance.

A better strategy would be to split the data into multiple parts, sort each part in parallel on a different GPU, and then merge the parts once at the end. This approach requires only one inter-GPU transfer (vs. 8 above) so it will be considerably faster.
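A minimal host-side sketch of that split / sort-per-GPU / merge-once strategy, assuming 32-bit keys and using Thrust for the on-device sort; the function names and the CPU-side final merge are illustrative choices, and error handling and concurrent shard dispatch are omitted:

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Sort one shard on the GPU given by `device`, then copy the sorted shard
// back into host memory. Each shard can be handled by a different GPU.
void sortShardOnGpu(int device, std::vector<uint32_t>& shard)
{
    cudaSetDevice(device);
    thrust::device_vector<uint32_t> d(shard.begin(), shard.end());
    thrust::sort(d.begin(), d.end());   // radix sort for primitive key types
    thrust::copy(d.begin(), d.end(), shard.begin());
}

// Split, sort each part on its own GPU, then merge once at the end.
std::vector<uint32_t> multiGpuSort(const std::vector<uint32_t>& keys, int numGpus)
{
    const size_t chunk = (keys.size() + numGpus - 1) / numGpus;
    std::vector<std::vector<uint32_t>> shards;
    for (int g = 0; g < numGpus; ++g) {
        const size_t lo = std::min<size_t>(size_t(g) * chunk, keys.size());
        const size_t hi = std::min<size_t>(lo + chunk, keys.size());
        shards.emplace_back(keys.begin() + lo, keys.begin() + hi);
        sortShardOnGpu(g, shards.back());   // could run concurrently, one host thread per GPU
    }
    // One inter-GPU/host transfer of the sorted shards, then a single merge
    // (done on the CPU here for simplicity; thrust::merge on one GPU works too).
    std::vector<uint32_t> result = shards.empty() ? std::vector<uint32_t>{} : shards[0];
    for (size_t g = 1; g < shards.size(); ++g) {
        std::vector<uint32_t> merged(result.size() + shards[g].size());
        std::merge(result.begin(), result.end(),
                   shards[g].begin(), shards[g].end(), merged.begin());
        result.swap(merged);
    }
    return result;
}
```

In practice the shards would be sorted concurrently (one host thread or stream per GPU), and the final merge could also run on a single GPU with thrust::merge if the merged output fits in that GPU's memory.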

街道布景 2024-10-09 05:46:56

Unfortunately this question is not adequately posed. It depends on element size, where the elements begin life in memory, and where you want the sorted elements to end up residing.

Sometimes it's possible to compress the sorted list by storing elements in groups that share a common prefix, or you can deduplicate elements on the fly, storing each element once in the sorted list with an associated count. For example, you might sort a huge list of 32-bit integers into 64K distinct lists of 16-bit values, cutting your memory requirement in half.
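As a rough, host-side illustration of that prefix-compression idea (the struct and function names are made up, and per-bucket container overhead is ignored):

```cuda
#include <cstdint>
#include <vector>

// Prefix-compressed layout for a sorted list of 32-bit keys: one bucket per
// 16-bit high prefix, each bucket holding only the low 16 bits of its keys.
// The prefix is implied by the bucket index, so each stored element shrinks
// from 4 bytes to 2.
struct PrefixCompressed {
    std::vector<std::vector<uint16_t>> buckets;   // 64K lists of low halves
    PrefixCompressed() : buckets(1u << 16) {}
};

PrefixCompressed compress(const std::vector<uint32_t>& sortedKeys)
{
    PrefixCompressed out;
    for (uint32_t k : sortedKeys)
        out.buckets[k >> 16].push_back(static_cast<uint16_t>(k & 0xFFFFu));
    return out;
}
```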

The general principle is that you want to make as few passes over the data as possible, and your throughput will almost always be set by the bandwidth constraints associated with your storage policy.

If your data set exceeds the size of fast memory, you probably want to finish with a merge pass rather than continue to radix sort, as another person has already answered.

I'm just getting into GPU architecture and I don't understand the K=4 comment above. I've never seen an architecture yet where such a small K would prove optimal.

I suspect merging histograms is also the wrong approach. I'd probably let the elements fragment in memory rather than merge histograms. Is it that hard to manage meso-scale scatter/gather lists in the GPU fabric? I sure hope not.

Finally, it's hard to conceive of a reason why you would want to involve multiple GPUs for this task. Say your card has 2GB of memory and 60GB/s of write bandwidth (that's what my mid-range card is showing). A three-pass radix sort (11-bit histograms) requires 6GB of writes (likely your rate-limiting factor), or about 100ms to sort a 2GB list of 32-bit integers. Great, they're sorted, now what? If you need to ship them anywhere else without some kind of preprocessing or compression, the sorting time will be small fish.
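For reference, the back-of-the-envelope numbers behind that estimate, assuming 11-bit digits and the stated 60GB/s write bandwidth:

```cuda
#include <cstdio>

int main()
{
    // ceil(32 / 11) = 3 scatter passes for 32-bit keys, i.e. roughly three
    // full writes of the data set.
    const int keyBits = 32, digitBits = 11;
    const int passes = (keyBits + digitBits - 1) / digitBits;    // 3
    const double dataGB = 2.0, writeBwGBs = 60.0;                // figures assumed above
    const double seconds = passes * dataGB / writeBwGBs;         // ~0.1 s
    std::printf("%d passes, ~%.0f ms\n", passes, seconds * 1000.0);
    return 0;
}
```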

In any case, I just compiled my first example programs today. There's still a lot to learn. My target application is permutation-intensive, which is closely related to sorting. I'm sure I'll weigh in on this subject again in the future.
