Quicksort in GLSL?

Posted on 2024-07-16 23:11:28

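I'm considering porting a large chunk of processing to the GPU using GLSL shaders. One of the immediate problems I stumbled across is that in one of the steps, the algorithm needs to maintain a list of elements, sort them, and take the few largest ones (how many depends on the data). On the CPU this is simply done with an STL vector and qsort(), but in GLSL I don't have such facilities. Is there a way to make up for this deficiency?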

你的他你的她 2024-07-23 23:11:29

Disclosure: I really don't know GLSL. I've been doing GPGPU programming with the AMD Stream SDK, which has a different programming language.

From your comment on Bjorn's answer, I gather that you are not interested in using the GPU to sort a huge database, like creating a reverse phone book or whatever. Instead, you have a small dataset, and each fragment has its own dataset to sort. More like trying to do median pixel filtering?

I can only say in general:

For small datasets, the sort algorithm really doesn't matter. While people have spent careers worrying about which sort algorithm is best for very large databases, for small N it really doesn't matter whether you use quicksort, heap sort, radix sort, Shell sort, optimized bubble sort, unoptimized bubble sort, etc. At least it doesn't matter much on a CPU.

GPUs are SIMD devices, so they like to have each kernel executing the same operations in lock step. Calculations are cheap but branches are expensive, and data-dependent branches where each kernel branches a different way are very, very, very expensive.
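To make the distinction concrete, here is a minimal sketch of one compare-exchange step written both ways. It is plain C++ standing in for shader code, and the function names are made up for illustration. Both versions leave the smaller value in `a` and the larger in `b`; only the second does so without a data-dependent branch.

```cpp
#include <algorithm>

// Branching version: whether the swap runs depends on the data,
// so neighbouring GPU threads could take different paths.
inline void compareExchangeBranchy(float& a, float& b) {
    if (a > b) {
        std::swap(a, b);
    }
}

// Branchless version: every thread evaluates the same min and max,
// whatever the values are.
inline void compareExchangeBranchless(float& a, float& b) {
    const float lo = std::min(a, b);
    const float hi = std::max(a, b);
    a = lo;
    b = hi;
}
```

GLSL has built-in `min` and `max` functions, so the branchless form should carry over fairly directly.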

So if each kernel has its own small dataset to sort, and the number of items to sort is data dependent and could differ from kernel to kernel, you're probably better off picking a maximum size (if you can), padding the arrays with Infinity or some large number, and having each kernel perform the exact same sort, which would be an unoptimized branchless bubble sort, something like this:

Pseudocode (since I don't know GLSL), sorting 9 points:

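// Branchless compare-exchange: afterwards a <= b.
// Uses min/max rather than arithmetic so Infinity padding never turns into NaN.
#define TwoSort(a, b) { float tmp = min(a, b); b = max(a, b); a = tmp; }

// Bubble sort over A[0..8] with fixed trip counts: 36 compare-exchanges in total.
for (int n = 8; n > 0; --n) {
  for (int i = 0; i < n; ++i) {
    TwoSort(A[i], A[i + 1]);
  }
}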
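For anyone who wants to try the idea on the CPU first, here is a rough C++ sketch of the whole scheme: pad to a fixed size with +Infinity, run the fixed-length sort, then read the largest real values off the end of the unpadded part. The maximum of 9 and the names `kMaxItems`, `sortFixed`, and `padAndSort` are assumptions for illustration, not part of any API.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <limits>

constexpr std::size_t kMaxItems = 9;  // assumed upper bound on the list length

// Sorts ascending using the same fixed sequence of 36 compare-exchanges
// for every input.
void sortFixed(std::array<float, kMaxItems>& a) {
    for (std::size_t n = kMaxItems - 1; n > 0; --n) {
        for (std::size_t i = 0; i < n; ++i) {
            const float lo = std::min(a[i], a[i + 1]);
            const float hi = std::max(a[i], a[i + 1]);
            a[i] = lo;
            a[i + 1] = hi;
        }
    }
}

// Copies up to kMaxItems real values, pads the rest with +Infinity, and sorts.
// Afterwards the real values sit at indices [0, count) in ascending order,
// so the k largest are at indices [count - k, count).
std::array<float, kMaxItems> padAndSort(const float* values, std::size_t count) {
    std::array<float, kMaxItems> a;
    a.fill(std::numeric_limits<float>::infinity());
    for (std::size_t i = 0; i < count && i < kMaxItems; ++i) {
        a[i] = values[i];
    }
    sortFixed(a);
    return a;
}
```

With five real values and k = 2, for example, the two largest end up at indices 3 and 4, just before the padding. This assumes the real data never contains +Infinity itself.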
