GPU 上的并行冒泡排序

发布于 2024-10-23 00:41:07 字数 354 浏览 10 评论 0原文

我正在使用 CUDA 实现简单的冒泡排序算法，我有一个问题。
我执行以下代码以交换数组中的 2 个连续元素：

if(a[threadIdx.x]>a[threadIdx.x + 1])
    Swap(a[threadIdx.x] , a[threadIdx.x + 1]);

请注意，块中的线程数是数组大小的一半。这是一个好的实施吗？即使存在分支，单个 warp 中的线程也会并行执行吗？因此实际上需要 N 次迭代才能对数组进行排序？

另请注意，我知道我可以实现更好的排序算法，并且我可以使用 Thrust、CUDPP 或 SDK 中的示例排序算法，但就我而言，我只需要一个简单的算法来实现。

原文

i am implementing the simple bubble sort algorithm using CUDA, and i have a question.
i perform the following code in order to swap 2 consecutive elements in the array:

if(a[threadIdx.x]>a[threadIdx.x + 1])
    Swap(a[threadIdx.x] , a[threadIdx.x + 1]);

note that the number of threads in the block is half the size of the array. Is this a good implementation? would threads in a single warp execute in parallel even if there is a branch? therefore it would actually take N iterations in order to sort the array?

also note that i know that there are better sorting algorithms that i could implement,and i can use Thrust, CUDPP, or a sample sorting algorithm from the SDK, but in my case, i just need a simple algorithm to implement.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

柠檬色的秋千 2024-10-30 00:41:07

我假设：

您想要排序的数组很小（<100 个元素）
它是一些更大的 GPU 算法的一部分
该数组驻留在共享内存空间中，或者可以复制到那里

如果其中任何一个不正确，则不要进行冒泡排序！

块中的线程数是数组大小的一半。这是一个好的实现吗？

这是合理的。当扭曲中出现发散分支时，所有线程都会完美同步地执行所有分支，只是某些线程设置了“禁用”标志。这样，每个分支仅执行一次。唯一的例外——当来自 warp 的线程没有采用分支时——那么该分支将被简单地跳过。

BUG！

但是在你的代码中我发现了一个问题。如果希望一个线程对数组的两个元素进行操作，请让它们独占处理，即：

if(a[2*threadIdx.x]>a[2*threadIdx.x + 1])
    Swap(a[2*threadIdx.x] , a[2*threadIdx.x + 1]);

否则，如果 Swap 由两个相邻线程执行，则某些值可能会消失，而另一些值可能会消失。在数组中重复。

另一个错误！

如果您的块大于扭曲大小，请记住在需要时放置__syncthreads()。即使您的块较小（不应该），您也应该检查 __threadfence_block() 以确保同一块的其他线程可以看到对共享内存的写入。否则，编译器可能会过于激进地进行优化并使您的代码无效。

另一个问题

如果您修复了第一个错误，您的共享内存上将会出现 2 路存储体冲突。这并不是非常重要，但您可能需要重新组织数组中的数据以避免它们，例如按以下顺序排列连续元素：

[1, 3, 5, 7, 9, . .., 29, 31, 2, 4, 6, 8, ... , 30, 32]

这样，元素 1 和 2 属于共享内存中的同一存储体。

I assume:

The array you want to sort is small (<100 elements)
It is a part of some bigger GPU algorithm
The array resides in shared memory space, or can be copied there

If any of these is not true, don't do bubble sort!

the number of threads in the block is half the size of the array. Is this a good implementation?

It is reasonable. When a divergent branch occurs in a warp, all threads execute all branches in a perfect sync, simply some threads have their flag "disabled" set. This way, every branch is executed only once. The only exception --- when no thread from a warp is taking a branch --- then the branch is simply skipped.

BUG!

In your code, however I see a problem. If you want one thread to operate on two elements of the array, make them handle it exclusively, that is:

if(a[2*threadIdx.x]>a[2*threadIdx.x + 1])
    Swap(a[2*threadIdx.x] , a[2*threadIdx.x + 1]);

Otherwise, if Swap is executed by two neighbouring threads, some values may dissapear and some other values may get duplicated in the array.

Another bug!

If your block is bigger than a warp size, remember to put __syncthreads() when needed. Even if your block is smaller (shouldn't be), you should check the __threadfence_block() to ensure that writes to shared memory are visible by other threads of the same block. Otherwise, the compiler may be too aggressive on optimisations and make your code invalid.

Another problem

If you fix the first bug, you will have 2-way bank conflict on your shared memory. It is not very important, but you might want to reorganise the data in your array to avoid them, e.g. have the consecutive elements in the following order:

[1, 3, 5, 7, 9, ..., 29, 31, 2, 4, 6, 8, ... , 30, 32]

This way, elements 1 and 2 belong to the same bank in shared memory.

回复收藏 0 原文

折戟 2024-10-30 00:41:07

我很高兴您意识到 GPU 上的冒泡排序可能表现得非常糟糕！我正在努力弄清楚如何在不必启动许多内核的情况下获得足够的并行性。此外，完成后您可能会很难锻炼。

不管怎样，回答你的具体问题：是的，在这种情况下你很可能会出现扭曲发散。但是，鉴于“else”分支实际上是空的，这不会减慢您的速度。平均而言（直到此列表排序），warp 中大约一半的线程将采用“if”分支，其他线程将等待，然后当“if”分支完成时，warp 线程可以恢复原状脚背。这远不是你最大的问题:)

回复收藏 0 原文

~没有更多了~