双调排序网络与 Thrust::sort_by_key
我实现了一种使用排序的算法。我尝试了 Thrust::sort_by_key ,它花费了大约 0.4 秒的时间对包含 10^7 个元素的数组进行排序。
我认为双调排序网络应该比 Thrust::sort_by_key 更快。然而,双调排序对上述相同的数组进行排序大约需要 2.5 秒。我使用了SDK提供的双调排序网络。我只是稍微修改了原来的双音排序。
你能告诉我为什么吗?或者给我一些建议?
谢谢,
Yik
2011 年 8 月 15 日
I implemented an algorithm which used sorting. I tried Thrust::sort_by_key that took around 0.4s to sort an array with 10^7 elements.
I thought bitonic sorting network should be faster than Thrust::sort_by_key. However, bitonic sorting took about 2.5s to sort the same array mentioned above. I used the bitonic sorting network provided by SDK. I just modified the original bitonic sort a little bit.
Could you tell me why? or give me some advice?
Thanks,
Yik
Aug, 15, 2011
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
简而言之,CUDA SDK 提供的双调排序示例主要用于教学目的。它根本不如 Thrust 的排序实现那么优化,Thrust 的排序实现基于
两种算法的渐近性能也不同。双调排序网络的复杂度为
O(n lg^2 n)
,而 Thrust 使用的基数排序的复杂度更像是O(#bits n)
复杂度,其中 < code>#bits 表示表示输入数据所需的位数。换句话说,基数排序比双调排序需要渐近更少的全局内存读取和写入。您可能会发现,对于较小的工作负载,双调排序性能更好。
The short answer is that the bitonic sorting example provided by the CUDA SDK is primarily meant to be pedagogical. It simply isn't as optimized as Thrust's sorting implementation, which is based on the very efficient radix sort provided by Merrill.
The asymptotic performance of the two algorithms is also different. A bitonic sorting network has
O(n lg^2 n)
complexity, while the radix sort employed by Thrust has something more likeO(#bits n)
complexity, where#bits
denotes the number of bits required to represent the input data. In other words, the radix sort requires asymptotically fewer global memory reads and writes than does the bitonic sort.You might find that for smaller workloads, bitonic sort performs better.