快速寄存器内排序字节?
给定 4 个字节的寄存器(对于 SIMD,为 16 个字节),必须有一种有效的方法来使用一些指令对寄存器中的字节进行排序。
提前致谢。
Given a register of 4 bytes (or 16 for SIMD), there has to be an efficient way to sort the bytes in-register with a few instructions.
Thanks in advance.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
所有排序算法都需要将值从一个地方“交换”到另一个地方。由于您谈论的是文字 CPU 寄存器,这意味着任何类型都需要另一个寄存器来用作临时位置来保存正在交换的字节。
我从未见过具有用于对寄存器内的字节进行排序的内置方法的芯片。并不是说它还没有完成,但我想不出这样的指令有多少用途。
All sorting algorithms require "swapping" values from one place to another. Since you're talking about a literal CPU register, that means any sort would need another register to use as a temporary place to hold the bytes being swapped.
I've never seen a chip with a built-in method for sorting bytes within a register. Not saying it hasn't been done, but I can't think of many uses for such an instruction.
找到了!它出现在 Furtak、Amaral 和 Nieviadomski 于 2007 年发表的论文“使用 SIMD 寄存器和指令在排序算法中启用指令级并行性”中。第 4 部分。
它使用 4 个 SSE 寄存器,有 12 个步骤,运行包括加载和存储在内的 19 条指令。
同一篇论文在使用 SIMD 动态构建排序网络方面也有一些出色的工作。
Found it! It's in the 2007 paper "Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms" by Furtak, Amaral, and Niewiadomski. Section 4.
It uses 4 SSE registers, has 12 steps, and runs in 19 instructions including load and store.
The same paper has some excellent work on dynamically making sorting networks with SIMD.
查找高效的排序网络,其中 N = 您关心的字节数(4 或 16) 。将其转换为比较和交换指令序列。 (不过,对于 N=16,这将不仅仅是“一些”。)
Look up an efficient sorting network for N = the number of bytes you care about (4 or 16). Convert that to a sequence of compare and exchange instructions. (For N=16 that'll be more than 'a few', though.)
为了加快字符串排序速度,我最终将每个双精度数打包 7 个字节,并对 SSE2 中的 16 个双精度数数组进行排序(排名),使用双调排序创建两次 8 个运行,并使用二进制合并来合并这两个运行。您可以在这里看到第一部分 http://mischasan.wordpress.com/2011/07/29/okay-one-more-poke-at-sse2-sorting-doubles/(asm)和这里http://mischasan.wordpress.com/2011/09/ 02/update-on-bitonic-sse2-sort-of-16-doubles/ (C),以及 bitonic 合并步骤(如果你想一直使用 SSE):http://mischasan.wordpress.com/ 2012/11/04/sse2-odd-even-merge-the-last-step-in-sorting/ 。我用这种排序替换了qsort底部的插入排序,它的速度大约是直接qsort的5倍。 HTH
我还没看过 UofA 的论文;双调逻辑来自老式 (CTM) GPGPU 编程。
对于嵌入的链接字符串感到抱歉;我不知道如何在评论 stackoverflow 中添加可点击的链接。
To speed up sorting of strings, I ended up packing 7 bytes per double and sorting (ranking) an array of 16 doubles in SSE2, using bitonic sort to create two runs of 8, and a binary merge to merge the two runs. You can see the first part here http://mischasan.wordpress.com/2011/07/29/okay-one-more-poke-at-sse2-sorting-doubles/ (asm) and here http://mischasan.wordpress.com/2011/09/02/update-on-bitonic-sse2-sort-of-16-doubles/ (C), and the bitonic merge step (if you want to go SSE all the way) here: http://mischasan.wordpress.com/2012/11/04/sse2-odd-even-merge-the-last-step-in-sorting/ . I replaced the insertion sort at the bottom of qsort with this sort, and it's about 5 times as fast as straight qsort. HTH
I hadn't seen the UofA paper; the bitonic logic is from old school (CTM) GPGPU programming.
Sorry about the embedded link strings; I don't know how to add clickable links in comments stackoverflow.