在位上转置 8x8 块中的位的最快方法是什么?
我不确定我想要做的事情的确切术语。我有一个 8x8
的 位
块存储在 8 个字节
中,每个字节存储一行。当我完成后,我希望每个字节存储一列。
例如,当我完成时:
Byte0out = Byte0inBit0 + Bit0inByte1 + Bit0inByte2 + Bit0inByte3 + ...
Byte1out = Bit1inByte0 + Bit1inByte1 + Bit1inByte2 + Bit1inByte3 + ...
在 C 中执行此操作的最简单方法是什么?并且性能良好?这将在 dsPIC 微控制器上运行
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
![扫码二维码加入Web技术交流群](/public/img/jiaqun_03.jpg)
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
这段代码直接抄袭自“黑客之乐” - 图 7-2 转置 8x8 位矩阵,我对此没有任何功劳:
我没有检查它是否按照您需要的方向旋转,如果不是您可能需要调整代码。
另外,请记住数据类型和数据类型。尺寸 -
int
&unsigned (int)
在您的平台上可能不是 32 位。顺便说一句,我怀疑这本书(《黑客之乐》)对于你正在做的工作来说是必不可少的……看看吧,里面有很多很棒的东西。
This code is cribbed directly from "Hacker's Delight" - Figure 7-2 Transposing an 8x8-bit matrix, I take no credit for it:
I didn't check if this rotates in the direction you need, if not you might need to adjust the code.
Also, keep in mind datatypes & sizes -
int
&unsigned (int)
might not be 32 bits on your platform.BTW, I suspect the book (Hacker's Delight) is essential for the kind of work you're doing... check it out, lots of great stuff in there.
如果您正在寻找最简单的解决方案:
如果您正在寻找最快的解决方案:
如何利用 SSE2 在程序集中转置位矩阵。
If you are looking for the simplest solution:
If your are looking for the fastest solution:
How to transpose a bit matrix in the assembly by utilizing SSE2.
这听起来很像使用位平面的显示器上使用的所谓“厚块到平面”例程。以下链接使用 MC68K 汇编器作为其代码,但提供了问题的一个很好的概述(假设我正确理解了问题):
http://membres.multimania.fr/amycoders/sources/c2ptut.html
This sounds a lot like a so-called "Chunky to planar" routine used on displays that use bitplanes. The following link uses MC68K assembler for its code, but provides a nice overview of the problem (assuming I understood the question correctly):
http://membres.multimania.fr/amycoders/sources/c2ptut.html
Lisp 原型:
这就是运行代码的方式:
有时我会反汇编代码以检查是否存在对安全函数的不必要的调用。
这是一个基准。经常运行该函数以处理(二进制)HDTV 图像。
这只花了 51 毫秒。请注意,我花了很多钱,因为该函数始终分配新的 8 字节数组。我确信 C 语言的实现可以进行更多调整。
这里还有一些测试用例:
现在我真的想看看我的代码与 Andrejs Cainikovs 的 C 解决方案相比如何
(编辑:我认为这是错误的):
并对它进行基准测试:
HDTV 图像上的每个循环需要 2.5 毫秒。这比我未优化的 Lisp 快很多。
不幸的是,C 代码没有给出与我的 lisp 相同的结果:
Lisp prototype:
This is how you can run the code:
Occasionally I disassemble code to check that there are no unnecessary calls to safety functions.
This is a benchmark. Run the function often enough to process a (binary) HDTV image.
That took only took 51ms. Note that I'm consing quite a lot because the function allocates new 8 byte arrays all the time. I'm sure an implementation in C can be tweaked a lot more.
Here are some more test cases:
Now I really want to see how my code compares to Andrejs Cainikovs' C solution
(Edit: I think its wrong):
And benchmarking it:
Each loop over the HDTV image takes 2.5ms. That is quite a lot faster than my unoptimized Lisp.
Unfortunately the C code doesn't give the same results like my lisp:
这类似于获取位板问题中的列,并且可以通过将这些输入字节视为 8 个字节来有效解决64 位整数。如果位 0 是最低有效位,字节 0 是数组中的第一个字节,那么我假设您想要
对 bXY 执行以下操作,即字节 X 的位号 Y。在这种形式中,旋转最左边的列只是打包所有将最高有效位以相反的顺序转换为单个字节,并且类似地可以旋转其他列。
为此,我们屏蔽掉所有最后 7 列并将数组读取为
uint64_t
。结果采用小端字节序,
abcdefgh
分别为 b07 到 b77。现在我们只需将该值与 幻数 0x0002040810204081 相乘即可得到hgfedcba
的值在 MSB 中,这是我们所期望的,因为您将 8 字节数组视为 uint64_t,您可能需要正确对齐阵列以获得更好的性能,因为这样只需要单个内存负载
在 AVX2 中,Intel 引入了 PDEP 指令(可通过
_pext_u64
内在函数访问) href="https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#BMI2" rel="nofollow noreferrer">BMI2 指令集用于此目的,因此该功能可以在单个指令中完成但不幸的是,这赢了无法按照您的预期在 dsPIC 中工作
更多转置数组的方法可以在
This is similar to the get column in a bitboard problem and can be solved efficiently by considering those input bytes as 8 bytes of a 64-bit integer. If bit 0 is the least significant one and byte 0 is the first byte in the array then I assume you want to do the following
with bXY is byte X's bit number Y. In that form rotating the left-most column is just packing all the most significant bits into a single byte in reverse order, and similarly other columns can be rotated
To do that we mask out all the last 7 columns and read the array as an
uint64_t
. The result isin little endian, with
abcdefgh
are b07 to b77 respectively. Now we just need to multiply that value with the magic number 0x0002040810204081 to make a value withhgfedcba
in the MSB which is what we expectedBecause you treat the 8-byte array as
uint64_t
, you may need to align the array properly to get better performance because that way only a single memory load is neededIn AVX2 Intel introduced the PDEP instruction (accessible via the
_pext_u64
intrinsic) in the BMI2 instruction set for this purpose so the function can be done in a single instructionBut unfortunately this won't work in dsPIC as you expected
More ways to transpose the array can be found in the chess programming wiki
您确实想使用 SIMD 指令和 GCC 矢量矢量支持来执行类似的操作: http:// /ds9a.nl/gcc-simd/example.html
You really want to do something like this with SIMD instructions with something like the GCC vector vector support: http://ds9a.nl/gcc-simd/example.html
如果您想要一个优化的解决方案,您可以使用 x86 中的 SSE 扩展。
您需要使用其中 4 个 SIMD 操作码。
和 2 个常规 x86 操作码
25 个 x86 操作码/指令,而不是堆栈
for
循环64 次迭代的解决方案。If you wanted an optimized solution you would use the SSE extensions in x86.
You'd need to use 4 of these SIMD opcodes.
And 2 regular x86 opcodes
25 x86 opcodes/instructions as opposed to the stacked
for
loop solution with 64 iterations.