What is the fastest way to transpose the bits in an 8x8 block of bits?

Posted 2024-11-27 12:15:18

I'm not sure of the exact term for what I'm trying to do. I have an 8x8 block of bits stored in 8 bytes, with each byte storing one row. When I'm finished, I'd like each byte to store one column.

For example, when I'm finished:

Byte0out = Bit0inByte0 + Bit0inByte1 + Bit0inByte2 + Bit0inByte3 + ...
Byte1out = Bit1inByte0 + Bit1inByte1 + Bit1inByte2 + Bit1inByte3 + ...

What is the easiest way to do this in C that still performs well? This will run on a dsPIC microcontroller.

铜锣湾横着走 2024-12-04 12:15:18

This code is cribbed directly from "Hacker's Delight" (Figure 7-2, "Transposing an 8x8-bit matrix"); I take no credit for it:

void transpose8(unsigned char A[8], int m, int n, 
                unsigned char B[8]) {
   unsigned x, y, t; 

   // Load the array and pack it into x and y. 

   x = (A[0]<<24)   | (A[m]<<16)   | (A[2*m]<<8) | A[3*m]; 
   y = (A[4*m]<<24) | (A[5*m]<<16) | (A[6*m]<<8) | A[7*m]; 

   t = (x ^ (x >> 7)) & 0x00AA00AA;  x = x ^ t ^ (t << 7); 
   t = (y ^ (y >> 7)) & 0x00AA00AA;  y = y ^ t ^ (t << 7); 

   t = (x ^ (x >>14)) & 0x0000CCCC;  x = x ^ t ^ (t <<14); 
   t = (y ^ (y >>14)) & 0x0000CCCC;  y = y ^ t ^ (t <<14); 

   t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F); 
   y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F); 
   x = t; 

   B[0]=x>>24;    B[n]=x>>16;    B[2*n]=x>>8;  B[3*n]=x; 
   B[4*n]=y>>24;  B[5*n]=y>>16;  B[6*n]=y>>8;  B[7*n]=y; 
}

I didn't check whether this rotates in the direction you need; if it doesn't, you may need to adjust the code.

Also, keep data types and sizes in mind: int and unsigned (int) might not be 32 bits on your platform.
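To make the size caveat concrete: on a target where int is 16 bits (which is typical for dsPIC compilers), an expression such as A[0]<<24 is not well defined. The following is a minimal sketch of the same steps using fixed-width types from <stdint.h>. The name transpose8_u32 is made up for this illustration, and the sketch keeps the listing's convention that A[0] is the top row and bit 7 is the leftmost column.

```c
#include <stdint.h>

/* Same sequence of steps as the transpose8 listing, but every shift is
   performed on a uint32_t, so it does not depend on the width of int.
   m and n are the strides of A and B (1 for a plain 8-byte block). */
void transpose8_u32(const uint8_t A[8], int m, int n, uint8_t B[8])
{
    uint32_t x, y, t;

    /* Pack rows 0..3 into x and rows 4..7 into y. */
    x = ((uint32_t)A[0]     << 24) | ((uint32_t)A[m]     << 16) |
        ((uint32_t)A[2 * m] <<  8) |  (uint32_t)A[3 * m];
    y = ((uint32_t)A[4 * m] << 24) | ((uint32_t)A[5 * m] << 16) |
        ((uint32_t)A[6 * m] <<  8) |  (uint32_t)A[7 * m];

    /* Swap bits inside 2x2 blocks. */
    t = (x ^ (x >> 7)) & 0x00AA00AAUL;  x = x ^ t ^ (t << 7);
    t = (y ^ (y >> 7)) & 0x00AA00AAUL;  y = y ^ t ^ (t << 7);

    /* Swap 2x2 blocks inside 4x4 blocks. */
    t = (x ^ (x >> 14)) & 0x0000CCCCUL;  x = x ^ t ^ (t << 14);
    t = (y ^ (y >> 14)) & 0x0000CCCCUL;  y = y ^ t ^ (t << 14);

    /* Swap the off-diagonal 4x4 blocks between x and y. */
    t = (x & 0xF0F0F0F0UL) | ((y >> 4) & 0x0F0F0F0FUL);
    y = ((x << 4) & 0xF0F0F0F0UL) | (y & 0x0F0F0F0FUL);
    x = t;

    /* Unpack the transposed rows. */
    B[0]     = (uint8_t)(x >> 24);  B[n]     = (uint8_t)(x >> 16);
    B[2 * n] = (uint8_t)(x >> 8);   B[3 * n] = (uint8_t)x;
    B[4 * n] = (uint8_t)(y >> 24);  B[5 * n] = (uint8_t)(y >> 16);
    B[6 * n] = (uint8_t)(y >> 8);   B[7 * n] = (uint8_t)y;
}
```

For a contiguous 8-byte block, call it as transpose8_u32(in, 1, 1, out). Applying it twice returns the original block, which makes a convenient sanity check on the target.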

BTW, I suspect the book (Hacker's Delight) is essential for the kind of work you're doing... check it out, lots of great stuff in there.

这样的小城市 2024-12-04 12:15:18

If you are looking for the simplest solution:

/* not tested, not even compiled */

#include <string.h>

/* Bit 7 is treated as column 0 (MSB-first):
   bit (7 - j) of bytes_out[i] is bit (7 - i) of bytes_in[j]. */
void bit_transpose(const unsigned char bytes_in[8], unsigned char bytes_out[8])
{
    memset(bytes_out, 0, 8);
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            bytes_out[i] = (unsigned char)((bytes_out[i] << 1) | ((bytes_in[j] >> (7 - i)) & 0x01));
        }
    }
}

If you are looking for the fastest solution:

See "How to transpose a bit matrix in the assembly by utilizing SSE2".

峩卟喜欢 2024-12-04 12:15:18

This sounds a lot like a so-called "Chunky to planar" routine used on displays that use bitplanes. The following link uses MC68K assembler for its code, but provides a nice overview of the problem (assuming I understood the question correctly):

http://membres.multimania.fr/amycoders/sources/c2ptut.html

平生欢 2024-12-04 12:15:18

Lisp prototype:

(declaim (optimize (speed 3) (safety 0)))
(defun bit-transpose (a)
  (declare (type (simple-array unsigned-byte 1) a))
  (let ((b (make-array 8 :element-type '(unsigned-byte 8))))
    (dotimes (j 8)
      (dotimes (i 8)
        (setf (ldb (byte 1 i) (aref b j))
              (ldb (byte 1 j) (aref a i)))))
    b))

This is how you can run the code:

#+nil
(bit-transpose (make-array 8 :element-type 'unsigned-byte
                           :initial-contents '(1 2 3 4 5 6 7 8)))
;; => #(85 102 120 128 0 0 0 0)

Occasionally I disassemble code to check that there are no unnecessary calls to safety functions.

#+nil
(disassemble #'bit-transpose)

This is a benchmark. Run the function often enough to process a (binary) HDTV image.

#+nil
(time 
 (let ((a (make-array 8 :element-type 'unsigned-byte
                      :initial-contents '(1 2 3 4 5 6 7 8)))
       (b (make-array 8 :element-type 'unsigned-byte
                      :initial-contents '(1 2 3 4 5 6 7 8))))
   (dotimes (i (* (/ 1920 8) (/ 1080 8)))
     (bit-transpose a))))

That took only 51 ms. Note that I'm consing quite a lot because the function allocates a new 8-byte array on every call. I'm sure an implementation in C can be tweaked a lot more.

Evaluation took:
  0.051 seconds of real time
  0.052004 seconds of total run time (0.052004 user, 0.000000 system)
  101.96% CPU
  122,179,503 processor cycles
  1,048,576 bytes consed

Here are some more test cases:

#+nil
(loop for j below 12 collect
  (let ((l (loop for i below 8 collect (random 255))))
    (list l (bit-transpose (make-array 8 :element-type 'unsigned-byte
                                       :initial-contents l)))))
;; => (((111 97 195 202 47 124 113 164) #(87 29 177 57 96 243 111 140))
;;     ((180 192 70 173 167 41 30 127) #(184 212 221 232 193 185 134 27))
;;     ((244 86 149 57 191 65 129 178) #(124 146 23 24 159 153 35 213))
;;     ((227 244 139 35 38 65 214 64) #(45 93 82 4 66 27 227 71))
;;     ((207 62 236 89 50 64 157 120) #(73 19 71 207 218 150 173 69))
;;     ((89 211 149 140 233 72 193 192) #(87 2 12 57 7 16 243 222))
;;     ((97 144 19 13 135 198 238 33) #(157 116 120 72 6 193 97 114))
;;     ((145 119 3 85 41 202 79 134) #(95 230 202 112 11 18 106 161))
;;     ((42 153 67 166 175 190 114 21) #(150 125 184 51 226 121 68 58))
;;     ((58 232 38 210 137 254 19 112) #(80 109 36 51 233 167 170 58))
;;     ((27 245 1 197 208 221 21 101) #(239 1 234 33 115 130 186 58))
;;     ((66 204 110 232 46 67 37 34) #(96 181 86 30 0 220 47 10)))

Now I really want to see how my code compares to Andrejs Cainikovs' C solution
(Edit: I think it's wrong):

#include <string.h>

unsigned char bytes_in[8]={1,2,3,4,5,6,7,8};
unsigned char bytes_out[8];

/* please fill bytes_in[] here with some pixel-crap */
void bit_transpose(){
  memset(bytes_out, 0, 8);
  int i,j;
  for(i = 0; i < 8; i++)
    for(j = 0; j < 8; j++) 
      bytes_out[i] = (bytes_out[i] << 1) | ((bytes_in[j] >> (7 - i)) & 0x01);
}

int
main()
{
  int j,i;
  for(j=0;j<100;j++)
    for(i=0;i<(1920/8*1080/8);i++)
      bit_transpose();
  return 0;
}

And benchmarking it:

wg@hp:~/0803/so$ gcc -O3 trans.c
wg@hp:~/0803/so$ time ./a.out 

real    0m0.249s
user    0m0.232s
sys     0m0.000s

Each loop over the HDTV image takes 2.5ms. That is quite a lot faster than my unoptimized Lisp.

Unfortunately, the C code doesn't give the same results as my Lisp:

#include <stdio.h>
int
main()
{
  int j,i;
  bit_transpose();
  for(i=0;i<8;i++)
    printf("%d ",(int)bytes_out[i]);
  return 0;
}

wg@hp:~/0803/so$ ./a.out 
0 0 0 0 1 30 102 170 

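
The two results differ in bit order rather than in kind: the C loop treats bit 7 as column 0 and fills each output byte MSB-first, while the Lisp prototype indexes bits from the least significant end, which is also the mapping the question describes (Byte0out collects bit 0 of every input byte). A minimal C sketch of the Lisp convention follows; bit_transpose_lsb is a name made up for this illustration.

```c
#include <stdint.h>

/* LSB-first transpose: bit i of out[j] is bit j of in[i],
   mirroring (setf (ldb (byte 1 i) (aref b j)) (ldb (byte 1 j) (aref a i))). */
void bit_transpose_lsb(const uint8_t in[8], uint8_t out[8])
{
    for (int j = 0; j < 8; j++) {
        uint8_t acc = 0;
        for (int i = 0; i < 8; i++) {
            /* Take bit j of input byte i and place it at bit i. */
            acc |= (uint8_t)(((in[i] >> j) & 1u) << i);
        }
        out[j] = acc;
    }
}
```

With the input 1 2 3 4 5 6 7 8 this produces 85 102 120 128 0 0 0 0, matching the Lisp output. The MSB-first result is the same data with the byte order and the bit order within each byte both reversed.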
影子是时光的心 2024-12-04 12:15:18

This is similar to the "get a column in a bitboard" problem and can be solved efficiently by treating the 8 input bytes as a single 64-bit integer. If bit 0 is the least significant bit and byte 0 is the first byte in the array, then I assume you want to do the following:

                              Column 7 becomes...
                              ↓
[ b07 b06 b05 b04 b03 b02 b01 b00   [ b70 b60 b50 b40 b30 b20 b10 b00 ← row 0
  b17 b16 b15 b14 b13 b12 b11 b10     b71 b61 b51 b41 b31 b21 b11 b01
  b27 b26 b25 b24 b23 b22 b21 b20     b72 b62 b52 b42 b32 b22 b12 b02
  b37 b36 b35 b34 b33 b32 b31 b30  →  b73 b63 b53 b43 b33 b23 b13 b03
  b47 b46 b45 b44 b43 b42 b41 b40  →  b74 b64 b54 b44 b34 b24 b14 b04
  b57 b56 b55 b54 b53 b52 b51 b50     b75 b65 b55 b45 b35 b25 b15 b05
  b67 b66 b65 b64 b63 b62 b61 b60     b76 b66 b56 b46 b36 b26 b16 b06
  b77 b76 b75 b74 b73 b72 b71 b70 ]   b77 b67 b57 b47 b37 b27 b17 b07 ]

where bXY is bit number Y of byte X. In that form, rotating the left-most column just means packing all the most significant bits into a single byte in reverse order, and the other columns can be rotated similarly.

To do that, we read the array as a uint64_t and mask out the last 7 columns. The result is

0b h0000000 g0000000 f0000000 e0000000 d0000000 c0000000 b0000000 a0000000
   ↑        ↑        ↑        ↑        ↑        ↑        ↑        ↑
   b77      b67      b57      b47      b37      b27      b17      b07

in little endian, where abcdefgh are b07 to b77 respectively. Now we just need to multiply that value by the magic number 0x0002040810204081 to get hgfedcba in the most significant byte, which is the result we want:

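#include <stdint.h>
#include <string.h>

uint8_t transpose_column(uint64_t matrix, unsigned col)
{
    const uint64_t column_mask = 0x8080808080808080ull;
    const uint64_t magic       = 0x0002040810204081ull;

    /* Move the wanted column into bit 7 of every byte, keep only those bits,
       then let the multiplication gather them into the top byte. */
    return (uint8_t)((((matrix << col) & column_mask) * magic) >> 56);
}

uint64_t block8x8;
memcpy(&block8x8, bytes_in, sizeof(block8x8));
#if __BYTE_ORDER == __BIG_ENDIAN
block8x8 = swap_bytes(block8x8);  /* swap_bytes: any 64-bit byte-swap helper */
#endif

for (unsigned i = 0; i < 8; i++)
    bytes_out[i] = transpose_column(block8x8, 7 - i);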

Because you treat the 8-byte array as a uint64_t, you may want to align the array properly: that way only a single memory load is needed, which helps performance.
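
As a self-contained illustration of the multiplication trick, here is a minimal sketch that packs the rows with shifts instead of memcpy, so it does not depend on the host byte order. The names pack_rows, gather_bit and transpose_mul are made up for this example. It gathers bit i of every input byte into output byte i, which is the mapping given in the question.

```c
#include <stdint.h>

/* Build the 64-bit block with in[0] in the least significant byte,
   independent of the machine's endianness. */
static uint64_t pack_rows(const uint8_t in[8])
{
    uint64_t m = 0;
    for (int k = 7; k >= 0; k--)
        m = (m << 8) | in[k];
    return m;
}

/* Collect bit `bit` of each of the 8 bytes into one byte:
   bit k of the result is bit `bit` of byte k. */
static uint8_t gather_bit(uint64_t matrix, unsigned bit)
{
    const uint64_t column_mask = 0x8080808080808080ULL;
    const uint64_t magic       = 0x0002040810204081ULL;

    /* Shift the wanted bit into bit 7 of every byte, isolate it, and let
       the multiplication slide the 8 isolated bits into the top byte. */
    return (uint8_t)((((matrix << (7u - bit)) & column_mask) * magic) >> 56);
}

void transpose_mul(const uint8_t in[8], uint8_t out[8])
{
    uint64_t m = pack_rows(in);
    for (unsigned i = 0; i < 8; i++)
        out[i] = gather_bit(m, i);
}
```

The multiplication works because each isolated bit sits at position 8k+7 and each set bit of the magic constant sits at a multiple of 7, so exactly one partial product per input byte lands in bits 56 to 63 and no two partial products collide. On a 16-bit core such as a dsPIC, 64-bit shifts and multiplies are synthesized from several instructions, so it is worth measuring this against the simpler loops before adopting it.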


Alongside AVX2 (in Haswell), Intel introduced the PEXT instruction (accessible via the _pext_u64 intrinsic) as part of the BMI2 instruction set, so on those CPUs each output byte can be produced by a single instruction:

bytes_out[i] = _pext_u64(block8x8, column_mask >> (7 - i));

Unfortunately, that is x86-only, so it won't work on a dsPIC.

More ways to transpose the array can be found in the chess programming wiki

嗳卜坏 2024-12-04 12:15:18

You really want to do something like this with SIMD instructions, using something like the GCC vector support: http://ds9a.nl/gcc-simd/example.html

时光清浅 2024-12-04 12:15:18

If you wanted an optimized solution you would use the SSE extensions in x86.

You'd need three SIMD opcodes:

  • MOVQ - move 8 bytes
  • PSLLW - packed shift left logical words
  • PMOVMSKB - packed move mask byte

And two regular x86 opcodes:

  • LEA - load effective address
  • MOV - move

byte[] m = byte[8]; //input
byte[] o = byte[8]; //output
LEA ecx, [o]
// ecx = the address of the output array/matrix
MOVQ xmm0, [m]
// xmm0 = 0|0|0|0|0|0|0|0|m[7]|m[6]|m[5]|m[4]|m[3]|m[2]|m[1]|m[0]
PMOVMSKB eax, xmm0
// eax = m[7][7]...m[0][7] the high bit of each byte
MOV [ecx+7], al
// o[7] is now the last column
PSLLW xmm0, 1
// shift 1 bit to the left
PMOVMSKB eax, xmm0
MOV [ecx+6], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+5], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+4], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+3], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+2], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+1], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx], al

That is 25 x86 instructions, as opposed to the nested for-loop solution with its 64 iterations.
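
The same instruction sequence can be written in C with SSE2 intrinsics, which leaves register allocation and unrolling to the compiler. This is a minimal sketch under the assumption of an x86 or x86-64 build target; the function name transpose8_sse2 is made up here, while the intrinsics come from <emmintrin.h>.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics, x86/x86-64 only */

/* o[col] receives bit `col` of every input byte: bit k of o[col] is
   bit col of m[k], matching the assembly listing above. */
void transpose8_sse2(const uint8_t m[8], uint8_t o[8])
{
    /* MOVQ: load 8 bytes into the low half, upper half zeroed. */
    __m128i v = _mm_loadl_epi64((const __m128i *)m);

    for (int col = 7; col >= 0; col--) {
        /* PMOVMSKB: collect the high bit of each byte. */
        o[col] = (uint8_t)_mm_movemask_epi8(v);
        /* PSLLW: shift left by one so the next lower bit becomes the high bit. */
        v = _mm_slli_epi16(v, 1);
    }
}
```

The loop performs one more shift than the hand-written listing (after the last store), which an optimizing compiler can drop when it unrolls the loop. Like the assembly, this is not usable on a dsPIC.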
