How do I insert zeros between the bits in a bitmap?
I have some performance-heavy code that performs bit manipulations. It can be reduced to the following well-defined problem:
Given a 13-bit bitmap, construct a 26-bit bitmap that contains the original bits spaced at even positions.
To illustrate:
0000000000000000000abcdefghijklm (input, 32 bits)
0000000a0b0c0d0e0f0g0h0i0j0k0l0m (output, 32 bits)
I currently have it implemented in the following way in C:
if (input & (1 << 12))
    output |= 1 << 24;
if (input & (1 << 11))
    output |= 1 << 22;
if (input & (1 << 10))
    output |= 1 << 20;
...
My compiler (MS Visual Studio) turned this into the following:
test eax,1000h
jne 0064F5EC
or edx,1000000h
... (repeated 13 times with minor differences in constants)
I wonder whether I can make it any faster. I would like to have my code written in C, but switching to assembly language is possible.
- Can I use some MMX/SSE instructions to process all bits at once?
- Maybe I can use multiplication? (multiply by 0x11111111 or some other magical constant)
- Would it be better to use a condition-set instruction (SETcc) instead of a conditional-jump instruction? If yes, how can I make the compiler produce such code for me?
- Any other idea how to make it faster?
- Any idea how to do the inverse bitmap transformation (I have to implement it too, but it's less critical)?
Comments (11)
There is a clever way to do this which may be helpful here. It actually
solves a slightly more general bit-shuffling problem. Your problem has an
input of:
00000000 00000000 000abcde fghijklm
...but let's consider all of the bits:
ABCDEFGH IJKLMNOP QRSabcde fghijklm
and attempt to interleave them all like so:
AQBRCSDa EbFcGdHe IfJgKhLi MjNkOlPm
For the first step, consider the middle half of the input:
........ IJKLMNOP QRSabcde ........
Construct the 8-bit value {I^Q, J^R, K^S, L^a, M^b, N^c, O^d, P^e}. If we exclusive-OR
this 8-bit value with bits [15:8], and also exclusive-OR the same 8-bit value with
bits [23:16], we will swap the middle two bytes: for example, bit 23 (originally I)
will become I ^ (I^Q) = Q, and bit 15 (originally Q) will become Q ^ (I^Q) = I.
To do that:
tmp = (input ^ (input >> 8)) & 0x0000ff00;
Now the 8-bit value that we need is in bits [15:8], with all other bits 0.
Now we can do the swap with
input ^= (tmp ^ (tmp << 8));
resulting in:
ABCDEFGH QRSabcde IJKLMNOP fghijklm
For the next step, divide and conquer... perform a similar swap of the middle
bits of both the left-hand half:
ABCDEFGH QRSabcde  ->  ABCDQRSa EFGHbcde
...and the right-hand half:
IJKLMNOP fghijklm  ->  IJKLfghi MNOPjklm
We can use exactly the same trick as in the first step, and because we want
to perform exactly the same operation on both 16-bit halves of the 32-bit word,
we can do them in parallel:
tmp = (input ^ (input >> 4)) & 0x00f000f0;
constructs the two pairs of 4 bits that we will use for the swap, and then
input ^= (tmp ^ (tmp << 4));
actually does the swap.
We can continue applying the same principle until the swap is complete.
The bits that participate in the exchange at each point are marked with #:
Code:
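A minimal sketch of the complete four-step shuffle, assuming the later steps follow the same XOR-swap pattern as the first one, with the group width halving each time (8, 4, 2, 1). The function name `shuffle32` and the masks for steps 2 to 4 are derived from that pattern here, so treat them as assumptions and test before use. The comment above each step marks the participating bits with #.

```c
#include <stdint.h>

/* Perfect shuffle of a 32-bit word: interleaves the upper 16 bits with the
 * lower 16 bits. In every step, tmp holds (upper group ^ lower group) at the
 * lower group's position; XORing tmp into both positions swaps the groups. */
uint32_t shuffle32(uint32_t x)
{
    uint32_t tmp;

    /* ........ ######## ######## ........   swap the two middle bytes */
    tmp = (x ^ (x >> 8)) & 0x0000ff00u;
    x ^= tmp ^ (tmp << 8);
    /* now: ABCDEFGH QRSabcde IJKLMNOP fghijklm */

    /* ....#### ####.... ....#### ####....   swap middle nibbles of each half */
    tmp = (x ^ (x >> 4)) & 0x00f000f0u;
    x ^= tmp ^ (tmp << 4);
    /* now: ABCDQRSa EFGHbcde IJKLfghi MNOPjklm */

    /* ..####.. ..####.. ..####.. ..####..   swap middle bit pairs of each byte */
    tmp = (x ^ (x >> 2)) & 0x0c0c0c0cu;
    x ^= tmp ^ (tmp << 2);
    /* now: ABQRCDSa EFbcGHde IJfgKLhi MNjkOPlm */

    /* .##..##. .##..##. .##..##. .##..##.   swap middle bits of each nibble */
    tmp = (x ^ (x >> 1)) & 0x22222222u;
    x ^= tmp ^ (tmp << 1);
    /* now: AQBRCSDa EbFcGdHe IfJgKhLi MjNkOlPm */

    return x;
}
```

When the upper 19 bits of the input are zero, as in the question, the result is the wanted 26-bit bitmap: `shuffle32(0x1FFF)` gives `0x01555555`.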
The reverse operation can be performed by running the 4 steps backwards:
although you may be able to improve on this for your particular application,
if every other bit is known to be zero: see my answer to another
question here.
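As a sketch of the reverse direction: each swap step undoes itself, so applying the same four steps in the opposite order should restore the original word. The second function is an assumption about what a specialised version could look like when every odd-numbered bit is known to be zero; it is the common shift-and-mask compaction and not necessarily what the linked answer describes. Both function names are made up for this example.

```c
#include <stdint.h>

/* Inverse of the 32-bit perfect shuffle: the same XOR-swap steps, run in
 * reverse order (group widths 1, 2, 4, 8). */
uint32_t unshuffle32(uint32_t x)
{
    uint32_t tmp;

    tmp = (x ^ (x >> 1)) & 0x22222222u;
    x ^= tmp ^ (tmp << 1);
    tmp = (x ^ (x >> 2)) & 0x0c0c0c0cu;
    x ^= tmp ^ (tmp << 2);
    tmp = (x ^ (x >> 4)) & 0x00f000f0u;
    x ^= tmp ^ (tmp << 4);
    tmp = (x ^ (x >> 8)) & 0x0000ff00u;
    x ^= tmp ^ (tmp << 8);
    return x;
}

/* Assumed shortcut: gather the even-numbered bits into the low 16 bits.
 * Odd-numbered bits are discarded by the first mask. */
uint32_t compress_even_bits(uint32_t x)
{
    x &= 0x55555555u;
    x = (x | (x >> 1)) & 0x33333333u;
    x = (x | (x >> 2)) & 0x0f0f0f0fu;
    x = (x | (x >> 4)) & 0x00ff00ffu;
    x = (x | (x >> 8)) & 0x0000ffffu;
    return x;
}
```

For the question's data, both `unshuffle32(0x01555555)` and `compress_even_bits(0x01555555)` return `0x1FFF`.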
As a final note, don't believe anything anyone says about relative performance
of any of the methods suggested here without benchmarking them in your
application. (In particular, large lookup tables can appear to be much better
in simple microbenchmarks than they actually are in practice in a given real
application, due to evicting large quantities of other data from the cache,
which can have a negative effect on the outer loop(s).)
Do it with a lookup table. 2^13 sounds like a lot of entries, but they will easily fit into the CPU cache.
Oh, and if there's garbage in the other 19 bits, you need to mask them out first.
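A minimal sketch of the table approach, assuming 32-bit entries (8192 entries × 4 bytes = 32 KiB) filled once at start-up. The names `spread_table`, `init_spread_table` and `spread_lookup` are invented for illustration.

```c
#include <stdint.h>

/* One entry per 13-bit input value. */
static uint32_t spread_table[1u << 13];

/* Fill the table once, using the straightforward bit-by-bit method. */
void init_spread_table(void)
{
    for (uint32_t v = 0; v < (1u << 13); ++v) {
        uint32_t out = 0;
        for (int i = 0; i < 13; ++i)
            if (v & (1u << i))
                out |= 1u << (2 * i);
        spread_table[v] = out;
    }
}

/* Mask off the upper 19 bits first, in case they contain garbage. */
uint32_t spread_lookup(uint32_t input)
{
    return spread_table[input & 0x1FFFu];
}
```

After `init_spread_table()` has run, `spread_lookup(0x1FFF)` returns `0x01555555`.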
You could do:
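One candidate for what could go here is the widely used shift-and-mask spreading technique. This is an assumption about the kind of code meant, not the poster's listing, and the function name is hypothetical.

```c
#include <stdint.h>

/* Spread the low 13 bits of input so that bit i lands on bit 2*i.
 * Each line doubles the gap between groups of bits and masks away
 * the copies that ended up in the wrong place. */
uint32_t spread_shift_mask(uint32_t input)
{
    uint32_t x = input & 0x1FFFu;
    x = (x | (x << 8)) & 0x00ff00ffu;
    x = (x | (x << 4)) & 0x0f0f0f0fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}
```

This needs four shift/OR/AND rounds regardless of the input value and uses no table and no branches.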
Don't use branching:
Here's a possibly easier to read/understand version of the same thing:
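A sketch of a branch-free formulation, assuming the idea is to isolate each input bit with a mask and shift it to its target position instead of testing it. Bit i has to move from position i to position 2*i, so it is shifted left by i. The function name is hypothetical.

```c
#include <stdint.h>

/* Branch-free version of the question's if-chain: every input bit is
 * masked out and shifted left by its own index, then all are ORed together. */
uint32_t spread_branchless(uint32_t input)
{
    return ((input & (1u <<  0)) <<  0)
         | ((input & (1u <<  1)) <<  1)
         | ((input & (1u <<  2)) <<  2)
         | ((input & (1u <<  3)) <<  3)
         | ((input & (1u <<  4)) <<  4)
         | ((input & (1u <<  5)) <<  5)
         | ((input & (1u <<  6)) <<  6)
         | ((input & (1u <<  7)) <<  7)
         | ((input & (1u <<  8)) <<  8)
         | ((input & (1u <<  9)) <<  9)
         | ((input & (1u << 10)) << 10)
         | ((input & (1u << 11)) << 11)
         | ((input & (1u << 12)) << 12);
}
```

The 13 terms are independent of each other, so a compiler is free to schedule them in parallel; whether that beats the branching version depends on the surrounding code.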
On Intel x86 processors starting from Haswell, you can use a single pdep instruction from the BMI2 instruction set to do it:
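A minimal sketch using the `_pdep_u32` intrinsic from `<immintrin.h>`. It assumes GCC or Clang (for the `target` attribute and `__builtin_cpu_supports`); with MSVC the intrinsic can be called directly, but the runtime check would need `__cpuidex` instead. The function names and the portable fallback are additions for this example.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SPREAD_HAVE_X86 1
#endif

/* Portable fallback: move bit i to bit 2*i, one bit at a time. */
static uint32_t spread13_soft(uint32_t input)
{
    uint32_t out = 0;
    for (int i = 0; i < 13; ++i)
        if (input & (1u << i))
            out |= 1u << (2 * i);
    return out;
}

#ifdef SPREAD_HAVE_X86
/* pdep deposits the low bits of the source, in order, into the positions
 * where the mask has a 1. The mask 0x01555555 has its 13 set bits at the
 * even positions 0, 2, ..., 24. */
__attribute__((target("bmi2")))
static uint32_t spread13_bmi2(uint32_t input)
{
    return _pdep_u32(input, 0x01555555u);
}
#endif

uint32_t spread13_pdep(uint32_t input)
{
#ifdef SPREAD_HAVE_X86
    if (__builtin_cpu_supports("bmi2"))
        return spread13_bmi2(input);
#endif
    return spread13_soft(input);
}
```

Because the mask has only 13 set bits, garbage in the upper 19 input bits is ignored. The inverse operation is the matching `_pext_u32(value, 0x01555555u)`. On AMD processors before Zen 3, pdep and pext are reportedly much slower than on Intel, so it is worth benchmarking on the target machine.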
I'll give an algorithm that works without conditionals (only addition and bitwise operations), and I believe this will be faster than your current solution.
Here's the C code for 13 bits. Below there's an illustration of how the method works for 3 bits, and the generalization will be clear I hope.
(Note: The code is loop-unrolled. A good compiler will do that for you, so you can just condense it to a loop.)
Now, here's the explanation of the method for 3 bits. The initial state is '00abc'. Start by moving 'a' two places to the left by adding 01100 and then ANDing with 10011 (which happens to be the bitwise NOT of the previous number). This is how it works for a=0,1 (first arrow is the addition, second arrow is the AND):
a=0: 00abc = 000bc -> 011bc -> 000bc = a00bc
a=1: 00abc = 001bc -> 100bc -> 100bc = a00bc
Next, move 'b' one place to the left by adding 00010 and then ANDing with 10101:
b=0: a00bc = a000c -> a001c -> a000c = a0b0c
b=1: a00bc = a001c -> a010c -> a010c = a0b0c
That's it.
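The 3-bit walkthrough above can be generalized to 13 bits roughly as follows. This is a loop-form sketch rather than the unrolled listing, and the constant for each step is derived here from the walkthrough: to move the bit at position k to position 2k, add a run of ones covering positions k to 2k-1, then clear that run. The function name is hypothetical.

```c
#include <stdint.h>

/* Spread 13 bits using only addition and AND.
 * Bits are processed from the highest (k = 12) down to k = 1; bit 0 is
 * already in place. If bit k is set, adding the run of ones makes the carry
 * ripple up to position 2k; if it is clear, the run is added unchanged.
 * Either way, ANDing with the complement of the run clears positions
 * k .. 2k-1 and leaves the bit at position 2k. */
uint32_t spread_add_and(uint32_t input)
{
    uint32_t x = input & 0x1FFFu;

    for (int k = 12; k >= 1; --k) {
        uint32_t run = (1u << (2 * k)) - (1u << k);  /* ones at k .. 2k-1 */
        x = (x + run) & ~run;
    }
    return x;
}
```

For the 3-bit example, k = 2 gives the run 01100 with complement 10011, and k = 1 gives the run 00010, matching the constants used in the walkthrough.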
First, for your "26-bit" values the highest bit should always be clear, so it's actually a 25-bit value.
1) MMX (and/or SSE) won't help, as the main problem is that there's no simple series of arithmetic or boolean operations that gives the results you want, and everything supports the same arithmetic and boolean operations.
2) I couldn't think of or find a magic constant for multiplication.
3) I can't see a method of using any condition-set instruction (e.g. SETcc) that has any advantages over shift/add instructions.
4) jdv and paul (above) are right. If you need to do this conversion often enough that performance matters, then a lookup table would be the best/fastest option on modern CPUs. The lookup table for "13-bit to 26-bit" would be 2**13 dwords, or 32 KiB. On old CPUs (with small L1 caches) the relative difference between CPU speed and RAM speed wasn't as bad as it is now.
If you can't spare 32 KiB for the "13-bit to 25-bit" lookup table, you can split the 13-bit value into a pair of values (one 6-bit value and one 7-bit value) and then use the lookup table on each of these values before combining the results: shift the result for the upper part left and OR it with the result for the lower part.
In this case, the lookup table has 128 entries (with 2 bytes per entry), so it's only 256 bytes.
5) For the reverse operation, a simple lookup table would cost you 64 MiB (2**25*2) so that isn't a good idea. However, you could split the 25-bit value into a 13-bit value and an 11-bit value (a 12-bit value where the highest bit is always clear), and use an 8192-entry table with one byte per entry (total cost is 8 KiB). There's no reason you couldn't split the 25-bit values into more/smaller pieces though (and use a much smaller table).
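Points 4 and 5 can be sketched as follows. The split positions are assumptions: the forward direction takes the upper 7 bits and the lower 6 bits, and the reverse direction takes bits 0 to 12 and bits 14 to 24 (bit 13 is always zero and is skipped). All names are invented for the example.

```c
#include <stdint.h>

static uint16_t spread7[128];     /* 7 input bits -> 13 spread bits (256 bytes) */
static uint8_t  gather13[8192];   /* 13 spread bits -> 7 packed bits (8 KiB)    */

/* Fill both tables once at start-up. */
void init_split_tables(void)
{
    for (uint32_t v = 0; v < 128; ++v) {
        uint16_t s = 0;
        for (int k = 0; k < 7; ++k)
            if (v & (1u << k))
                s |= (uint16_t)(1u << (2 * k));
        spread7[v] = s;
    }
    for (uint32_t v = 0; v < 8192; ++v) {
        uint8_t g = 0;
        for (int k = 0; k < 7; ++k)
            if (v & (1u << (2 * k)))
                g |= (uint8_t)(1u << k);
        gather13[v] = g;
    }
}

/* 13 bits -> 25 bits: two lookups in the 128-entry table. */
uint32_t spread_split(uint32_t input)
{
    input &= 0x1FFFu;
    return ((uint32_t)spread7[input >> 6] << 12) | spread7[input & 0x3Fu];
}

/* 25 bits -> 13 bits: two lookups in the 8192-entry table. */
uint32_t gather_split(uint32_t spread)
{
    uint32_t low  = gather13[spread & 0x1FFFu];         /* bits 0..12  -> 7 bits */
    uint32_t high = gather13[(spread >> 14) & 0x7FFu];  /* bits 14..24 -> 6 bits */
    return low | (high << 7);
}
```

After `init_split_tables()`, `spread_split(0x1FFF)` returns `0x01555555` and `gather_split(0x01555555)` returns `0x1FFF`.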
You can always use a for-loop:
This is shorter, but I don't think that it's significantly faster.
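A loop of this kind might look like the following sketch, which simply rolls up the question's if-chain. The function name is hypothetical.

```c
#include <stdint.h>

/* Same logic as the question's 13 if-statements, written as a loop:
 * input bit i is copied to output bit 2*i. */
uint32_t spread_loop(uint32_t input)
{
    uint32_t output = 0;

    for (int i = 0; i < 13; ++i) {
        if (input & (1u << i))
            output |= 1u << (2 * i);
    }
    return output;
}
```

An optimizing compiler will often unroll this into much the same code as the hand-written chain, which fits the expectation that it is shorter but not significantly faster.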
I think this might be relevant, but I'm not completely certain. I know of MMX instructions for interleaving bytes of 32/64 bit values, but not individual bits.
You haven't specified the platform this is to run on, and I'd like to try a different approach from the ones already posted (I like the lookup table one, which works fine until the number of bits is increased).
Most platforms have separate shift and rotate instructions. Almost always there is an instruction that includes the carry / overflow flags, so you can "shift in" a bit you want. Let's say we have these instructions:
* SHIFTLEFT: does a left shift and fills the lowest bit with zero.
* ROTATELEFT: does a left shift through the carry flag: the lowest bit is filled from the previous value of the carry flag, and the carry flag is set from the bit that got shifted "out" on the left.
Pseudocode:
... repeat 13 times. Unroll as you please.
The first shift should get the uppermost bit into place right before the carry. ROTATELEFT A will push the MSB into the carry, ROTATELEFT B will push the bit into the LSB of B, and SHIFTLEFT B will put the 0 in. Do that for all the bits.
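The shift/rotate idea can be emulated in C to check the logic before writing any assembly. This sketch assumes 32-bit registers and models ROTATELEFT as a rotate through carry (like x86 `RCL` by 1). One deliberate difference from the description above: the plain shift of B is done at the start of each round rather than at the end, so that the last bit ends up at position 0 instead of position 1. All names are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit register left by one through the carry flag. */
static uint32_t rotate_left_carry(uint32_t reg, unsigned *carry)
{
    unsigned out = reg >> 31;           /* bit that leaves on the left */
    reg = (reg << 1) | (*carry & 1u);   /* old carry enters on the right */
    *carry = out;
    return reg;
}

uint32_t spread_rotate(uint32_t input)
{
    uint32_t a = input << (32 - 13);    /* put input bit 12 into the MSB */
    uint32_t b = 0;
    unsigned carry = 0;

    for (int i = 0; i < 13; ++i) {
        b <<= 1;                            /* SHIFTLEFT B: insert a zero  */
        a = rotate_left_carry(a, &carry);   /* ROTATELEFT A: MSB -> carry  */
        b = rotate_left_carry(b, &carry);   /* ROTATELEFT B: carry -> LSB  */
    }
    return b;
}
```

Each round costs three instructions, so the unrolled assembly version would be about 39 instructions plus the initial alignment shift.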
Edit/Added:
You can do the opposite (inverse bitmap transformation) with the same instructions, like this:
LOAD value into register A;
LOAD 0 into register B;
ROTATELEFT A;
ROTATELEFT A;
ROTATELEFT B;
... repeat 13 times
and then
SHIFTLEFT B; for (registerwidth-13) times.
LSB to carry; forget about it, next LSB into carry, put that into the target register, repeat for all bits, then align the result.
Check if your CPU supports byte and word swapping (For endian conversion) - if so - just roll a swap over it - that would be some 6(5) instructions shorter.