Simulating a variable bit-shift using only constant shifts?
I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.
The particular PowerPC processor I'm working on has the quirk that a shift-by-constant-immediate, like
int ShiftByConstant( int x ) { return x << 3 ; }
is fast, single-op, and superscalar, whereas a shift-by-variable, like
int ShiftByVar( int x, int y ) { return x << y ; }
is a microcoded operation that takes 7-11 cycles to execute while the entire rest of the pipeline stalls.
What I'd like to do is figure out which non-microcoded integer PPC ops the sraw decodes into and then issue them individually. This won't help with the latency of the sraw
itself — it'll replace one op with six — but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.
I can't seem to find anywhere what μops sraw decodes into — does anyone know how I can replace a variable bit-shift with a sequence of constant shifts and basic integer operations? (A for loop or a switch or anything with a branch in it won't work because the branch penalty is even bigger than the microcode penalty, even for correctly-predicted branches.)
This needn't be answered in assembly; I'm hoping to learn the algorithm rather than the particular code, so an answer in C or a high level language or even pseudo code would be perfectly helpful.
Edit: A couple of clarifications that I should add:
- I'm not even a little bit worried about portability
- PPC has a conditional-move, so we can assume the existence of a branchless intrinsic function
int isel(a, b, c) { return a >= 0 ? b : c; }
(if you write out a ternary that does the same thing I'll get what you mean)
- integer multiplication is also microcoded and even slower than sraw. :-(
- On Xenon PPC, the latency of a predicted branch is 8 cycles, so even one makes it as costly as the microcoded instruction. Jump-to-pointer (any indirect branch or function pointer) is a guaranteed mispredict, a 24 cycle stall.
Here you go...
I decided to try these out as well, since Mike Acton claimed on his CellPerformance site that this would be faster than using the CELL/PS3 microcoded shift (he suggests avoiding the indirect shift there). However, in all my tests, the microcoded version was not only faster than a full generic branch-free replacement for the indirect shift, it also took far less code memory (1 instruction).
The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.
EDIT: Note on isel()
I saw your isel() code on your website.
FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an 'andc' opcode. It's the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:
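The rewritten return statement itself didn't survive in this copy of the answer, but a mask-and-complement version of the isel() described above might look like the following sketch (the exact form is an assumption, not the original poster's code):

```c
#include <assert.h>

/* Branchless select: returns b if a >= 0, else c.
   Written as mask and mask complement instead of a ternary, so a
   PowerPC compiler can emit 'and'/'andc' and issue both in parallel.
   Assumes arithmetic right shift of signed ints (true on PPC compilers). */
static int isel(int a, int b, int c) {
    int mask = a >> 31;             /* 0 if a >= 0, all ones if a < 0 */
    return (b & ~mask) | (c & mask);
}
```

The two masked operands have no dependency on each other, which is what lets a superscalar core issue them in the same cycle.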
How about this:
will probably take longer to execute, but it's easier to interleave if you have other code to go in between.
Let's assume that your max shift is 31. So the shift amount is a 5-bit number. Because shifting is cumulative, we can break this into five constant shifts. The obvious version uses branching, but you ruled that out.
Let N be a number between 0 and 4. You want to shift x by 2ᴺ if the bit whose value is 2ᴺ is set in y; otherwise keep x intact. Here is one way to do it:
The macro assigns to x either x << 2ᴺ or x, depending on whether the Nth bit is set in y or not. And then the driver:
Note that N is a macro variable and becomes constant.
Don't know though if this is going to be actually faster than the variable shift. If it would be, one wonders why the microcode wouldn't run this instead...
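The macro and driver themselves are missing from this copy of the answer; following the description above, one possible reconstruction (the names and exact form are my own assumptions) is:

```c
#include <assert.h>

/* Conditionally shift x left by 2^N when bit N of y is set,
   selected with a mask instead of a branch. */
#define SHIFT_STEP(x, y, N) do {                                        \
        unsigned mask_ = 0u - (((y) >> (N)) & 1u); /* all ones if bit N of y is set */ \
        (x) = ((x) & ~mask_) | (((x) << (1u << (N))) & mask_);          \
    } while (0)

/* Variable left shift (y in 0..31) built from five constant shifts. */
static unsigned ShiftLeftByVar(unsigned x, unsigned y) {
    SHIFT_STEP(x, y, 0);   /* shift by 1  */
    SHIFT_STEP(x, y, 1);   /* shift by 2  */
    SHIFT_STEP(x, y, 2);   /* shift by 4  */
    SHIFT_STEP(x, y, 3);   /* shift by 8  */
    SHIFT_STEP(x, y, 4);   /* shift by 16 */
    return x;
}
```

Because N is a macro argument, every shift amount is a compile-time constant (1, 2, 4, 8, 16), so each step maps to the fast single-op immediate shifts the question asks for.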
This one breaks my head. I've now discarded a half dozen ideas. All of them exploit the notion that adding a thing to itself shifts left 1, doing the same to the result shifts left 2 in total, and so on. If you keep all the partial results for shift left 0, 1, 2, 4, 8, and 16, then by testing bits 0 to 4 of the shift variable you can get your initial shift. Now do it again, once for each 1 bit in the shift variable. Frankly, you might as well send your processor out for coffee.
The one place I'd look for real help is Hank Warren's Hacker's Delight (which is the only useful part of this answer).
How about this:
If the shift count can be calculated far in advance then I have two ideas that might work:
- Using self-modifying code: just modify the shift-amount immediate in the instruction, or generate code dynamically for the functions with a variable shift.
- Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or a function pointer to minimize branch mispredictions. This method might also be combined with self-modifying or run-time code generation to remove the need for a function pointer.
Edit: As commented, unfortunately there's no branch prediction on jump to register at all, so the only way this could work is generating code as I said above, or using SIMD
If the range of the values is small, a lookup table is another possible solution.
Now x << n is simply shl[x][n], with x being a uint8_t. The table costs 2 KB (8 × 256 B) of memory. However, for 16-bit values you'll need a 1 MB table (16 × 64 KB), which may still be viable, and you can do a 32-bit shift by combining two 16-bit shifts together.
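As a sketch of the 8-bit table described above (layout and names are assumptions, since the answer's code is not preserved here):

```c
#include <stdint.h>

/* shl[x][n] caches (uint8_t)(x << n) for every byte value x and shift
   count n, so a variable shift becomes a single table load.
   Size: 256 * 8 = 2 KB. */
static uint8_t shl[256][8];

static void init_shl(void) {
    for (int x = 0; x < 256; ++x)
        for (int n = 0; n < 8; ++n)
            shl[x][n] = (uint8_t)(x << n);
}
```

After init_shl() runs once, an expression like shl[x][n] replaces x << n for any 8-bit x, trading the variable-shift microcode stall for a load.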
There is some good stuff here regarding bit manipulation black magic:
Advanced bit manipulation fu (Christer Ericson's blog)
I don't know whether any of it is directly applicable, but if there is a way, there are likely some hints to it in there somewhere.
这是一些根本无法展开的东西:
Here's something that is trivially unrollable:
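The code itself is missing from this copy of the answer; one loop in that spirit, where every iteration is branch-free and the shift amounts become constants once the loop is unrolled, might look like this sketch (not the original poster's code):

```c
#include <assert.h>

/* Variable left shift as five mask-selected constant shifts.
   A compiler can trivially unroll the 5-iteration loop, turning
   (1u << n) into the immediates 1, 2, 4, 8, 16. */
static unsigned ShiftLeft(unsigned x, unsigned y) {
    for (unsigned n = 0; n < 5; ++n) {
        unsigned mask = 0u - ((y >> n) & 1u);   /* all ones if bit n of y set */
        x = (x & ~mask) | ((x << (1u << n)) & mask);
    }
    return x;
}
```

Once unrolled there are no branches and no variable shifts left, only constant shifts, ands, ors, and a subtract per step, all of which the question says are fast, single-op, and superscalar on this core.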