Simulating a variable bit-shift using only constant shifts?
I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.
The particular PowerPC processor I'm working on has the quirk that a shift-by-constant-immediate, like
int ShiftByConstant( int x ) { return x << 3 ; }
is fast, single-op, and superscalar, whereas a shift-by-variable, like
int ShiftByVar( int x, int y ) { return x << y ; }
is a microcoded operation that takes 7-11 cycles to execute while the entire rest of the pipeline stalls.
What I'd like to do is figure out which non-microcoded integer PPC ops the sraw decodes into and then issue them individually. This won't help with the latency of the sraw
itself — it'll replace one op with six — but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.
I can't seem to find anywhere what μops sraw decodes into — does anyone know how I can replace a variable bit-shift with a sequence of constant shifts and basic integer operations? (A for loop or a switch or anything with a branch in it won't work because the branch penalty is even bigger than the microcode penalty, even for correctly-predicted branches.)
This needn't be answered in assembly; I'm hoping to learn the algorithm rather than the particular code, so an answer in C or a high level language or even pseudo code would be perfectly helpful.
Edit: A couple of clarifications that I should add:
- I'm not even a little bit worried about portability
- PPC has a conditional-move, so we can assume the existence of a branchless intrinsic function
int isel(a, b, c) { return a >= 0 ? b : c; }
(if you write out a ternary that does the same thing I'll get what you mean)
- integer multiplication is also microcoded and even slower than sraw. :-(
- On Xenon PPC, the latency of a predicted branch is 8 cycles, so even one makes it as costly as the microcoded instruction. Jump-to-pointer (any indirect branch or function pointer) is a guaranteed mispredict, a 24 cycle stall.
Here you go...
I decided to try these out as well, since Mike Acton claimed on his CellPerformance site that this would be faster than using the CELL/PS3 microcoded shift (he suggests avoiding the indirect shift there). However, in all my tests, the microcoded version was not only faster than a full generic branch-free replacement for the indirect shift, it also took far less code memory (1 instruction).
The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.
EDIT: Note on isel()
I saw your isel() code on your website.
FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an 'andc' opcode. It's the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:
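The rewritten return statement itself didn't survive in this copy of the answer, but a mask-and-complement version of the isel() described above might look like the following sketch (the exact form is an assumption, not the original poster's code):

```c
#include <assert.h>

/* Branchless select: returns b if a >= 0, else c.
   Written as mask and mask complement instead of a ternary, so a
   PowerPC compiler can emit 'and'/'andc' and issue both in parallel.
   Assumes arithmetic right shift of signed ints (true on PPC compilers). */
static int isel(int a, int b, int c) {
    int mask = a >> 31;             /* 0 if a >= 0, all ones if a < 0 */
    return (b & ~mask) | (c & mask);
}
```

The two masked operands have no dependency on each other, which is what lets a superscalar core issue them in the same cycle.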
How about this:
will probably take longer to execute, but it's easier to interleave if you have other code to go in between.
Let's assume that your max shift is 31. So the shift amount is a 5-bit number. Because shifting is cumulative, we can break this into five constant shifts. The obvious version uses branching, but you ruled that out.
Let N be a number between 0 and 4. You want to shift x by 2ᴺ if the bit whose value is 2ᴺ is set in y; otherwise keep x intact. Here is one way to do it:
The macro assigns to x either x << 2ᴺ or x, depending on whether the Nth bit is set in y or not. And then the driver:
Note that N is a macro variable and becomes constant.
Don't know though if this is going to be actually faster than the variable shift. If it would be, one wonders why the microcode wouldn't run this instead...
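The macro and driver themselves are missing from this copy of the answer; following the description above, one possible reconstruction (the names and exact form are my own assumptions) is:

```c
#include <assert.h>

/* Conditionally shift x left by 2^N when bit N of y is set,
   selected with a mask instead of a branch. */
#define SHIFT_STEP(x, y, N) do {                                        \
        unsigned mask_ = 0u - (((y) >> (N)) & 1u); /* all ones if bit N of y is set */ \
        (x) = ((x) & ~mask_) | (((x) << (1u << (N))) & mask_);          \
    } while (0)

/* Variable left shift (y in 0..31) built from five constant shifts. */
static unsigned ShiftLeftByVar(unsigned x, unsigned y) {
    SHIFT_STEP(x, y, 0);   /* shift by 1  */
    SHIFT_STEP(x, y, 1);   /* shift by 2  */
    SHIFT_STEP(x, y, 2);   /* shift by 4  */
    SHIFT_STEP(x, y, 3);   /* shift by 8  */
    SHIFT_STEP(x, y, 4);   /* shift by 16 */
    return x;
}
```

Because N is a macro argument, every shift amount is a compile-time constant (1, 2, 4, 8, 16), so each step maps to the fast single-op immediate shifts the question asks for.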
This one breaks my head. I've now discarded a half dozen ideas. All of them exploit the notion that adding a thing to itself shifts left 1, doing the same to the result shifts left 2 in total, and so on. If you keep all the partial results for shift left 0, 1, 2, 4, 8, and 16, then by testing bits 0 to 4 of the shift variable you can get your initial shift. Now do it again, once for each 1 bit in the shift variable. Frankly, you might as well send your processor out for coffee.
The one place I'd look for real help is Hank Warren's Hacker's Delight (which is the only useful part of this answer).
How about this:
If the shift count can be calculated far in advance then I have two ideas that might work:
- Using self-modifying code: just modify the shift-amount immediate in the instruction, or generate code dynamically for the functions with a variable shift.
- Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or a function pointer to minimize branch mispredictions. This method might also be combined with self-modifying or run-time code generation to remove the need for a function pointer.
Edit: As commented, unfortunately there's no branch prediction on jump to register at all, so the only way this could work is generating code as I said above, or using SIMD
If the range of the values is small, a lookup table is another possible solution.
Now x << n is simply shl[x][n], with x being a uint8_t. The table costs 2 KB (8 × 256 B) of memory. However, for 16-bit values you'll need a 1 MB table (16 × 64 KB), which may still be viable, and you can do a 32-bit shift by combining two 16-bit shifts together.
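As a sketch of the 8-bit table described above (layout and names are assumptions, since the answer's code is not preserved here):

```c
#include <stdint.h>

/* shl[x][n] caches (uint8_t)(x << n) for every byte value x and shift
   count n, so a variable shift becomes a single table load.
   Size: 256 * 8 = 2 KB. */
static uint8_t shl[256][8];

static void init_shl(void) {
    for (int x = 0; x < 256; ++x)
        for (int n = 0; n < 8; ++n)
            shl[x][n] = (uint8_t)(x << n);
}
```

After init_shl() runs once, an expression like shl[x][n] replaces x << n for any 8-bit x, trading the variable-shift microcode stall for a load.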
There is some good stuff here regarding bit manipulation black magic:
Advanced bit manipulation fu (Christer Ericson's blog)
I don't know whether any of it is directly applicable, but if there is a way, there are likely some hints to it in there somewhere.
这是一些根本无法展开的东西:
Here's something that is trivially unrollable:
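The code itself is missing from this copy of the answer; one loop in that spirit, where every iteration is branch-free and the shift amounts become constants once the loop is unrolled, might look like this sketch (not the original poster's code):

```c
#include <assert.h>

/* Variable left shift as five mask-selected constant shifts.
   A compiler can trivially unroll the 5-iteration loop, turning
   (1u << n) into the immediates 1, 2, 4, 8, 16. */
static unsigned ShiftLeft(unsigned x, unsigned y) {
    for (unsigned n = 0; n < 5; ++n) {
        unsigned mask = 0u - ((y >> n) & 1u);   /* all ones if bit n of y set */
        x = (x & ~mask) | ((x << (1u << n)) & mask);
    }
    return x;
}
```

Once unrolled there are no branches and no variable shifts left, only constant shifts, ands, ors, and a subtract per step, all of which the question says are fast, single-op, and superscalar on this core.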