旋转并反射5x5位板

发布于 2025-01-25 15:50:40 字数 439 浏览 5 评论 0 原文

我正在尝试找到一种快速旋转的方法，并反射一个5x5板以将其存储在转置表中。由于非常快，板被表示为位。

比特板是这样的：

 20 21 22 23 24
 15 16 17 18 19
 10 11 12 13 14
 05 06 07 08 09
 00 01 02 03 04

我已经找到了一些针对8x8位板的解决方案但是我找不到适用于5x5比特板的解决方案，我也尝试过循环遍历所有位并切换它们，但这是一个非常缓慢的解决方案。（C ++）

原文

I am trying to find a fast way to rotate and reflect a 5x5 board to store it in a transposition table. The board is represented as a bitboard as they are very fast.

The bitboard is represented like this:

 20 21 22 23 24
 15 16 17 18 19
 10 11 12 13 14
 05 06 07 08 09
 00 01 02 03 04

I have have found some solutions for 8x8 bitboards https://www.chessprogramming.org/Flipping_Mirroring_and_Rotating but I can't find a solution that works for a 5x5 bitboard, I also have tried looping through all of the bits and switching them but that was a very slow solution. (C++)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

川水往事 2025-02-01 15:50:40

可以使用4个三角洲掉期进行转置，它不能像8x8转座那样很好地分解，因为5不是两个的力量。 4个三角洲掉期将花费大约24次操作和大约20个周期（由于指令级并行性，少于24个周期，但不小于24个周期，因为由于依赖关系，该代码主要是串行的）。

看起来就是这样：

board = bit_permute_step(board, 0x00006300, 16);
board = bit_permute_step(board, 0x020a080a, 4);
board = bit_permute_step(board, 0x0063008c, 8);
board = bit_permute_step(board, 0x00006310, 16);

where bit_permute_step_step_step_step 是熟悉的熟悉三角洲交换。

另一种方法是使用乘法提取列。例如 x＆amp; 0b000010000100100100100100001 选择一个列，然后我们可以选择一个常数以乘以乘以列的位，以便在32位单词的前5位中以连续的序列中的列列。其余的单词将包含这些位的其他流浪副本，但它们将被转移出去。例如（在C＃中显示，应易于移植）

static uint Tranpose_5x5(uint x)
{
    uint r0 = ((x & 0b0000100001000010000100001) * 0b00001000_10001000_10001000_00000000) >> 27;
    uint r1 = ((x & 0b0001000010000100001000010) * 0b00000100_01000100_01000100_00000000) >> 27;
    uint r2 = ((x & 0b0010000100001000010000100) * 0b00000010_00100010_00100010_00000000) >> 27;
    uint r3 = ((x & 0b0100001000010000100001000) * 0b00000001_00010001_00010001_00000000) >> 27;
    uint r4 = ((x & 0b1000010000100001000010000) * 0b00000000_10001000_10001000_10000000) >> 27;
    return r0 | (r1 << 5) | (r2 << 10) | (r3 << 15) | (r4 << 20);
}

在23个操作下，这似乎并没有保存任何东西，但是由于教学级别并行的机会增加，它的延迟可能会较低。

int r0 = _pext_u32(board, 0b0000100001000010000100001);
int r1 = _pext_u32(board, 0b0001000010000100001000010);
int r2 = _pext_u32(board, 0b0010000100001000010000100);
int r3 = _pext_u32(board, 0b0100001000010000100001000);
int r4 = _pext_u32(board, 0b1000010000100001000010000);
return r0 | (r1 << 5) | (r2 << 10) | (r3 << 15) | (r4 << 20);

pext同时两列，但是我们必须首先与板上串联，因为pext无法重新排序：（也未进行测试）

uint64_t b2 = board | (uint64_t(board) << 25);
int r01 = _pext_u64(b2, 0b0001000010000100001000010'0000100001000010000100001);
int r23 = _pext_u64(b2, 0b0100001000010000100001000'0010000100001000010000100);
int r4 = _pext_u32(board, 0b1000010000100001000010000);
return r01 | (r23 << 10) | (r4 << 20);

水平或垂直翻转很容易，而位小组移动则很容易。

Transpose could be done with 4 delta swaps, it does not decompose as nicely as an 8x8 transpose because 5 is not a power of two. 4 delta swaps would cost about 24 operations and about 20 cycles (less than 24 thanks to instruction level parallelism, but not a lot less than 24 because the code would be mostly serial due to dependencies).

That would look like this:

board = bit_permute_step(board, 0x00006300, 16);
board = bit_permute_step(board, 0x020a080a, 4);
board = bit_permute_step(board, 0x0063008c, 8);
board = bit_permute_step(board, 0x00006310, 16);

Where bit_permute_step is the familiar delta swap.

An alternative way is to use multiplication to extract columns. For example x & 0b0000100001000010000100001 selects one column, then we can choose a constant to multiply by such that the bits from the column line up in a contiguous sequence in the top 5 bits of a 32-bit word. The rest of the word would contain other stray copies of those bits, but they will be shifted out. For example (shown in C#, should be easy to port)

static uint Tranpose_5x5(uint x)
{
    uint r0 = ((x & 0b0000100001000010000100001) * 0b00001000_10001000_10001000_00000000) >> 27;
    uint r1 = ((x & 0b0001000010000100001000010) * 0b00000100_01000100_01000100_00000000) >> 27;
    uint r2 = ((x & 0b0010000100001000010000100) * 0b00000010_00100010_00100010_00000000) >> 27;
    uint r3 = ((x & 0b0100001000010000100001000) * 0b00000001_00010001_00010001_00000000) >> 27;
    uint r4 = ((x & 0b1000010000100001000010000) * 0b00000000_10001000_10001000_10000000) >> 27;
    return r0 | (r1 << 5) | (r2 << 10) | (r3 << 15) | (r4 << 20);
}

At 23 operations, this doesn't seem to have saved anything, but it could have a lower latency thanks to the increased opportunity for instruction level parallelism.

PEXT makes that nicer/simpler: (not tested)

int r0 = _pext_u32(board, 0b0000100001000010000100001);
int r1 = _pext_u32(board, 0b0001000010000100001000010);
int r2 = _pext_u32(board, 0b0010000100001000010000100);
int r3 = _pext_u32(board, 0b0100001000010000100001000);
int r4 = _pext_u32(board, 0b1000010000100001000010000);
return r0 | (r1 << 5) | (r2 << 10) | (r3 << 15) | (r4 << 20);

64-bit pext would enable extracting two columns at once, but we have to concatenate the board with itself first because pext cannot reorder: (also not tested)

uint64_t b2 = board | (uint64_t(board) << 25);
int r01 = _pext_u64(b2, 0b0001000010000100001000010'0000100001000010000100001);
int r23 = _pext_u64(b2, 0b0100001000010000100001000'0010000100001000010000100);
int r4 = _pext_u32(board, 0b1000010000100001000010000);
return r01 | (r23 << 10) | (r4 << 20);

Flipping horizontally or vertically would be easy with bit-group moving.

回复收藏 0 原文

~没有更多了~