如何有效地解交织位(逆莫顿)

发布于 2024-10-16 04:41:53 字数 1260 浏览 2 评论 0原文

这个问题:如何去交错位(UnMortonizing?)有一个很好的答案提取莫顿数的两半之一(仅奇数位)的答案,但我需要一个解决方案,以尽可能少的操作提取两个部分(奇数位和偶数位)。

对于我的使用,我需要采用 32 位 int 并提取两个 16 位 int,其中一个是偶数位,另一个是右移 1 位的奇数位,例如,

input,  z: 11101101 01010111 11011011 01101110

output, x: 11100001 10110111 // odd bits shifted right by 1
        y: 10111111 11011010 // even bits

似乎有很多使用移位和掩码的解决方案使用幻数生成莫顿数(即交织位),例如 按二进制幻数交织位,但我还没有找到任何可以做相反的事情(即去交错)。

更新

在重新阅读 Hacker's Delight 关于完美洗牌/非洗牌的部分后,我发现了一些有用的示例,我对其进行了如下修改:

// morton1 - extract even bits

uint32_t morton1(uint32_t x)
{
    x = x & 0x55555555;
    x = (x | (x >> 1)) & 0x33333333;
    x = (x | (x >> 2)) & 0x0F0F0F0F;
    x = (x | (x >> 4)) & 0x00FF00FF;
    x = (x | (x >> 8)) & 0x0000FFFF;
    return x;
}

// morton2 - extract odd and even bits

void morton2(uint32_t *x, uint32_t *y, uint32_t z)
{
    *x = morton1(z);
    *y = morton1(z >> 1);
}

我认为这仍然可以改进,无论是在当前的标量形式还是在通过利用 SIMD,所以我仍然对更好的解决方案感兴趣(标量或 SIMD)。

This question: How to de-interleave bits (UnMortonizing?) has a good answer for extracting one of the two halves of a Morton number (just the odd bits), but I need a solution which extracts both parts (the odd bits and the even bits) in as few operations as possible.

For my use I would need to take a 32 bit int and extract two 16 bit ints, where one is the even bits and the other is the odd bits shifted right by 1 bit, e.g.

input,  z: 11101101 01010111 11011011 01101110

output, x: 11100001 10110111 // odd bits shifted right by 1
        y: 10111111 11011010 // even bits

There seem to be plenty of solutions using shifts and masks with magic numbers for generating Morton numbers (i.e. interleaving bits), e.g. Interleave bits by Binary Magic Numbers, but I haven't yet found anything for doing the reverse (i.e. de-interleaving).

UPDATE

After re-reading the section from Hacker's Delight on perfect shuffles/unshuffles I found some useful examples which I adapted as follows:

// morton1 - extract even bits

uint32_t morton1(uint32_t x)
{
    x = x & 0x55555555;
    x = (x | (x >> 1)) & 0x33333333;
    x = (x | (x >> 2)) & 0x0F0F0F0F;
    x = (x | (x >> 4)) & 0x00FF00FF;
    x = (x | (x >> 8)) & 0x0000FFFF;
    return x;
}

// morton2 - extract odd and even bits

void morton2(uint32_t *x, uint32_t *y, uint32_t z)
{
    *x = morton1(z);
    *y = morton1(z >> 1);
}

I think this can still be improved on, both in its current scalar form and also by taking advantage of SIMD, so I'm still interested in better solutions (either scalar or SIMD).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

给不了的爱 2024-10-23 04:41:54

如果您的处理器有效地处理 64 位整数,您可以组合操作......

int64 w = (z &0xAAAAAAAA)<<31 | (z &0x55555555 )
w = (w | (w >> 1)) & 0x3333333333333333;
w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F; 
...

If your processor handles 64 bit ints efficiently, you could combine the operations...

int64 w = (z &0xAAAAAAAA)<<31 | (z &0x55555555 )
w = (w | (w >> 1)) & 0x3333333333333333;
w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F; 
...
凉世弥音 2024-10-23 04:41:54

Intel Haswell 及更高版本 CPU 的代码。您可以使用包含 pext 和 pdep 指令的 BMI2 指令集。这些(以及其他伟大的东西)可以用来构建您的函数。

#include <immintrin.h>
#include <stdint.h>

// on GCC, compile with option -mbmi2, requires Haswell or better.

uint64_t xy_to_morton (uint32_t x, uint32_t y)
{
    return _pdep_u32(x, 0x55555555) | _pdep_u32(y,0xaaaaaaaa);
}

uint64_t morton_to_xy (uint64_t m, uint32_t *x, uint32_t *y)
{
    *x = _pext_u64(m, 0x5555555555555555);
    *y = _pext_u64(m, 0xaaaaaaaaaaaaaaaa);
}

Code for the Intel Haswell and later CPUs. You can use the BMI2 instruction set which contains the pext and pdep instructions. These can (among other great things) be used to build your functions.

#include <immintrin.h>
#include <stdint.h>

// on GCC, compile with option -mbmi2, requires Haswell or better.

uint64_t xy_to_morton (uint32_t x, uint32_t y)
{
    return _pdep_u32(x, 0x55555555) | _pdep_u32(y,0xaaaaaaaa);
}

uint64_t morton_to_xy (uint64_t m, uint32_t *x, uint32_t *y)
{
    *x = _pext_u64(m, 0x5555555555555555);
    *y = _pext_u64(m, 0xaaaaaaaaaaaaaaaa);
}
十秒萌定你 2024-10-23 04:41:54

如果有人在 3d 中使用 morton 代码,那么他需要每 3 个读取一位,这里的 64 位是我使用的函数:

uint64_t morton3(uint64_t x) {
    x = x & 0x9249249249249249;
    x = (x | (x >> 2))  & 0x30c30c30c30c30c3;
    x = (x | (x >> 4))  & 0xf00f00f00f00f00f;
    x = (x | (x >> 8))  & 0x00ff0000ff0000ff;
    x = (x | (x >> 16)) & 0xffff00000000ffff;
    x = (x | (x >> 32)) & 0x00000000ffffffff;
    return x;
}
uint64_t bits; 
uint64_t x = morton3(bits)
uint64_t y = morton3(bits>>1)
uint64_t z = morton3(bits>>2)

In case someone is using morton codes in 3d, so he needs to read one bit every 3, and 64 bits here is the function I used:

uint64_t morton3(uint64_t x) {
    x = x & 0x9249249249249249;
    x = (x | (x >> 2))  & 0x30c30c30c30c30c3;
    x = (x | (x >> 4))  & 0xf00f00f00f00f00f;
    x = (x | (x >> 8))  & 0x00ff0000ff0000ff;
    x = (x | (x >> 16)) & 0xffff00000000ffff;
    x = (x | (x >> 32)) & 0x00000000ffffffff;
    return x;
}
uint64_t bits; 
uint64_t x = morton3(bits)
uint64_t y = morton3(bits>>1)
uint64_t z = morton3(bits>>2)
裂开嘴轻声笑有多痛 2024-10-23 04:41:54

您可以通过相乘来提取 8 个交错位,如下所示:

uint8_t deinterleave_even(uint16_t x) {
    return ((x & 0x5555) * 0xC00030000C0003 & 0x0600180060008001) * 0x0101010101010101 >> 56;
}
uint8_t deinterleave_odd(uint16_t x) {
    return ((x & 0xAAAA) * 0xC00030000C0003 & 0x03000C003000C000) * 0x0101010101010101 >> 56;
}

将它们合并为 32 位或更大应该很简单。

You can extract 8 interleaved bits by multiplying like so:

uint8_t deinterleave_even(uint16_t x) {
    return ((x & 0x5555) * 0xC00030000C0003 & 0x0600180060008001) * 0x0101010101010101 >> 56;
}
uint8_t deinterleave_odd(uint16_t x) {
    return ((x & 0xAAAA) * 0xC00030000C0003 & 0x03000C003000C000) * 0x0101010101010101 >> 56;
}

It should be trivial to combine them for 32 bits or larger.

孤独难免 2024-10-23 04:41:54

如果您需要速度,则可以使用表查找一次进行一字节转换(两字节表速度更快,但太大)。程序是在 Delphi IDE 下编写的,但汇编器/算法是相同的。

const
  MortonTableLookup : array[byte] of byte = ($00, $01, $10, $11, $12, ... ;

procedure DeinterleaveBits(Input: cardinal);
//In: eax
//Out: dx = EvenBits; ax = OddBits;
asm
  movzx   ecx, al                                     //Use 0th byte
  mov     dl, byte ptr[MortonTableLookup + ecx]
//
  shr     eax, 8
  movzx   ecx, ah                                     //Use 2th byte
  mov     dh, byte ptr[MortonTableLookup + ecx]
//
  shl     edx, 16
  movzx   ecx, al                                     //Use 1th byte
  mov     dl, byte ptr[MortonTableLookup + ecx]
//
  shr     eax, 8
  movzx   ecx, ah                                     //Use 3th byte
  mov     dh, byte ptr[MortonTableLookup + ecx]
//
  mov     ecx, edx  
  and     ecx, $F0F0F0F0
  mov     eax, ecx
  rol     eax, 12
  or      eax, ecx

  rol     edx, 4
  and     edx, $F0F0F0F0
  mov     ecx, edx
  rol     ecx, 12
  or      edx, ecx
end;

If you need speed than you can use table-lookup for one byte conversion at once (two bytes table is faster but to big). Procedure is made under Delphi IDE but the assembler/algorithem is the same.

const
  MortonTableLookup : array[byte] of byte = ($00, $01, $10, $11, $12, ... ;

procedure DeinterleaveBits(Input: cardinal);
//In: eax
//Out: dx = EvenBits; ax = OddBits;
asm
  movzx   ecx, al                                     //Use 0th byte
  mov     dl, byte ptr[MortonTableLookup + ecx]
//
  shr     eax, 8
  movzx   ecx, ah                                     //Use 2th byte
  mov     dh, byte ptr[MortonTableLookup + ecx]
//
  shl     edx, 16
  movzx   ecx, al                                     //Use 1th byte
  mov     dl, byte ptr[MortonTableLookup + ecx]
//
  shr     eax, 8
  movzx   ecx, ah                                     //Use 3th byte
  mov     dh, byte ptr[MortonTableLookup + ecx]
//
  mov     ecx, edx  
  and     ecx, $F0F0F0F0
  mov     eax, ecx
  rol     eax, 12
  or      eax, ecx

  rol     edx, 4
  and     edx, $F0F0F0F0
  mov     ecx, edx
  rol     ecx, 12
  or      edx, ecx
end;
梦言归人 2024-10-23 04:41:54

我不想局限于固定大小的整数并使用硬编码常量制作类似命令的列表,因此我开发了一个 C++11 解决方案,它利用模板元编程来生成函数和常量。使用 -O3 生成的汇编代码似乎在不使用 BMI 的情况下尽可能紧凑:

andl    $0x55555555, %eax
movl    %eax, %ecx
shrl    %ecx
orl     %eax, %ecx
andl    $0x33333333, %ecx
movl    %ecx, %eax
shrl    $2, %eax
orl     %ecx, %eax
andl    $0xF0F0F0F, %eax
movl    %eax, %ecx
shrl    $4, %ecx
orl     %eax, %ecx
movzbl  %cl, %esi
shrl    $8, %ecx
andl    $0xFF00, %ecx
orl     %ecx, %esi

TL;DR 源代码库现场演示


实现

基本上,morton1 函数中的每个步骤都是通过移位和添加到一系列常量来实现的,如下所示:

  1. 0b0101010101010101(交替 1 和 0)
  2. 0b00110011001100110b0011001100110011 code> (交替 2x 1 和 0)
  3. 0b0000111100001111 (交替 4x 1 和 0)
  4. 0b0000000011111111 (交替 8x 1 和 0)

如果我们要使用 D 维度,我们将得到一个包含 D-1 0 和 1 1 的模式。因此,要生成这些,生成连续的并应用一些按位“或”就足够了:

/// @brief Generates 0b1...1 with @tparam n ones
template <class T, unsigned n>
using n_ones = std::integral_constant<T, (~static_cast<T>(0) >> (sizeof(T) * 8 - n))>;

/// @brief Performs `@tparam input | (@tparam input << @tparam width` @tparam repeat times.
template <class T, T input, unsigned width, unsigned repeat>
struct lshift_add :
    public lshift_add<T, lshift_add<T, input, width, 1>::value, width, repeat - 1> {
};
/// @brief Specialization for 1 repetition, just does the shift-and-add operation.
template <class T, T input, unsigned width>
struct lshift_add<T, input, width, 1> : public std::integral_constant<T,
    (input & n_ones<T, width>::value) | (input << (width < sizeof(T) * 8 ? width : 0))> {
};

现在我们可以在编译时通过以下方式生成任意维度的常量:

template <class T, unsigned step, unsigned dimensions = 2u>
using mask = lshift_add<T, n_ones<T, 1 << step>::value, dimensions * (1 << step), sizeof(T) * 8 / (2 << step)>;

使用相同类型的递归,我们可以为每个步骤生成函数算法 x = (x | (x >> K)) & M:

template <class T, unsigned step, unsigned dimensions>
struct deinterleave {
    static T work(T input) {
        input = deinterleave<T, step - 1, dimensions>::work(input);
        return (input | (input >> ((dimensions - 1) * (1 << (step - 1))))) & mask<T, step, dimensions>::value;
    }
};
// Omitted specialization for step 0, where there is just a bitwise and

仍然需要回答“我们需要多少步?”这个问题。这还取决于维数。一般来说,k 步计算 2^k - 1 输出位;每个维度的最大有意义位数由 z = sizeof(T) * 8/dimensions 给出,因此采取 1 + log_2 z 步就足够了。现在的问题是我们需要将其作为 constexpr 以便将其用作模板参数。我发现解决此问题的最佳方法是通过元编程定义 log2

template <unsigned arg>
struct log2 : public std::integral_constant<unsigned, log2<(arg >> 1)>::value + 1> {};
template <>
struct log2<1u> : public std::integral_constant<unsigned, 0u> {};

/// @brief Helper constexpr which returns the number of steps needed to fully interleave a type @tparam T.
template <class T, unsigned dimensions>
using num_steps = std::integral_constant<unsigned, log2<sizeof(T) * 8 / dimensions>::value + 1>;

最后,我们可以执行一次调用:

/// @brief Helper function which combines @see deinterleave and @see num_steps into a single call.
template <class T, unsigned dimensions>
T deinterleave_first(T n) {
    return deinterleave<T, num_steps<T, dimensions>::value - 1, dimensions>::work(n);
}

I didn't want to be limited to a fixed size integer and making lists of similar commands with hardcoded constants, so I developed a C++11 solution which makes use of template metaprogramming to generate the functions and the constants. The assembly code generated with -O3 seems as tight as it can get without using BMI:

andl    $0x55555555, %eax
movl    %eax, %ecx
shrl    %ecx
orl     %eax, %ecx
andl    $0x33333333, %ecx
movl    %ecx, %eax
shrl    $2, %eax
orl     %ecx, %eax
andl    $0xF0F0F0F, %eax
movl    %eax, %ecx
shrl    $4, %ecx
orl     %eax, %ecx
movzbl  %cl, %esi
shrl    $8, %ecx
andl    $0xFF00, %ecx
orl     %ecx, %esi

TL;DR source repo and live demo.


Implementation

Basically every step in the morton1 function works by shifting and adding to a sequence of constants which look like this:

  1. 0b0101010101010101 (alternate 1 and 0)
  2. 0b0011001100110011 (alternate 2x 1 and 0)
  3. 0b0000111100001111 (alternate 4x 1 and 0)
  4. 0b0000000011111111 (alternate 8x 1 and 0)

If we were to use D dimensions, we would have a pattern with D-1 zeros and 1 one. So to generate these it's enough to generate consecutive ones and apply some bitwise or:

/// @brief Generates 0b1...1 with @tparam n ones
template <class T, unsigned n>
using n_ones = std::integral_constant<T, (~static_cast<T>(0) >> (sizeof(T) * 8 - n))>;

/// @brief Performs `@tparam input | (@tparam input << @tparam width` @tparam repeat times.
template <class T, T input, unsigned width, unsigned repeat>
struct lshift_add :
    public lshift_add<T, lshift_add<T, input, width, 1>::value, width, repeat - 1> {
};
/// @brief Specialization for 1 repetition, just does the shift-and-add operation.
template <class T, T input, unsigned width>
struct lshift_add<T, input, width, 1> : public std::integral_constant<T,
    (input & n_ones<T, width>::value) | (input << (width < sizeof(T) * 8 ? width : 0))> {
};

Now that we can generate the constants at compile time for arbitrary dimensions with the following:

template <class T, unsigned step, unsigned dimensions = 2u>
using mask = lshift_add<T, n_ones<T, 1 << step>::value, dimensions * (1 << step), sizeof(T) * 8 / (2 << step)>;

With the same type of recursion, we can generate functions for each of the steps of the algorithm x = (x | (x >> K)) & M:

template <class T, unsigned step, unsigned dimensions>
struct deinterleave {
    static T work(T input) {
        input = deinterleave<T, step - 1, dimensions>::work(input);
        return (input | (input >> ((dimensions - 1) * (1 << (step - 1))))) & mask<T, step, dimensions>::value;
    }
};
// Omitted specialization for step 0, where there is just a bitwise and

It remains to answer the question "how many steps do we need?". This depends also on the number of dimensions. In general, k steps compute 2^k - 1 output bits; the maximum number of meaningful bits for each dimension is given by z = sizeof(T) * 8 / dimensions, therefore it is enough to take 1 + log_2 z steps. The problem is now that we need this as constexpr in order to use it as a template parameter. The best way I found to work around this is to define log2 via metaprogramming:

template <unsigned arg>
struct log2 : public std::integral_constant<unsigned, log2<(arg >> 1)>::value + 1> {};
template <>
struct log2<1u> : public std::integral_constant<unsigned, 0u> {};

/// @brief Helper constexpr which returns the number of steps needed to fully interleave a type @tparam T.
template <class T, unsigned dimensions>
using num_steps = std::integral_constant<unsigned, log2<sizeof(T) * 8 / dimensions>::value + 1>;

And finally, we can perform one single call:

/// @brief Helper function which combines @see deinterleave and @see num_steps into a single call.
template <class T, unsigned dimensions>
T deinterleave_first(T n) {
    return deinterleave<T, num_steps<T, dimensions>::value - 1, dimensions>::work(n);
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文