What is the most efficient way to flip all the bits from the least significant bit up to the most significant last 1 bit value?

Posted 2025-01-24 00:49:59


Say, for example, I have a uint8_t that can hold any value, and I only want to flip all the bits from the least significant bit up to the most significant last 1 bit. How would I do that in the most efficient way? Is there a solution where I can avoid using a loop?

Here are some cases:

The left side is the original bits; the right side is the result after the flips.

  • 00011101 -> 00000010
  • 00000000 -> 00000000
  • 11111111 -> 00000000
  • 11110111 -> 00001000
  • 01000000 -> 00111111

[EDIT]

The type could also be larger than uint8_t; it could be uint32_t, uint64_t, or __uint128_t. I just used uint8_t because it's the easiest size to show in the example cases.


Comments (4)

爱*していゐ 2025-01-31 00:49:59


In general I expect that most solutions will have roughly this form:

  1. Compute the mask of bits that need to be flipped
  2. XOR by that mask

As mentioned in the comments, x64 is a target of interest, and on x64 you can do step 1 like this:

  • Find the 1-based position p of the most significant 1 by counting leading zeroes (_lzcnt_u64) and subtracting that from 64 (or 32, whichever is appropriate).
  • Create a mask with p consecutive set bits starting from the least significant bit, probably using _bzhi_u64.

There are some variations, such as using BitScanReverse to find the most significant 1 (but it has an ugly case for zero), or using a shift instead of bzhi (but it has an ugly case for 64). lzcnt and bzhi is a good combination with no ugly cases. bzhi requires BMI2 (Intel Haswell or newer, AMD Zen or newer).

Putting it together:

x ^ _bzhi_u64(~(uint64_t)0, 64 - _lzcnt_u64(x))

Which could be further simplified to

_bzhi_u64(~x,  64 - _lzcnt_u64(x))

As shown by Peter. This doesn't follow the original 2-step plan; rather, all bits are flipped, and then the bits that were originally leading zeroes are reset.
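
For reference, here is that simplified form wrapped into a complete function, as a minimal sketch assuming BMI2 and LZCNT are available (compile with something like -mbmi2 -mlzcnt or -march=haswell); the function name is just for illustration:

#include <stdint.h>
#include <immintrin.h>

// Flip all bits from bit 0 up to and including the most significant set bit.
// x == 0 works out naturally: _lzcnt_u64(0) is 64, so bzhi keeps zero bits.
uint64_t flip_below_msb(uint64_t x)
{
    return _bzhi_u64(~x, 64 - _lzcnt_u64(x));
}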

Since those original leading zeroes form a contiguous sequence of leading ones in ~x, an alternative to bzhi could be to add the appropriate power of two to ~x (though sometimes zero, which might be thought of as 2^64, putting the set bit just beyond the top of the number). Unfortunately, the power of two that we need is a bit annoying to compute; at least I could not come up with a good way to do it, so it seems like a dead end to me.

Step 1 could also be implemented in a generic way (no special operations) using a few shifts and bitwise ORs, like this:

// Get all-ones below the leading 1
// On x86-64, this is probably slower than Paul R's method using BSR and shift
//   even though you have to special case x==0
m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32;  // last step should be removed if x is 32-bit
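
For completeness, a minimal sketch of this generic version as a full function, including the XOR from step 2 (the name and the uint64_t type are just for illustration):

#include <stdint.h>

uint64_t flip_generic(uint64_t x)
{
    // Smear the leading 1 down into every lower bit position
    uint64_t m = x | (x >> 1);
    m |= m >> 2;
    m |= m >> 4;
    m |= m >> 8;
    m |= m >> 16;
    m |= m >> 32;  // remove for a 32-bit type
    return x ^ m;  // flip everything up to and including the leading 1
}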

AMD CPUs have slowish BSR (but fast LZCNT; https://uops.info/), so you might want this shift/or version for uint8_t or uint16_t (where it takes fewest steps), especially if you need compatibility with all CPUs and speed on AMD is more important than on Intel.

This generic version is also useful within SIMD elements, especially narrow ones, where we don't have a leading-zero-count until AVX-512.

执手闯天涯 2025-01-31 00:49:59


TL:DR: use a uint64_t shift to implement efficiently with uint32_t when compiling for 64-bit machines that have lzcnt (AMD since K10, Intel since Haswell). Without lzcnt (only bsr that's baseline for x86) the n==0 case is still special.


For the uint64_t version, the hard part is that you have 65 different possible positions for the highest set bit, including non-existent (lzcnt producing 64 when all bits are zero). But a single shift with 64-bit operand-size on x86 can only produce one of 64 different values (assuming a constant input), since x86 shifts mask the count like foo >> (c&63).

Using a shift requires special-casing one leading-bit-position, typically the n==0 case. As Harold's answer shows, BMI2 bzhi avoids that, allowing bit counts from 0..64.

Same for 32-bit operand-size shifts: they mask c&31. But to generate a mask for uint32_t, we can use a 64-bit shift efficiently on x86-64. (Or 32-bit for uint16_t and uint8_t. Fun fact: x86 asm shifts with 8 or 16-bit operand-size still mask their count mod 32, so they can shift out all the bits without even using a wider operand-size. But 32-bit operand size is efficient, no need to mess with partial-register writes.)

This strategy is even more efficient than bzhi for a type narrower than register width.

// optimized for 64-bit mode, otherwise 32-bit bzhi or a cmov version of Paul R's is good

#ifdef __LZCNT__
#include <immintrin.h>
uint32_t flip_32_on_64(uint32_t n)
{
    uint64_t mask32 = 0xffffffff;  // (uint64_t)(uint32_t)-1u32
    // this needs to be _lzcnt_u32, not __builtin_clz; we need 32 for n==0
    // If lzcnt isn't available, we can't avoid handling the n==0 case specially
    uint32_t mask = mask32 >> _lzcnt_u32(n);
    return n ^ mask;
}
#endif

This works equivalently for uint8_t and uint16_t (literally the same code with the same mask, using a 32-bit lzcnt on them after zero-extension), but not for uint64_t. (You could use an unsigned __int128 shift, but shrd masks its shift count mod 64, so compilers still need some conditional behaviour to emulate it. So you might as well do a manual cmov or something, or sbb same,same to generate a 0 or -1 in a register as the mask to be shifted.)

Godbolt with gcc and clang. Note that it's not safe to replace _lzcnt_u32 with __builtin_clz; clang 11 and later assume that it can't produce 32 even when they compile it to an lzcnt instruction (see footnote 1), and optimize the shift operand-size down to 32, which would act as mask32 >> (clz(n) & 31).

# clang 14 -O3 -march=haswell  (or znver1 or bdver4 or other BMI2 CPUs)
flip_32_on_64:
        lzcnt   eax, edi           # skylake fixed the output false-dependency for lzcnt/tzcnt, but not popcnt.  Clang doesn't care, it's reckless about false deps except inside a loop in a single function.
        mov     ecx, 4294967295
        shrx    rax, rcx, rax
        xor     eax, edi
        ret

Without BMI2, e.g. with -march=bdver1 or barcelona (aka k10), we get the same code-gen except with shr rax, cl. Those CPUs do still have lzcnt, otherwise this wouldn't compile.

(I'm curious if Intel Skylake Pentium/Celeron run lzcnt as lzcnt or bsr. They lack BMI1/BMI2, but lzcnt has its own feature flag.
It seems low-power uarches as recent as Tremont are missing lzcnt, though, according to InstLatx64 for a Pentium Silver N6005 Jasper Lake-D, Tremont core. I didn't manually look for the feature bit in the raw CPUID dumps of recent Pentium/Celeron, but Instlat does have those available if someone wants to check.)

Anyway, bzhi also requires BMI2, so if you're comparing against that for any size but uint64_t, this is the comparison.

This shrx version can keep its -1 constant around in a register across loops. So the mov reg,-1 can be hoisted out of a loop after inlining, if the compiler has a spare register. The best bzhi strategy doesn't need a mask constant so it has nothing to gain. _bzhi_u64(~x, 64 - _lzcnt_u64(x)) is 5 uops, but works for 64-bit integers on 64-bit machines. Its latency critical path length is the same as this. (lzcnt / sub / bzhi).


Without LZCNT, one option might be to always flip as a way to get FLAGS set for CMOV, and use -1 << bsr(n) to XOR some of them back to the original state. This could reduce critical path latency. IDK if a C compiler could be coaxed into emitting this. Especially not if you want to take advantage of the fact that real CPUs keep the BSR destination unchanged if the source was zero, but only AMD documents this fact. (Intel says it's an "undefined" result.)

(TODO: finish this hand-written asm idea.)


Other C ideas for the uint64_t case: cmov or cmp/sbb (to generate a 0 or -1) in parallel with lzcnt to shorten the critical path latency? See the Godbolt link where I was playing with that.
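
For comparison, a plain C sketch of the uint64_t case that leaves the n==0 special case to the compiler (it may pick a branch, cmov, or something else); this assumes LZCNT is available, and the function name is arbitrary:

#include <stdint.h>
#include <immintrin.h>

uint64_t flip_u64(uint64_t n)
{
    // The ternary avoids shifting a 64-bit value by 64, which is UB in C.
    uint64_t mask = n ? (~0ULL >> _lzcnt_u64(n)) : 0;
    return n ^ mask;
}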

ARM/AArch64 saturate their shift counts, unlike how x86 masks for scalar. If one could take advantage of that safely (without C shift-count UB) that would be neat, allowing something about as good as this.

x86 SIMD shifts also saturate their counts, which Paul R took advantage of with an AVX-512 answer using vlzcnt and variable-shift. (It's not worth copying data to an XMM reg and back for one scalar shift, though; only useful if you have multiple elements to do.)

Footnote 1: clang codegen with __builtin_clz or ...ll

Using __builtin_clzll(n) will get clang to use 64-bit operand-size for the shift, since values from 32 to 63 become possible. But you can't actually use that to compile for CPUs without lzcnt. The 63-bsr(n) that a compiler would use without lzcnt available would not produce the 64 we need for that case. Not unless you did n<<=1; / n|=1; or something before the bsr and adjusted the result, but that would be slower than cmov.

If you were using a 64-bit lzcnt, you'd want uint64_t mask = -1ULL, since there will be 32 extra leading zeros after zero-extending to uint64_t. Fortunately, all-ones is relatively cheap to materialize on all ISAs, so use that instead of 0xffffffff00000000ULL.

雨的味道风的声音 2025-01-31 00:49:59


Here’s a simple example for 32 bit ints that works with gcc and compatible compilers (clang et al), and is portable across most architectures.

uint32_t flip(uint32_t n)
{
    if (n == 0) return 0;
    uint32_t mask = ~0U >> __builtin_clz(n);
    return n ^ mask;
}

DEMO
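
As a quick sanity check, a hypothetical test harness running the question's example cases through flip() (the binary literals assume GCC/Clang or C23):

#include <assert.h>

int main(void)
{
    assert(flip(0b00011101) == 0b00000010);
    assert(flip(0b00000000) == 0b00000000);
    assert(flip(0b11111111) == 0b00000000);
    assert(flip(0b11110111) == 0b00001000);
    assert(flip(0b01000000) == 0b00111111);
    return 0;
}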

We could avoid the extra check for n==0 if we used lzcnt on x86-64 (or clz on ARM), and we were using a shift that allowed a count of 32. (In C, shifts by the type-width or larger are undefined behaviour. On x86, in practice the shift count is masked &31 for shifts other than 64-bit, so this could be usable for uint16_t or uint8_t using a uint32_t mask.)

Be careful to avoid C undefined behaviour, including any assumption about __builtin_clz with an input of 0; modern C compilers are not portable assemblers, even though we sometimes wish they were when the language doesn't portably expose the CPU features we want to take advantage of. For example, clang assumes that __builtin_clz(n) can't be 32 even when it compiles it to lzcnt.

See @PeterCordes's answer for details.

青巷忧颜 2025-01-31 00:49:59


If your use case is performance-critical you might also want to consider a SIMD implementation for performing the bit flipping operation on a large number of elements. Here's an example using AVX512 for 32 bit elements:

void flip(const uint32_t in[], uint32_t out[], size_t n)
{
    assert((n & 15) == 0); // for this example we only handle arrays which are vector multiples in size
    for (size_t i = 0; i + 16 <= n; i += 16)
    {
        __m512i vin = _mm512_loadu_si512(&in[i]);
        __m512i vlz = _mm512_lzcnt_epi32(vin);
        __m512i vmask = _mm512_srlv_epi32(_mm512_set1_epi32(-1), vlz);
        __m512i vout = _mm512_xor_si512(vin, vmask);
        _mm512_storeu_si512(&out[i], vout);
    }
}

This uses the same approach as the other solutions, i.e. count leading zeroes, create mask, XOR, but for 32 bit elements it processes 16 elements per loop iteration. You could implement a 64 bit version of this similarly, but unfortunately there are no similar AVX512 intrinsics for element sizes < 32 bits or > 64 bits.
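
For illustration, a 64-bit element version might look like the following sketch, using the corresponding _epi64 intrinsics (a minimal, untested adaptation assuming AVX-512F and AVX-512CD; 8 x 64-bit elements per iteration):

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

void flip64(const uint64_t in[], uint64_t out[], size_t n)
{
    assert((n & 7) == 0); // again, only handle arrays which are whole vector multiples in size
    for (size_t i = 0; i + 8 <= n; i += 8)
    {
        __m512i vin = _mm512_loadu_si512(&in[i]);
        __m512i vlz = _mm512_lzcnt_epi64(vin);                          // per-element leading-zero count
        __m512i vmask = _mm512_srlv_epi64(_mm512_set1_epi64(-1), vlz);  // shift counts >= 64 give 0
        __m512i vout = _mm512_xor_si512(vin, vmask);
        _mm512_storeu_si512(&out[i], vout);
    }
}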

You can see the above 32 bit example in action on Compiler Explorer (note: you might need to hit the refresh button at the bottom of the assembly pane to get it to re-compile and run if you get "Program returned: 139" in the output pane - this seems to be due to a glitch in Compiler Explorer currently).
