Efficiently counting Hamming weight

Posted 2025-01-24 09:07:50


I'm on an Apple M1 processor.

What I'm trying to do is efficiently count the 1 bits in a large char array in Rust. I looked up the ARM NEON instructions, and I think I can do it via a cnt instruction (which counts 1s per 8-bit block) and then an addv to add up the 8-bit blocks.

To start with, I figured I would feed in a 64-bit uint.

use std::arch::asm;

fn aarch64_hamming_weight(inp: u64) -> u64 {
    let mut outp: u64;
    unsafe {
        asm!(
            "dup v0.2d, {input}",
            "cnt v0.16b, v0.16b",
            "addv b1, v0.16b", // this adds all the bytes of the vector, how do I get the result?!
            //"fmov {output2}, b1",
            "fmov {output}, v0.d[1]",
            input = in(reg) inp,
            output = out(reg) outp,
        )
    }

    return outp;
}

This nearly works, but if the number is wider than 8 bits then the result isn't quite right:
225's bit count comes out as 4
425's bit count comes out as 260

So:

  1. I need a way to get the addv's output (a sketch of one approach follows this list)
  2. I need a way for it to work with an arbitrary char array (which would need a loop, pulling 128 bits at a time)
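
For reference, this is an editor's addition rather than part of the original post: point 1 can be handled by reading the addv result back out of the vector register. addv writes the byte sum into lane 0 of its destination and zeroes the remaining bits, so a 32-bit fmov from the corresponding s register recovers it. A minimal sketch, assuming stable Rust's std::arch::asm and operating only on the low 64 bits (so the dup double-counting goes away):

use std::arch::asm;

fn popcnt_u64_asm(inp: u64) -> u64 {
    let outp: u64;
    unsafe {
        asm!(
            "fmov d0, {input}",       // put the u64 into the low 64 bits of v0
            "cnt  v0.8b, v0.8b",      // per-byte popcounts
            "addv b0, v0.8b",         // sum the 8 byte counts; result in lane 0, upper bits zeroed
            "fmov {output:w}, s0",    // read the zero-extended sum back into a GPR
            input = in(reg) inp,
            output = out(reg) outp,
            out("v0") _,              // v0 is clobbered
            options(nomem, nostack),
        );
    }
    outp
}

// e.g. popcnt_u64_asm(425) should be 4, matching 425u64.count_ones().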


Comments (1)

掩饰不了的爱 2025-01-31 09:07:50


When I started writing an answer, I thought it was going to show that inline asm was totally pointless, and that rustc would do a good job of vectorizing u8.count_ones(). That is unfortunately not the case, but you should only use asm if you're going to write the whole loop, not one u64 at a time. (Possibly still inline asm, but writing a whole function in asm might be reasonable.)

If Rust has SIMD intrinsics for ARM/AArch64, that would be a much better choice than inline asm, and definitely worth using.


For counting a whole array, you don't want to reduce every cnt result to scalar separately, especially not all the way to a general-purpose integer register. Especially not for a single u64 at a time instead of a full 128 bits. Add into a vector of counts that you only reduce at the end.

clang/LLVM knows how to auto-vectorize __builtin_popcount() over an array in C, so I was hoping LLVM-based rustc would do OK for AArch64. It is somewhat OK for u64, but unfortunately terrible for u8. If you can safely point a u64 span at your data (or a reference, or however Rust does things), then you can get a good fraction of the benefit portably, with no brittle inline asm, and in a way that future compiler versions can hopefully improve on.

pub fn hamming_weight_arr(inp: &[u64]) -> u64 {
    let mut sum : u64 = 0;
    for v in inp {
        sum += v.count_ones() as u64;  // maybe not terrible but a lot of hsumming.
        // doing ldp q2, q3 to load 32 bytes per iteration
    }
    return sum;
}

compiled on https://godbolt.org/z/7qP7cK6bv with -O --target aarch64-unknown-linux-gnu (with the default -C opt-level=3).

... some setup
.LBB1_5:                                 // inner loop
        ldp     q2, q3, [x8, #-16]       // load pair of 16-byte vectors = unroll by 4x u64
        add     x8, x8, #32              // pointer increment by 32 bytes
        subs    x12, x12, #4
        cnt     v2.16b, v2.16b
        cnt     v3.16b, v3.16b
        uaddlp  v2.8h, v2.16b            // hsum byte pairs to u16 halfwords
        uaddlp  v3.8h, v3.16b
        uaddlp  v2.4s, v2.8h             // hsum u16 pairs to u32 words
        uaddlp  v3.4s, v3.8h
        uadalp  v0.2d, v2.4s          // sum u32 pairs into accumulator u64 vectors (doublewords)
        uadalp  v1.2d, v3.4s
        b.ne    .LBB1_5
        add     v0.2d, v1.2d, v0.2d        // combine pair of accumulators
        cmp     x10, x11
        addp    d0, v0.2d                  // and reduce to one 64-bit
        fmov    x8, d0
        b.eq    .LBB1_9
.LBB1_7:
        add     x10, x0, x1, lsl #3
.LBB1_8:
        ldr     d0, [x9], #8               // scalar cleanup loop, one u64 at a time
        cmp     x9, x10
        cnt     v0.8b, v0.8b
        uaddlv  h0, v0.8b                  // slow instruction, as Jake mentioned.  Or at least high latency
        fmov    w11, s0
        add     x8, x11, x8
        b.ne    .LBB1_8
.LBB1_9:
        mov     x0, x8
        ret

You'd think sum: u32 would help, requiring less widening inside the loop. (If you have huge arrays that could overflow that, use an outer loop.) But actually, I think rustc still widens to 64-bit, then does even more work to truncate those counts to 32-bit. Almost certainly a missed optimization. (Perhaps a strategy built around x86 psadbw, which does sum bytes into u64 chunks in one step; LLVM auto-vectorizes popcount with pshufb and psadbw for AVX2 on x86.)
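
As a concrete illustration of that outer-loop idea (an editor's sketch, not from the original answer; the function name is made up, and whether it actually avoids the extra truncation work would need checking on Godbolt): keep the inner accumulator as u32 and cap how many u64 elements feed it before spilling into a u64 total, so the u32 can never overflow.

pub fn hamming_weight_arr_u32(inp: &[u64]) -> u64 {
    let mut total: u64 = 0;
    // Each u64 contributes at most 64, so 2^20 elements add at most 2^26,
    // far below u32::MAX; the inner u32 sum can't overflow.
    for chunk in inp.chunks(1 << 20) {
        let mut sum: u32 = 0;
        for v in chunk {
            sum += v.count_ones();   // count_ones() already returns u32
        }
        total += sum as u64;
    }
    total
}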

And you'd think doing the same thing for a u8 array should vectorize to the same code, with just some extra scalar cleanup, right? Well it should, but actually it still only unrolls by 4 like LLVM likes to do, but that's 4 u8 elements with the inner loop becoming total garbage:

// &[u8] version  inner loop is a disaster

LBB2_5:
        ldurb   w12, [x8, #-2]            // scalar zero-extending byte load
        subs    x11, x11, #4
        ldrb    w13, [x8]                 // scalar sign-extending byte load
        fmov    d2, x12                   // copy it into vector reg
        ldurb   w12, [x8, #-1]
        fmov    d3, x13
        ldrb    w13, [x8, #1]
        add     x8, x8, #4
        mov     v2.d[1], x12              // get two more bytes into the high 64 bits of a vector
        mov     v3.d[1], x13
        cnt     v2.16b, v2.16b           // same cnt / uaddlp sequence as before
        cnt     v3.16b, v3.16b
        uaddlp  v2.8h, v2.16b
        uaddlp  v3.8h, v3.16b
        uaddlp  v2.4s, v2.8h
        uaddlp  v3.4s, v3.8h
        uadalp  v0.2d, v2.4s
        uadalp  v1.2d, v3.4s
        b.ne    .LBB2_5

So it's vectorizing (v as u64).count_ones(), and using canned recipes that it doesn't know how to optimize. (e.g. the uaddlp widening is pointless when the cnt results are zero except in the low byte of each 64-bit chunk; they could just use a vertical add with 64-bit chunks.)
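
One portable way to get the good u64 code for a byte array (an editor's sketch, not from the original answer; the function name is made up) is to re-chunk the &[u8] into u64s in safe Rust so the u64 loop above applies, with a scalar tail for the last few bytes. Whether LLVM turns the from_le_bytes reassembly back into plain vector loads should be verified, but it at least sidesteps the per-byte shuffling shown above.

pub fn hamming_weight_bytes(inp: &[u8]) -> u64 {
    let mut sum: u64 = 0;
    let mut chunks = inp.chunks_exact(8);
    for c in &mut chunks {
        // Reassemble 8 bytes into a u64; endianness doesn't matter for popcount.
        sum += u64::from_le_bytes(c.try_into().unwrap()).count_ones() as u64;
    }
    for &b in chunks.remainder() {
        sum += b.count_ones() as u64;
    }
    sum
}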


Compare against what you get from a C compiler for manually vectorized ARM code from https://github.com/WojciechMula/sse-popcount/. If the ARM Neon code is as well-tuned as the x86 code in the same repo, that's what you should be aiming for in terms of final asm output from your compiler, however you get there.

I'd guess that an inner loop that only widens the per-byte counts to 16-bit would be good, running for as many iterations as possible without overflowing a u16, given that each iteration can add up to 16 to a lane (all-ones in the pair of bytes feeding into it): i.e. 65535/16 rounded down = 4095 vectors before widening out to 64-bit chunks.

Mula's version does only 31 inner-loop iterations, accumulating into 8-bit elements with vaddq_u8. But uadalp v0.8h, v2.16b horizontal-pair accumulation of u8 bytes into u16 halfword elements wasn't available for 32-bit NEON, only for AArch64 ASIMD.

The key point is to do the minimum amount of work in the inner-most loop, ideally using only one cheap instruction per vector of cnt results. If you can get some widening for free as you accumulate (or cheap enough not to be a bottleneck), that lets execution stay in the inner loop for much longer without possible overflow. (And means the later horizontal reduction step in the outer loop is cheaper.)
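
Rust does have stable NEON intrinsics for AArch64 in std::arch::aarch64 (since Rust 1.59), so this strategy can be written without inline asm. The following is an editor's sketch under that assumption (untested; the function name is made up): one vpadalq_u8 (uadalp) per vector of cnt results, with the u16 accumulator drained into the u64 total at most every 4095 vectors so it can't overflow.

#[cfg(target_arch = "aarch64")]
pub fn hamming_weight_neon(inp: &[u8]) -> u64 {
    use std::arch::aarch64::*;
    let mut total: u64 = 0;
    // Outer loop: at most 4095 vectors per block, since each uadalp can add up to
    // 16 per u16 lane and 65535 / 16 rounds down to 4095.
    for block in inp.chunks(4095 * 16) {
        let mut vecs = block.chunks_exact(16);
        let tail = vecs.remainder();
        unsafe {
            let mut acc: uint16x8_t = vdupq_n_u16(0);
            for v in &mut vecs {
                let bytes = vld1q_u8(v.as_ptr());  // 16-byte vector load
                let counts = vcntq_u8(bytes);      // cnt: per-byte popcounts
                acc = vpadalq_u8(acc, counts);     // uadalp: pairwise add into u16 lanes
            }
            total += vaddlvq_u16(acc) as u64;      // uaddlv: reduce the 8 u16 lanes
        }
        // Scalar tail; only non-empty for the final partial vector of the array.
        for &b in tail {
            total += b.count_ones() as u64;
        }
    }
    total
}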

uadalp has somewhat different performance from pure-vertical add on some CPUs, but very likely worth using. The Cortex-A57 optimization guide says it has 4 (1) cycle latency, 1/clock throughput. The (1) part is the accumulate latency for the destination operand; it allows late forwarding from a previous operation of the same type, after the horizontal-pair addition of source elements has already happened. So in a loop using sum += pairs with uadalp, the loop-carried dependency chain is only 1 cycle long. This is ideal.

Regular add on integer vectors is 3 cycle latency, 2/clock throughput, so it has better throughput but creates a 3-cycle loop-carried dependency chain. (And doesn't get a step of horizontal accumulation work done, and will overflow much sooner because it's only using 8-bit sums.)

On Cortex-A57, cnt is also only 1/clock throughput, so uadalp's 1/clock throughput isn't an overall bottleneck, unless they compete for the same execution port. uadalp runs on F1, SIMD-integer add runs on F0/F1, and cnt runs on F0/F1. So either way, addition operations need to steal some cnt throughput, and the question becomes whether cnt can get scheduled efficiently onto mostly just port F0 when F1 has a bunch of future uadalp operations queued up. (On data that isn't ready yet; out-of-order exec looks ahead. On most CPUs, operations are assigned to ports as they rename/allocate from the front-end into the back-end. The CPU doesn't know what order the data will become ready in, but it can see which port's queue is longer.)

It would be possible to do (pseudocode)

// Probably not a good idea
c1 = cnt(load1)
c2 = cnt(load2)
c1 += c2    // plain SIMD vertical add
uadalp accumulator, c1

That means only one uadalp in the loop, in case that's a throughput bottleneck, while still only using uadalp into the accumulator which keeps the loop-carried dependency chain through the accumulator short. (Assuming other AArch64 CPUs also do early forwarding for accumulate instructions).

(And it makes the short independent dependency chain within one iteration slightly longer, from load to input of accumulation, instead of keeping it as two separate dep chains.) Low-end ARM CPUs generally have very limited out-of-order execution capabilities (small scheduler buffers) for finding instruction-level parallelism and overlapping work across loop iterations. Keeping the non-loop-carried dep chains short makes that easier for the CPU. And it totally sucks on an in-order AArch64 core like Cortex-A53 unless you're unrolling a lot.
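
For concreteness, the pseudocode above would look something like this with the same std::arch::aarch64 intrinsics (an editor's sketch, untested; the function name is made up). Note that the combined counts can reach 16 per byte, so a u16 accumulator would now overflow after roughly half as many iterations (65535/32).

#[cfg(target_arch = "aarch64")]
unsafe fn accumulate_two_vectors(
    acc: std::arch::aarch64::uint16x8_t,
    p1: *const u8,
    p2: *const u8,
) -> std::arch::aarch64::uint16x8_t {
    use std::arch::aarch64::*;
    let c1 = vcntq_u8(vld1q_u8(p1)); // c1 = cnt(load1)
    let c2 = vcntq_u8(vld1q_u8(p2)); // c2 = cnt(load2)
    let c = vaddq_u8(c1, c2);        // plain SIMD vertical add (c1 += c2)
    vpadalq_u8(acc, c)               // uadalp accumulator, c1
}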
