Efficiently computing Hamming weights
I'm on an Apple M1 processor.

What I'm trying to do is efficiently count the 1 bits in a large char array in Rust. I looked up the ARM NEON instructions, and I think I can do it with the CNT instruction (which counts the 1 bits per 8-bit chunk) and then sum each of those 8-bit chunks with ADDV.

To start, I figured I would feed it a 64-bit uint:
use std::arch::asm;

fn aarch64_hamming_weight(inp: u64) -> u64 {
    let mut outp: u64;
    unsafe {
        asm!(
            "dup v0.2d, {input}",
            "cnt v0.16b, v0.16b",
            "addv b1, v0.16b", // this adds all the bytes of the vector, how do i get the result?!
            //"fmov {output2}, b1",
            "fmov {output}, v0.d[1]",
            input = in(reg) inp,
            output = out(reg) outp,
        )
    }
    return outp;
}
This almost works, but if your number is bigger than 8 bits then the result isn't quite right: the binary count of 225 is 4, but the binary count of 425 comes out as 260.
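(A guess at where that 260 comes from; this reading of the asm is mine, not part of the original question.)

425 = 0x01A9, so its two low bytes are 0xA9 (popcount 4) and 0x01 (popcount 1)
after cnt, v0 holds those per-byte counts: [4, 1, 0, ..., 4, 1, 0, ...]
"fmov {output}, v0.d[1]" copies out a whole 64-bit lane of that vector: 0x0104 = 260
the addv sum lands in b1 (v1), which the code never reads back; 225 fits in a single byte, so the same read happens to give 4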
When I started writing an answer, I thought it was going to show that inline asm was totally pointless, and that rustc would do a good job vectorizing u8.count_ones(). That is unfortunately not the case, but you should only use asm if you're going to write a whole loop, not one u64 at a time. (Possibly still inline asm, but writing a whole function might be reasonable.)

If Rust has SIMD intrinsics for ARM/AArch64, that would be a much better choice than inline asm, and definitely worth using.
For counting a whole array, you don't want to reduce every cnt result to scalar separately, especially not all the way to a general-purpose integer register. Especially not for a single u64 at a time instead of a full 128 bits. Add into a vector of counts that you only reduce at the end.

clang/LLVM knows how to auto-vectorize __builtin_popcount() over an array in C, so I was hoping the LLVM-based rustc would do ok for AArch64. It is somewhat ok for u64, but terrible for u8 unfortunately. If you can safely point a u64 span at your data (or reference, or however Rust does things?), then you can get a good fraction of the benefit portably with no brittle inline asm, and in a way that future compiler versions can hopefully improve on.

Compiled on https://godbolt.org/z/7qP7cK6bv with -O --target aarch64-unknown-linux-gnu (with the default -C opt-level=3).
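(The Rust source shown at that Godbolt link isn't reproduced in this copy of the answer; the sketch below is a guess at the kind of plain count_ones() loop being described, and the function name is made up.)

pub fn hamming_weight_u64(inp: &[u64]) -> u64 {
    let mut sum: u64 = 0;
    for v in inp {
        // count_ones() is Rust's portable popcount; the hope is that the
        // compiler auto-vectorizes this whole loop rather than popcounting
        // one u64 at a time.
        sum += v.count_ones() as u64;
    }
    sum
}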
You'd think sum: u32 would help, requiring less widening inside the loop. (If you have huge arrays that could overflow that, use an outer loop.) But actually, rustc still widens to 64-bit I think, but then does even more work to truncate those counts to 32-bit. Almost certainly a missed optimization. (Perhaps a strategy built around x86 psadbw, which does sum bytes into u64 chunks in one step; LLVM auto-vectorizes popcount with pshufb and that for AVX2 on x86.)

And you'd think doing the same thing for a u8 array should vectorize to the same code, with just some extra scalar cleanup, right? Well it should, but actually it still only unrolls by 4 the way LLVM likes to do, and that's 4 u8 elements at a time, with the inner loop becoming total garbage.
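(The u8 version being discussed would look much the same; again this is a sketch with a made-up name, not the original answer's code.)

pub fn hamming_weight_u8(inp: &[u8]) -> u64 {
    let mut sum: u64 = 0;
    for v in inp {
        // same portable popcount, one byte at a time
        sum += v.count_ones() as u64;
    }
    sum
}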
So it's vectorizing (v as u64).count_ones(), and using canned recipes that it doesn't know how to optimize. (e.g. the uaddlp widening is pointless if the cnt results are zero except in the low byte of each 64-bit chunk; they could just use a vertical add with 64-bit chunks.)

Compare against what you get from a C compiler for manually vectorized ARM code from https://github.com/WojciechMula/sse-popcount/. If the ARM NEON code is as well-tuned as the x86 code in the same repo, that's what you should be aiming for in terms of final asm output from your compiler, however you get there.
I'd guess that an inner loop that only widens the per-byte counts to 16-bit would be good, running up to the number of iterations that are possible without overflowing a u16 with the += 16 per element that could come from seeing all-ones in the pair of bytes feeding into it. i.e. 65535/16 rounded down = 4095 vectors before widening out to 64-bit chunks.

Mula's version does only 31 inner-loop iterations, accumulating into 8-bit elements with vaddq_u8. But uadalp v0.8h, v2.16b horizontal-pair accumulation of u8 bytes into u16 halfword elements wasn't available for 32-bit NEON, only for AArch64 ASIMD.
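(Not from the original answer: as a concrete illustration of that shape of inner loop, here is a rough sketch using Rust's std::arch::aarch64 intrinsics, where vcntq_u8 maps to cnt and vpadalq_u8 maps to uadalp. The function name and the chunking convention are my own.)

#[cfg(target_arch = "aarch64")]
unsafe fn popcount_neon_chunk(chunk: &[u8]) -> u64 {
    use std::arch::aarch64::*;
    // Caller must pass a multiple of 16 bytes, at most 4095 * 16 bytes,
    // so the u16 accumulators cannot overflow (65535 / 16 = 4095, as above).
    debug_assert!(chunk.len() % 16 == 0 && chunk.len() <= 4095 * 16);
    let mut acc16 = vdupq_n_u16(0);           // eight u16 accumulators
    let mut p = chunk.as_ptr();
    for _ in 0..chunk.len() / 16 {
        let v = vld1q_u8(p);                  // load 16 bytes
        let counts = vcntq_u8(v);             // cnt: per-byte popcounts (0..=8)
        acc16 = vpadalq_u8(acc16, counts);    // uadalp: pairwise-add bytes into the u16 accumulators
        p = p.add(16);
    }
    vaddlvq_u16(acc16) as u64                 // one horizontal reduction at the very end
}

A caller would split the array into such chunks and add the per-chunk totals into a u64.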
The key point is to do the minimum amount of work in the inner-most loop, ideally using only one cheap instruction per vector of cnt results. If you can get some widening for free as you accumulate (or cheaply enough not to be a bottleneck), that lets execution stay in the inner loop for much longer without possible overflow. (And it means the later horizontal reduction step in the outer loop is cheaper.)

uadalp has somewhat different performance from a pure-vertical add on some CPUs, but it's very likely worth using. The Cortex-A57 optimization guide says it has 4 (1) cycle latency, 1/clock throughput. The (1) part is the accumulate latency for the destination operand; it allows late forwarding from a previous operation of the same type, after the horizontal-pair addition of the source elements has already happened. So in a loop doing sum += pairs with uadalp, the loop-carried dependency chain is only 1 cycle long. This is ideal.

Regular add on integer vectors is 3 cycle latency, 2/clock throughput, so it has better throughput but creates a 3-cycle loop-carried dependency chain. (And it doesn't get a step of horizontal accumulation work done, and it will overflow much sooner because it's only using 8-bit sums.)
On Cortex-A57, cnt is also only 1/clock throughput, so uadalp's 1/clock throughput isn't an overall bottleneck, unless they compete for the same execution port. uadalp runs on F1, SIMD-integer add runs on F0/F1, and cnt runs on F0/F1. So either way, addition operations need to steal some cnt throughput, and the question becomes whether cnt can get scheduled efficiently to mostly just port F0 when F1 has a bunch of future uadalp operations queued up. (On data that isn't ready yet; out-of-order exec looks ahead. On most CPUs, operations are scheduled to ports as they rename/allocate from the front-end into the back-end. The CPU doesn't know what order the data will become ready in, but it can see how the port queue lengths differ at that point.)

It would be possible to do (pseudocode):
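(The pseudocode block from the original answer appears to have been dropped here; based on the description that follows, it presumably looked something like this loop body, written with the same intrinsic names as the sketch above.)

// inner loop body, consuming two 16-byte vectors per iteration
let c1 = vcntq_u8(vld1q_u8(p));            // cnt of the first vector
let c2 = vcntq_u8(vld1q_u8(p.add(16)));    // cnt of the second vector
let both = vaddq_u8(c1, c2);               // plain vertical add of the two cnt results
acc16 = vpadalq_u8(acc16, both);           // only one uadalp per two input vectors
p = p.add(32);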
That means only one uadalp in the loop, in case that's a throughput bottleneck, while still only using uadalp into the accumulator, which keeps the loop-carried dependency chain through the accumulator short. (Assuming other AArch64 CPUs also do early forwarding for accumulate instructions.) And it makes the short independent dependency chain within one iteration slightly longer (from load to input of accumulation), instead of keeping it as two separate dep chains.

Low-end ARM CPUs generally have very limited out-of-order execution capabilities (small scheduler buffers) to find instruction-level parallelism and overlap work across loop iterations. Keeping the non-loop-carried dep chains short makes that easier for the CPU. And it totally sucks for an in-order AArch64 like Cortex-A53 unless you're unrolling a lot.
into the accumulator which keeps the loop-carried dependency chain through the accumulator short. (Assuming other AArch64 CPUs also do early forwarding for accumulate instructions).And it makes the short independent dependency chain within one iteration slightly longer (from load to input of accumulation), instead of keeping it as two separate dep chains.) Low-end ARM CPUs generally have very limited out-of-order execution capabilities (small scheduler buffers) to find instruction-level parallelism and overlap work across loop iterations. Keeping the non-loop-carried dep chains short makes that easier for the CPU. And it totally sucks for an in-order AArch64 like Cortex-A53 unless you're unrolling a lot.