Optimizing FMA sequences on different ARM64 micro-architectures
In order to optimize a heavily used inner loop (a 3x3xN tensor convolution in the Winograd domain), I got some good results by using the maximum number of NEON registers (32) and trying to read as few coefficients/data values as possible relative to the number of arithmetic operations.
As expected, the larger kernel outperformed the first approach by some 15-25% on a MacBook M1, on iPhones (SE 2020, iPhone 8+) and on the Exynos 9820 (Exynos M4 / Cortex-A75 micro-architecture). However, to my great surprise, the larger kernel was up to 100% slower on the Exynos 9611 (Cortex-A73/Cortex-A53).
My first kernel split the convolution into four loops of this kind, each processing two outputs and formed as below (recombining the accumulators in between).
3c0b50:
ldr q0, [x6] // loads 4 coefficients
ldp q25, q27, [x2] // loads 8 data
ldr q26, [x6, x16] // 4 more coefficients
add x6, x6, #16
subs w19, w19, #1
fmla v23.4s, v25.4s, v0.s[0]
fmla v19.4s, v25.4s, v26.s[0]
fmla v17.4s, v27.4s, v0.s[1]
fmla v18.4s, v27.4s, v26.s[1]
ldp q25, q27, [x2, #32] // 8 more data
add x2, x2, #64
fmla v22.4s, v25.4s, v0.s[2]
fmla v20.4s, v25.4s, v26.s[2]
fmla v24.4s, v27.4s, v0.s[3]
fmla v21.4s, v27.4s, v26.s[3]
b.ne 0x3c0b50
In this variant we have 8 accumulators, 2 registers for data and 4 registers for coefficients, 4 instructions of overhead, 8 arithmetic instructions and 4 memory-access instructions, i.e. two FMAs per load instruction. The loop trip count is typically on the order of 8..64.
The second variant has 24 accumulators, 24 arithmetic instructions, 7 load instructions and 2 overhead instructions, i.e. roughly 3.4 FMAs per load instruction.
0x3c4110:
ldp q0, q1, [x4], #32
ldp q4, q5, [x5], #32
ldp q6, q7, [x5], #32
fmla v8.4s, v4.4s, v0.s[0]
fmla v9.4s, v4.4s, v0.s[1]
fmla v10.4s, v4.4s, v0.s[2]
ldp q2, q3, [x4], #32
fmla v11.4s, v5.4s, v0.s[3]
fmla v12.4s, v5.4s, v1.s[0]
fmla v13.4s, v5.4s, v1.s[1]
ldp q4, q5, [x5], #32 // reload q4,q5 just after they are consumed
fmla v14.4s, v6.4s, v1.s[2]
fmla v15.4s, v6.4s, v1.s[3]
fmla v16.4s, v6.4s, v2.s[0]
ldp q0, q1, [x4], #32 // reload q0,q1 just after they are consumed
fmla v17.4s, v7.4s, v2.s[1]
fmla v18.4s, v7.4s, v2.s[2]
fmla v19.4s, v7.4s, v2.s[3]
ldp q6, q7, [x5], #32 // reload q6,q7 just after they are consumed
add x3, x3, #1
fmla v20.4s, v4.4s, v3.s[0]
fmla v21.4s, v4.4s, v3.s[1]
fmla v22.4s, v4.4s, v3.s[2]
fmla v23.4s, v5.4s, v3.s[3]
fmla v24.4s, v5.4s, v0.s[0]
fmla v25.4s, v5.4s, v0.s[1]
fmla v26.4s, v6.4s, v0.s[2]
fmla v27.4s, v6.4s, v0.s[3]
fmla v28.4s, v6.4s, v1.s[0]
fmla v29.4s, v7.4s, v1.s[1]
fmla v30.4s, v7.4s, v1.s[2]
fmla v31.4s, v7.4s, v1.s[3]
tbz w3, #11, 0x3c4110
In addition to these inner loops, the undisclosed code initializes the accumulators and performs the row- and column-wise Winograd output transformation (spilling to memory). I do not want to expose all that code, which I hope is irrelevant to the performance; instead, I'm asking whether there is some easily spotted problem with the larger kernel that makes it perform so much less efficiently on the Cortex-A73 processors.
EDIT
What I can spot from the loops is that neither label is aligned to a cache line. The smaller loop is, by the way, exactly 16 instructions, or 64 bytes (one cache line). The other loop is 33 instructions; there is a remote possibility of inferring the branch condition from a local temporary data register (tbz x5, #??, 0x3c4110), which would remove the add x3, x3, #1 and bring the instruction count down to 32. Then it would also make sense to align the loop start to a cache-line boundary.
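For illustration, this is a minimal sketch in GNU assembler syntax of what that alignment could look like for the smaller loop (the label name and directives are illustrative additions, since the listing above is disassembly of generated code): .p2align 6 pads with NOPs up to the next 64-byte boundary, so the 16-instruction body then occupies exactly one cache line.
.text
.p2align 6                       // pad with nops to the next 64-byte boundary
small_loop:                      // 16 instructions = 64 bytes = one cache line
    ldr  q0, [x6]                // 4 coefficients
    ldp  q25, q27, [x2]          // 8 data values
    ldr  q26, [x6, x16]          // 4 more coefficients
    add  x6, x6, #16
    subs w19, w19, #1
    fmla v23.4s, v25.4s, v0.s[0]
    fmla v19.4s, v25.4s, v26.s[0]
    fmla v17.4s, v27.4s, v0.s[1]
    fmla v18.4s, v27.4s, v26.s[1]
    ldp  q25, q27, [x2, #32]     // 8 more data values
    add  x2, x2, #64
    fmla v22.4s, v25.4s, v0.s[2]
    fmla v20.4s, v25.4s, v26.s[2]
    fmla v24.4s, v27.4s, v0.s[3]
    fmla v21.4s, v27.4s, v26.s[3]
    b.ne small_loop
The same .p2align 6 in front of the larger loop's entry would serve the same purpose.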
Update
There are some slight improvements from applying the suggestions in the comments, i.e. reading with ldp q0,q1,[x0], 128; ldp q2,q3,[x0, #-112] (execution time reduced from 194 ms to 190 ms on a very low-end device). So far this suggests the problem is not necessarily in the inner loops per se; the memory accesses differ only very slightly between the two approaches (the number of arithmetic operations is the same, the number of coefficients read is the same, but the larger kernel shares the data slightly more). It's possible that the cache hierarchy plays tricks on the A53 and A73 architectures alike.
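To make the suggested change concrete, here is a sketch of the pattern for four contiguous quadword pairs (the base register x0 and the offsets are illustrative and assume a plain contiguous 128-byte block, which is not exactly the layout of the larger loop above); the intent, as I understand it, is to perform only one base-register write-back per block instead of one per ldp:
    ldp  q0, q1, [x0], #128      // load the first pair, then x0 += 128
    ldp  q2, q3, [x0, #-96]      // remaining pairs via immediate offsets,
    ldp  q4, q5, [x0, #-64]      //   with no further write-back to x0
    ldp  q6, q7, [x0, #-32]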
Another undisclosed factor is that we are of course multithreading, and the big.LITTLE architecture can paradoxically slow down when the kernel executes faster -- at least if the output is synchronised to the frame rate. In that case the OS can counterintuitively decide that a fast core is too idle after finishing all its tasks and switch the work to a low-power core, where it then spends all of the allotted time. This is in any case an issue we thought had been resolved earlier -- see https://stackoverflow.com/a/64243494/1716339.