Optimizing FMA sequences on different ARM64 micro-architectures
In order to optimize a heavily used inner loop (a 3x3xN tensor convolution in the Winograd domain), I got some good results by using the maximum number of NEON registers (32) and trying to read as few coefficients/data values as possible relative to the number of arithmetic operations.
As expected, the larger kernel outperformed the first approach by some 15-25% on a MacBook M1, on iPhones (SE 2020, iPhone 8+) and on the Exynos 9820 (Exynos M4 / Cortex-A75 micro-architecture). However, to my great surprise, the larger kernel was up to 100% slower on the Exynos 9611 (Cortex-A73/Cortex-A53).
My first kernel split the convolution into four loops of this kind, each processing two outputs and formed as below (recombining the accumulators in between).
3c0b50:
ldr q0, [x6] // loads 4 coefficients
ldp q25, q27, [x2] // loads 8 data
ldr q26, [x6, x16] // 4 more coefficients
add x6, x6, #16
subs w19, w19, #1
fmla v23.4s, v25.4s, v0.s[0]
fmla v19.4s, v25.4s, v26.s[0]
fmla v17.4s, v27.4s, v0.s[1]
fmla v18.4s, v27.4s, v26.s[1]
ldp q25, q27, [x2, #32] // 8 more data
add x2, x2, #64
fmla v22.4s, v25.4s, v0.s[2]
fmla v20.4s, v25.4s, v26.s[2]
fmla v24.4s, v27.4s, v0.s[3]
fmla v21.4s, v27.4s, v26.s[3]
b.ne 0x3c0b50
In this variant we have 8 accumulators, 2 registers for data and 4 registers for coefficients, 4 instructions of overhead, 8 arithmetic instructions and 4 memory-access instructions, i.e. two FMAs per load instruction. The loop trip count is typically on the order of 8..64.
The second variant has 24 accumulators, 24 arithmetic instructions, 7 load instructions and 2 overhead instructions, i.e. roughly 3.4 FMAs per load instruction.
0x3c4110:
ldp q0, q1, [x4], #32
ldp q4, q5, [x5], #32
ldp q6, q7, [x5], #32
fmla v8.4s, v4.4s, v0.s[0]
fmla v9.4s, v4.4s, v0.s[1]
fmla v10.4s, v4.4s, v0.s[2]
ldp q2, q3, [x4], #32
fmla v11.4s, v5.4s, v0.s[3]
fmla v12.4s, v5.4s, v1.s[0]
fmla v13.4s, v5.4s, v1.s[1]
ldp q4, q5, [x5], #32 // reload q4,q5 just after they are consumed
fmla v14.4s, v6.4s, v1.s[2]
fmla v15.4s, v6.4s, v1.s[3]
fmla v16.4s, v6.4s, v2.s[0]
ldp q0, q1, [x4], #32 // reload q0,q1 just after they are consumed
fmla v17.4s, v7.4s, v2.s[1]
fmla v18.4s, v7.4s, v2.s[2]
fmla v19.4s, v7.4s, v2.s[3]
ldp q6, q7, [x5], #32 // reload q6,q7 just after they are consumed
add x3, x3, #1
fmla v20.4s, v4.4s, v3.s[0]
fmla v21.4s, v4.4s, v3.s[1]
fmla v22.4s, v4.4s, v3.s[2]
fmla v23.4s, v5.4s, v3.s[3]
fmla v24.4s, v5.4s, v0.s[0]
fmla v25.4s, v5.4s, v0.s[1]
fmla v26.4s, v6.4s, v0.s[2]
fmla v27.4s, v6.4s, v0.s[3]
fmla v28.4s, v6.4s, v1.s[0]
fmla v29.4s, v7.4s, v1.s[1]
fmla v30.4s, v7.4s, v1.s[2]
fmla v31.4s, v7.4s, v1.s[3]
tbz w3, #11, 0x3c4110
In addition to these inner loops, the undisclosed code initializes the accumulators and performs the row- and column-wise Winograd output transformation (spilling to memory). I do not want to expose all that code, which I hope is irrelevant to the performance; instead, I'm asking whether there is some easily spotted problem with the larger kernel that makes it perform so much less efficiently on the Cortex-A73 processors.
EDIT
What I can spot from the loops is that neither label is aligned to a cache line. The smaller loop is, by the way, exactly 16 instructions, or 64 bytes (one cache line). The other loop is 33 instructions; there is a remote possibility of inferring the branch condition from a local temporary data register (tbz x5, #??, 0x3c4110), which would remove the add x3, x3, #1 and bring the instruction count down to 32. Then it would also make sense to align the loop start to a cache-line boundary.
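For illustration, this is a minimal sketch in GNU assembler syntax of what that alignment could look like for the smaller loop (the label name and directives are illustrative additions, since the listing above is disassembly of generated code): .p2align 6 pads with NOPs up to the next 64-byte boundary, so the 16-instruction body then occupies exactly one cache line.
.text
.p2align 6                       // pad with nops to the next 64-byte boundary
small_loop:                      // 16 instructions = 64 bytes = one cache line
    ldr  q0, [x6]                // 4 coefficients
    ldp  q25, q27, [x2]          // 8 data values
    ldr  q26, [x6, x16]          // 4 more coefficients
    add  x6, x6, #16
    subs w19, w19, #1
    fmla v23.4s, v25.4s, v0.s[0]
    fmla v19.4s, v25.4s, v26.s[0]
    fmla v17.4s, v27.4s, v0.s[1]
    fmla v18.4s, v27.4s, v26.s[1]
    ldp  q25, q27, [x2, #32]     // 8 more data values
    add  x2, x2, #64
    fmla v22.4s, v25.4s, v0.s[2]
    fmla v20.4s, v25.4s, v26.s[2]
    fmla v24.4s, v27.4s, v0.s[3]
    fmla v21.4s, v27.4s, v26.s[3]
    b.ne small_loop
The same .p2align 6 in front of the larger loop's entry would serve the same purpose.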
Update
There are some slight improvements from applying the suggestions in the comments, i.e. reading with ldp q0,q1,[x0], 128; ldp q2,q3,[x0, #-112] (execution time reduced from 194 ms to 190 ms on a very low-end device). So far this suggests the problem is not necessarily in the inner loops per se; the memory accesses differ only very slightly between the two approaches (the number of arithmetic operations is the same, the number of coefficients read is the same, but the larger kernel shares the data slightly more). It's possible that the cache hierarchy plays tricks on the A53 and A73 architectures alike.
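To make the suggested change concrete, here is a sketch of the pattern for four contiguous quadword pairs (the base register x0 and the offsets are illustrative and assume a plain contiguous 128-byte block, which is not exactly the layout of the larger loop above); the intent, as I understand it, is to perform only one base-register write-back per block instead of one per ldp:
    ldp  q0, q1, [x0], #128      // load the first pair, then x0 += 128
    ldp  q2, q3, [x0, #-96]      // remaining pairs via immediate offsets,
    ldp  q4, q5, [x0, #-64]      //   with no further write-back to x0
    ldp  q6, q7, [x0, #-32]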
Another undisclosed factor is that we are of course multithreading, and the big.LITTLE architecture can paradoxically slow down when the kernel executes faster -- at least if the output is synchronised to the frame rate. In that case the OS can counterintuitively decide that a fast core is too idle after finishing all its tasks and switch the work to a low-power core, where it then spends all of the allotted time. This is in any case an issue we thought had been resolved earlier -- see https://stackoverflow.com/a/64243494/1716339.