NEON asm code running much slower than C code?

Published 2024-11-07 15:52:37


I'm trying to implement Gauss-Newton optimization for a specific problem on iPhone ARM using NEON. The first function below is my original C function. The second is the NEON asm code I wrote. I ran each one 100,000 times, and the NEON version takes 7-8 times longer than the C version. I think the loading (vld1.32) is what takes most of the time; I experimented by removing some instructions.

Does anyone have any insight into this problem? Thanks!

template<class T>
inline void GaussNewtonOperationJtr8x8(T Jtr[8], const T J[8], T residual)
{
    Jtr[0] -= J[0]*residual;
    Jtr[1] -= J[1]*residual;
    Jtr[2] -= J[2]*residual;
    Jtr[3] -= J[3]*residual;
    Jtr[4] -= J[4]*residual;
    Jtr[5] -= J[5]*residual;
    Jtr[6] -= J[6]*residual;
    Jtr[7] -= J[7]*residual;    
}

inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
    __asm__ volatile (
                      // load Jtr into registers
                      "vld1.32   {d0-d3}, [%0]\n\t"
                      // load J into registers
                      "vld1.32   {d4-d7}, [%1]\n\t"
                      // load residual in register
                      "vmov.f32  s16, %2\n\t"
                      // Jtr -= J*residual
                      "vmls.f32  q0, q2, d8[0]\n\t"
                      "vmls.f32  q1, q3, d8[0]\n\t"
                      // store result
                      "vst1.32   {d0-d3}, [%0]\n\t"
                      // output
                      :
                      // input
                      : "r"(Jtr), "r"(J), "r"(residual)
                      // registers
                      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14"
                      );
}


Comments (4)

不爱素颜 2024-11-14 15:52:37
  1. Don't use d8-d15. They are callee-saved, so they have to be preserved on the stack before use and restored afterwards. The compiler will emit instructions doing this, wasting valuable cycles.
  2. Load J before Jtr. Jtr is needed at a later pipeline stage than J.
  3. Use VLDMIA/VSTMIA instead of VLD/VST. VLDMIA/VSTMIA is faster and has advantages pipeline-wise.
  4. Use vector-vector multiplication instead of vector-scalar multiplication.
  5. If you create a looped version, put a pld at the beginning and unroll the loop so that 64 bytes are read from each pointer per iteration.

Besides the faults mentioned above - which are typical for people new to NEON - your approach is very nice. You found the most appropriate instruction in vmls.

Well done.

__asm__ volatile (
    // broadcast residual into all four lanes of q12
    "vdup.32   q12, %2\n\t"
    // load J into registers
    "vldmia    %1, {q10-q11}\n\t"
    // load Jtr into registers
    "vldmia    %0, {q8-q9}\n\t"
    // Jtr -= J*residual
    "vmls.f32  q8, q10, q12\n\t"
    "vmls.f32  q9, q11, q12\n\t"
    // store result
    "vstmia    %0, {q8-q9}\n\t"
    // output
    :
    // input
    : "r"(Jtr), "r"(J), "r"(residual)
    // clobbered registers ("memory" because the asm writes through Jtr)
    : "q8", "q9", "q10", "q11", "q12", "memory"
);
Hello爱情风 2024-11-14 15:52:37

The compiler optimizes the assembly it generates from the C code; it doesn't just translate one into the other.

What you are trying to do is out-optimize the compiler (uh oh). Do you at least know what assembly the compiler is generating for the C code above? Well, you should if you want your assembly code to be better.

EDIT:

This thread has a great discussion about this sort of stuff:
Why ARM NEON not faster than plain C++?

蓝色星空 2024-11-14 15:52:37

You're switching between NEON and VFP instructions. There's a penalty for doing so on both the Cortex-A8 and A9. Get rid of that VFP vmov.f32 instruction and also make sure that this code isn't inlined into places that use VFP code unless there's a long run of such code to justify the pipeline context switch.

柠栀 2024-11-14 15:52:37

Is your C++ version actually using floats? I can't tell because you only gave the template and didn't show which instantiation you used. It's very strange that NEON would be drastically slower than VFP on Cortex-A8 for this code, but for u32s I could see it possibly working out that way.

I don't know what the ABI is, but there could be some overhead in how the residual is passed (that is, in what the compiler does to get it into that %2 register). Try passing a pointer instead and using a single-lane vld1 - you can load just one float into a NEON register that way.

You'll get better performance out of the arrays if you use 16-byte aligned loads and stores, but you may have to play some games to get the inputs to work that way. Unfortunately, you'll never get really great performance out of this, because you can't avoid most of the latency of the vmls instruction, which is lengthy (due to chaining the NEON multiply and add pipelines end to end). It's worse because the dependent instruction is a store, which needs its input early in the NEON pipeline. Ideally you'd perform several of these operations at a time and interleave multiple instances together - as many as you can fit into registers.
