NEON asm code running much slower than C code?
I'm trying to implement Gauss-Newton optimization for a specific problem on iPhone ARM using NEON. The first function below is my original C function. The second is the NEON asm code I wrote. I ran each one 100,000 times, and the NEON version takes 7-8 times longer than the C version. I think the loads (vld1.32) are what take most of the time. I experimented by removing some instructions.
Does anyone have any insight into this problem? Thanks!
template<class T>
inline void GaussNewtonOperationJtr8x8(T Jtr[8], const T J[8], T residual)
{
    Jtr[0] -= J[0]*residual;
    Jtr[1] -= J[1]*residual;
    Jtr[2] -= J[2]*residual;
    Jtr[3] -= J[3]*residual;
    Jtr[4] -= J[4]*residual;
    Jtr[5] -= J[5]*residual;
    Jtr[6] -= J[6]*residual;
    Jtr[7] -= J[7]*residual;
}

inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
    __asm__ volatile (
        // load Jtr into registers
        "vld1.32 {d0-d3}, [%0]\n\t"
        // load J into registers
        "vld1.32 {d4-d7}, [%1]\n\t"
        // load residual in register
        "vmov.f32 s16, %2\n\t"
        // Jtr -= J*residual
        "vmls.f32 q0, q2, d8[0]\n\t"
        "vmls.f32 q1, q3, d8[0]\n\t"
        // store result
        "vst1.32 {d0-d3}, [%0]\n\t"
        // output
        :
        // input
        : "r"(Jtr), "r"(J), "r"(residual)
        // registers
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14"
    );
}
4 Answers
Aside from the faults I mentioned above - which are typical for people new to NEON - your approach is very nice. You found the most appropriate instruction in vmls.
Well done.
The compiler optimizes the assembly it generates from the C code; it doesn't just translate one representation to the other.
What you are trying to do is out-optimize the compiler (uh oh). Do you at least know what assembly the compiler generates for the C code above (e.g., compile with -S to see it)? You should, if you want your hand-written assembly to be better.
EDIT:
This thread has a great discussion about this sort of stuff:
Why ARM NEON not faster than plain C++?
You're switching between NEON and VFP instructions. There's a penalty for doing so on both the Cortex-A8 and A9. Get rid of that VFP vmov.f32 instruction and also make sure that this code isn't inlined into places that use VFP code unless there's a long run of such code to justify the pipeline context switch.
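One way to apply that advice (a sketch, untested on device): since residual is passed through an "r" constraint, its bit pattern already sits in a core register, so a NEON vdup can broadcast it into d8 without ever touching the VFP pipeline:

```
// Replace the VFP move
//     "vmov.f32 s16, %2\n\t"
// with a NEON-side broadcast of the scalar into d8:
"vdup.32 d8, %2\n\t"          // stays entirely in the NEON pipeline
"vmls.f32 q0, q2, d8[0]\n\t"
"vmls.f32 q1, q3, d8[0]\n\t"
```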
Is your C++ version actually using floats? I can't tell because you only gave the template and didn't show which instantiation you used. It's very strange that NEON would be drastically slower than VFP on Cortex-A8 for this code, but for u32s I could see it possibly working out that way.
I don't know what the ABI is, but there could be some overhead in how the residual is passed (that is, in whatever the compiler does to get it into that %2 register). Try passing a pointer instead and using a single-element vld1 - you can load just one float into NEON that way.
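Putting that together (an untested sketch; the pointer-taking signature, the trimmed clobber list, and the "memory" clobber are my changes, not from the original post): pass residual by address and let vld1 broadcast it into all lanes of d8, which also removes the VFP vmov that the other answer flags:

```
inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8],
                                            const NFloat *residual)
{
    __asm__ volatile (
        "vld1.32 {d0-d3}, [%0]\n\t"   // load Jtr
        "vld1.32 {d4-d7}, [%1]\n\t"   // load J
        "vld1.32 {d8[]}, [%2]\n\t"    // load residual, broadcast to all lanes
        "vmls.f32 q0, q2, d8[0]\n\t"  // Jtr[0..3] -= J[0..3]*residual
        "vmls.f32 q1, q3, d8[0]\n\t"  // Jtr[4..7] -= J[4..7]*residual
        "vst1.32 {d0-d3}, [%0]\n\t"   // store Jtr
        :
        : "r"(Jtr), "r"(J), "r"(residual)
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "memory"
    );
}
```

Note the added "memory" clobber: the original asm writes through Jtr but never tells the compiler so, which can miscompile once the function is inlined.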
You'll get better performance out of the arrays if you use 16-byte aligned loads and stores, but you may have to play some games to get the inputs to work this way. Unfortunately, you'll never get really great performance out of this, because you're not avoiding most of the latency of the vmls instruction, which is lengthy (it chains the NEON multiply and add pipelines end to end). It's made worse by the dependent instruction being a store, which needs its input early in the NEON pipeline. Ideally you'd do several of these operations at a time, interleaving as many independent instances as fit in registers.
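For illustration, a batched interface along those lines might look like this in plain C++ (the name and shape are hypothetical, not from the original post; the point is to expose several independent Jtr -= J*residual chains that a NEON version could interleave in registers):

```cpp
#include <cstddef>

// Hypothetical batched form: one call applies several (J, residual) pairs to
// Jtr, so a NEON implementation could keep multiple independent
// multiply-subtract chains in flight instead of stalling on a single vmls.
inline void GaussNewtonOperationJtr8x8_Batch(float Jtr[8],
                                             const float (*J)[8],
                                             const float *residuals,
                                             std::size_t n)
{
    for (std::size_t k = 0; k < n; ++k)   // each pair is independent...
        for (int i = 0; i < 8; ++i)       // ...until the final accumulation
            Jtr[i] -= J[k][i] * residuals[k];
}
```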