ARM assembler NEON - Improving performance
I have converted part of an algorithm from C to ARM assembly (using NEON instructions), but now it is 2x slower than the original C code. How can I improve performance?
The target is an ARM Cortex-A9.
The algorithm reads 64-bit values from an array. From each value one byte is extracted, which is then used as the lookup index into another table.
This part is done about 10 times, each resulting table value is XORed with the others, and the final result is written into another array.
Something like this:
result[i] = T0[ GetByte0( a[i1] ) ] ^ T1[ GetByte1( a[i2] ) ] ^ ... ^ T10[ (...) ];
In my approach I load the whole array "a" into NEON registers, then move the right byte into an ARM register, calculate the offset, and load the value from the table:
vldm.64 r0, {d0-d7}        // load 8x 64-bit values from the input array
vmov.u8 r12, d0[0]         // move the first byte of d0 into r12
add  r12, r2, r12, asl #3  // r12 = base_address + (r12 << 3)
vldr.64 d8, [r12]          // d8 = mem[r12]
...
veor d8, d8, d9            // d8 = d8 ^ d9
veor d8, d8, d10           // d8 = d8 ^ d10, etc.
Where r2 holds the base address of the lookup table:
address = Table_address + (8 * value_fromByte);
This step (except the loading at the beginning) is done about 100 times. Why is this so slow?
Also, what are the differences between "vld", "vldr" and "vldm", and which one is the fastest?
How can I perform the offset calculation within NEON registers only?
Thank you.
4 Answers
NEON isn't well suited to lookups larger than the VTBL instruction's limit (32 bytes, if I remember correctly).
How is the lookup table created in the first place? If it's just calculations, let NEON do the math instead of resorting to lookups.
It will be much faster that way.
Don't use
Moving data from a NEON register to an ARM register is the worst thing you can do.
Maybe you should look at the VTBL instruction!
What is your byte range, 0..255?
Maybe you can try
That will not be the best solution. After that, you can increase performance by reordering instructions.
Try it with NEON "intrinsics". Basically they're C functions that compile down to NEON instructions. The compiler still gets to do all the instruction scheduling, and you get the other boring stuff (moving data about) for free.
It doesn't always work perfectly, but it might be better than trying to hand-code it.
Look for arm_neon.h.