ARM assembler NEON - Improving performance
I have converted part of an algorithm from C to ARM assembly (using NEON instructions), but now it is 2x slower than the original C code. How can I improve performance?
The target is an ARM Cortex-A9.
The algorithm reads 64-bit values from an array. From each value one byte is extracted, which is then used as the lookup index into another table.
This part is done about 10 times, each resulting table value is XORed with the others, and the final result is written into another array.
Something like this:
result[i] = T0[ GetByte0( a[i1] ) ] ^ T1[ GetByte1( a[i2] ) ] ^ ... ^ T10[ (...) ];
In my approach I load the whole array "a" into NEON registers, then move the right byte into an ARM register, calculate the offset, and load the value from the table:
vldm.64 r0, {d0-d7}        // load 8x 64-bit values from the input array
vmov.u8 r12, d0[0]         // move the first byte of d0 into r12
add  r12, r2, r12, asl #3  // r12 = base_address + (r12 << 3)
vldr.64 d8, [r12]          // d8 = mem[r12]
...
veor d8, d8, d9            // d8 = d8 ^ d9
veor d8, d8, d10           // d8 = d8 ^ d10, etc.
Where r2 holds the base address of the lookup table:
address = Table_address + (8 * value_fromByte);
This step (except the loading at the beginning) is done about 100 times. Why is this so slow?
Also, what are the differences between "vld", "vldr" and "vldm", and which one is the fastest?
How can I perform the offset calculation within NEON registers only?
Thank you.
4 Answers
NEON isn't well suited to lookups larger than the VTBL instruction's limit (32 bytes, if I remember correctly).
How is the lookup table created in the first place? If it's just calculations, let NEON do the math instead of resorting to lookups.
It will be much faster that way.
Don't use
Moving data from a NEON register to an ARM register is the worst thing you can do.
Maybe you should look at the VTBL instruction!
What is your byte range, 0..255?
Maybe you can try
That will not be the best solution. After that, you can increase performance by reordering instructions.
Try it with NEON "intrinsics". Basically they're C functions that compile down to NEON instructions. The compiler still gets to do all the instruction scheduling, and you get the other boring stuff (moving data about) for free.
It doesn't always work perfectly, but it might be better than trying to hand-code it.
Look for arm_neon.h.