Summing all elements of a quadword vector in ARM assembly with NEON

Posted 2024-11-27 15:34:24

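I'm rather new to assembly, and although the ARM Information Center is often helpful, sometimes the instruction descriptions can be a little confusing to a newbie. Basically, what I need to do is sum the 4 float values in a quadword register and store the result in a single-precision register. I think the VPADD instruction can do what I need, but I'm not quite sure.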

命比纸薄 2024-12-04 15:34:24

You might try this (it's not in ASM, but you should be able to convert it easily):

float32x2_t r = vadd_f32(vget_high_f32(m_type), vget_low_f32(m_type));
return vget_lane_f32(vpadd_f32(r, r), 0);

In ASM it would probably be just a VADD and a VPADD.

I'm not sure whether this is the only way to do it (or the most optimal one), but I haven't figured out or found a better one...

P.S. I'm new to NEON too.
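For reference, the two intrinsic lines above could be wrapped into a self-contained function roughly as follows. This is only a sketch: the name `hsum4`, the choice to load the vector from a `float` array, and the scalar fallback for non-NEON targets are assumptions added for illustration, not part of the original answer.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: sum the four floats at v[0..3] using NEON. */
static float hsum4(const float *v)
{
    assert(v != NULL);
    float32x4_t q = vld1q_f32(v);
    /* Add the high half to the low half: r = {v2 + v0, v3 + v1} */
    float32x2_t r = vadd_f32(vget_high_f32(q), vget_low_f32(q));
    /* Pairwise add leaves the full sum in both lanes; return lane 0 */
    return vget_lane_f32(vpadd_f32(r, r), 0);
}
#else
/* Scalar fallback (assumption: only here so the sketch builds without NEON).
   It performs the additions in the same order as the NEON version. */
static float hsum4(const float *v)
{
    assert(v != NULL);
    float r0 = v[2] + v[0];
    float r1 = v[3] + v[1];
    return r0 + r1;
}
#endif
```

Calling `hsum4` on an array holding 1, 2, 3 and 4 should give 10.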

悲歌长辞 2024-12-04 15:34:24

It seems that you want to sum an array of some length, not just four float values.

In that case, your code will work, but it is far from optimized:

many pipeline interlocks
an unnecessary 32-bit addition per iteration

Assuming the length of the array is a multiple of 8 and at least 16:

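  vldmia pSrc!, {q0-q1}     @ first 8 floats initialize the two accumulators
  sub count, count, #8
loop:
  pld [pSrc, #32]
  vldmia pSrc!, {q2-q3}     @ next 8 floats (q2-q3 are scratch; q4 is callee-saved)
  subs count, count, #8
  vadd.f32 q0, q0, q2
  vadd.f32 q1, q1, q3
  bgt loop

  vadd.f32 q0, q0, q1       @ merge the two accumulators
  vpadd.f32 d0, d0, d1      @ d0 = {s0+s1, s2+s3}
  vadd.f32 s0, s0, s1       @ final sum in s0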

pld, although it is an ARM instruction rather than a NEON one, is crucial for performance: it preloads data into the cache ahead of use, which drastically increases the cache hit rate.

I hope the rest of the code above is self-explanatory.

You will notice that this version is many times faster than your initial one.
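The same structure (two independent quadword accumulators, reduced once at the end) might look like this with C intrinsics. Treat it as a sketch under assumptions: the name `sum_array`, the `assert` on the length precondition, and the scalar fallback for non-NEON builds are illustrative additions, and the explicit `pld` preload is left out here.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: sum count floats, count a multiple of 8 and >= 16. */
static float sum_array(const float *src, size_t count)
{
    assert(src != NULL && count >= 16 && count % 8 == 0);

    /* The first 8 floats initialize the two accumulators (q0, q1 in the asm) */
    float32x4_t acc0 = vld1q_f32(src);
    float32x4_t acc1 = vld1q_f32(src + 4);

    /* Two independent add chains per iteration, 8 floats at a time */
    for (size_t i = 8; i < count; i += 8) {
        acc0 = vaddq_f32(acc0, vld1q_f32(src + i));
        acc1 = vaddq_f32(acc1, vld1q_f32(src + i + 4));
    }

    /* Final reduction, done once: merge, pairwise add, then add the two lanes */
    float32x4_t q = vaddq_f32(acc0, acc1);
    float32x2_t d = vpadd_f32(vget_low_f32(q), vget_high_f32(q));
    return vget_lane_f32(d, 0) + vget_lane_f32(d, 1);
}
#else
/* Scalar fallback (assumption: only here so the sketch builds without NEON).
   acc[0..3] plays the role of q0 and acc[4..7] the role of q1. */
static float sum_array(const float *src, size_t count)
{
    assert(src != NULL && count >= 16 && count % 8 == 0);

    float acc[8];
    for (size_t k = 0; k < 8; ++k)
        acc[k] = src[k];
    for (size_t i = 8; i < count; i += 8)
        for (size_t k = 0; k < 8; ++k)
            acc[k] += src[i + k];
    for (size_t k = 0; k < 4; ++k)
        acc[k] += acc[k + 4];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
#endif
```

Because the lanes are accumulated separately and merged only at the end, the additions happen in a different order than in a simple left-to-right scalar loop, so the last bits of the result can differ for inputs that are not exactly representable.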

别在捏我脸啦 2024-12-04 15:34:24

Here is the code in ASM:

    vpadd.f32 d1, d6, d7    @ q3 (d6:d7) holds the values to sum; d1 (s2, s3) receives the pairwise sums
    vadd.f32 s1, s2, s3     @ add the two halves of d1 to get the sum of q3
    vadd.f32 s0, s0, s1     @ sum += s1

I may have forgotten to mention that in C the code would look like this:

float sum = 1.0f;
sum += number1 * number2;

I have omitted the multiplication from this little piece of asm code.
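Putting the omitted multiplication back, the whole step could be sketched in C as below. This rests on an assumption the post does not state: that `number1` and `number2` are four-element float vectors whose element-wise products are summed into `sum`. The name `mul_accumulate` and the scalar fallback are likewise illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: return sum + (a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]). */
static float mul_accumulate(float sum, const float *a, const float *b)
{
    assert(a != NULL && b != NULL);
    /* The multiplication that was left out of the asm snippet */
    float32x4_t p = vmulq_f32(vld1q_f32(a), vld1q_f32(b));
    /* Corresponds to vpadd.f32 d1, d6, d7 */
    float32x2_t d = vpadd_f32(vget_low_f32(p), vget_high_f32(p));
    /* Corresponds to vadd.f32 s1, s2, s3 */
    float s = vget_lane_f32(d, 0) + vget_lane_f32(d, 1);
    /* Corresponds to vadd.f32 s0, s0, s1 */
    return sum + s;
}
#else
/* Scalar fallback (assumption: only here so the sketch builds without NEON). */
static float mul_accumulate(float sum, const float *a, const float *b)
{
    assert(a != NULL && b != NULL);
    float p[4];
    for (size_t k = 0; k < 4; ++k)
        p[k] = a[k] * b[k];
    float s = (p[0] + p[1]) + (p[2] + p[3]);
    return sum + s;
}
#endif
```

With `sum` starting at 1.0f as in the C fragment above, inputs {1, 2, 3, 4} and {5, 6, 7, 8} should give 71.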
