ARM NEON: What is the difference between vld4_f32 and vld4q_f32?

Posted on 2024-09-25 01:29:06

I can't make out the difference between vld4_f32 and vld4q_f32 in the ARM NEON instruction set.

The confusion started when I stepped up my coding and began looking at the assembly instructions rather than the less informative intrinsics.

I need a vld4 variant here because I want to capture 4 float32_t values, one from every 4th position of my large array.

[Figure not shown]

The vld4_f32 intrinsic and its corresponding assembly instruction look like this (from this link):

float32x2x4_t vld4_f32 (const float32_t *) 
Form of expected instruction(s): vld4.32 {d0, d1, d2, d3}, [r0]

The vld4q_f32 intrinsic and its corresponding assembly instruction look like this:

float32x4x4_t vld4q_f32 (const float32_t *) 
Form of expected instruction(s): vld4.32 {d0, d1, d2, d3}, [r0]

At the intrinsics level, the difference I see is the return type, but the assembly instruction and the number of registers look the same for both. How does the compiler or the assembler know the difference between the two?

Can somebody clarify this, and also explain how I can load 4 float32_t values that sit at every 4th memory location into a single register?


椵侞 2024-10-02 01:29:06

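Yes, I found the difference. I used CodeSourcery to inspect the actual register contents for each load instruction. The link I posted does not give the complete details for vld4q_f32.

First, vld4_f32: it loads four d registers (e.g. d16-d19). Each d register is 64 bits wide, so the instruction reads the first 8 values and de-interleaves them with a stride of 4. Each d register ends up holding two values that are four elements apart, as shown in the figure below.
[Figure not shown]

In the second case, vld4q_f32 loads eight d registers (e.g. d16-d23) instead of four. The documentation at the link does not make it clear at all that eight registers are loaded. When I looked at the disassembly for vld4q_f32, it used eight d registers.

This instruction does what I wanted: it loads 4 float32_t values that are spaced four elements apart into a single q register, as shown in the figure below.
[Figure not shown]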
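
For readers who want to try the stride-4 load from C, here is a minimal sketch. The helper name load_stride4 and the out[4][4] layout are assumptions made for illustration; they are not from the original post. On a NEON target it calls vld4q_f32 and stores each of the four result vectors with vst1q_f32; on any other target it falls back to a scalar loop that should produce the same layout, so the mapping can be checked without ARM hardware.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name and signature are illustrative only).
 * Reads 16 consecutive floats from src and writes them de-interleaved
 * with a stride of 4, so that out[k][i] == src[4 * i + k]. */
static void load_stride4(const float *src, float out[4][4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* vld4q_f32 fills four 128-bit vectors; v.val[k] holds
     * src[k], src[k + 4], src[k + 8], src[k + 12]. */
    float32x4x4_t v = vld4q_f32(src);
    for (int k = 0; k < 4; ++k) {
        vst1q_f32(out[k], v.val[k]);
    }
#else
    /* Scalar fallback that mirrors the same de-interleaving. */
    for (int k = 0; k < 4; ++k) {
        for (int i = 0; i < 4; ++i) {
            out[k][i] = src[4 * i + k];
        }
    }
#endif
}
```

With src holding 0, 1, ..., 15, row out[0] should contain elements 0, 4, 8, 12, which is the "every 4th position" pattern from the question. If only that one vector is needed, v.val[0] can be used directly and the other three ignored, at the cost of loading data that is then discarded.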


失与倦＂ 2024-10-02 01:29:06

I have disassembled the two intrinsics; maybe it helps someone:

// C++
uint32x4x4_t r = vld4q_u32( ( uint32_t *) output );
// assembly
VLD4.32         {D16,D18,D20,D22}, [R0]!
VLD4.32         {D17,D19,D21,D23}, [R0]

// C++
uint32x2x4_t r = vld4_u32( ( uint32_t *) output );
// assembly
VLD4.32         {D20-D23}, [R0]
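
To see why the q variant needs two VLD4.32 instructions with even and odd d registers, the sequence above can be modelled in plain C. This is a toy model under stated assumptions: the array d[8][2] stands in for D16-D23, and the function names vld4_32_spaced and vld4q_model are made up for this sketch. It relies on the fact that each q register aliases a pair of adjacent d registers (for example Q8 is D16:D17).

```c
#include <stdint.h>

/* Toy model, not real hardware state: d[n][lane] stands for
 * lane `lane` of the 64-bit register D(16 + n). */

/* Models "VLD4.32 {Dx, Dx+2, Dx+4, Dx+6}, [p]!": reads two 4-element
 * structures (8 words) and de-interleaves them into four d registers
 * spaced two apart, starting at index `first`. Returns the pointer
 * advanced by 8 words, as the "!" writeback form would leave it. */
static const uint32_t *vld4_32_spaced(const uint32_t *p,
                                      uint32_t d[8][2], int first)
{
    for (int lane = 0; lane < 2; ++lane) {   /* which structure */
        for (int k = 0; k < 4; ++k) {        /* element inside it */
            d[first + 2 * k][lane] = p[4 * lane + k];
        }
    }
    return p + 8;
}

/* Models the two-instruction sequence shown for vld4q_u32. */
static void vld4q_model(const uint32_t *p, uint32_t d[8][2])
{
    p = vld4_32_spaced(p, d, 0);    /* D16, D18, D20, D22 */
    (void)vld4_32_spaced(p, d, 1);  /* D17, D19, D21, D23 */
}
```

In this model, with memory holding 0, 1, ..., 15, D16 ends up as {0, 4} and D17 as {8, 12}, so the pair D16:D17 (Q8) holds 0, 4, 8, 12. The single-instruction vld4_u32 form corresponds to only the first step, with consecutive d registers instead of spaced ones.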
