ARM NEON: What is the difference between vld4_f32 and vld4q_f32?
I can't make out the difference between vld4_f32 and vld4q_f32 in ARM NEON instructions.
The confusion started when I began looking at the generated assembly instructions rather than only at the less informative intrinsics.
I need a vld4 variant here because I want to pick up 4 float32_t values, one from every 4th position of my large array.
The vld4_f32 intrinsic and its corresponding assembly instruction look like this (from this link):
float32x2x4_t vld4_f32 (const float32_t *)
Form of expected instruction(s): vld4.32 {d0, d1, d2, d3}, [r0]
The vld4q_f32 intrinsic and its corresponding assembly instruction look like this:
float32x4x4_t vld4q_f32 (const float32_t *)
Form of expected instruction(s): vld4.32 {d0, d1, d2, d3}, [r0]
At the intrinsics level, the only difference I see is the return type, but the assembly instruction and the number of registers look the same for both. How does the compiler or the assembler know the difference between the two?
Can somebody clarify this, and also explain how I can load 4 float32_t values that sit at every 4th memory location into a single register?
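For reference, the access pattern being asked for can be written as a plain scalar loop. This is only a sketch of the intended result, not NEON code: the name gather_stride4 and its out parameter are made up for illustration, and float stands in for float32_t.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of any NEON API): copies src[0], src[4],
 * src[8] and src[12] into out[0..3]. These are the four values that the
 * question wants to end up together in one 128-bit register. */
void gather_stride4(const float *src, float out[4])
{
    assert(src != NULL && out != NULL);
    for (size_t lane = 0; lane < 4; ++lane) {
        out[lane] = src[4 * lane];   /* every 4th element */
    }
}
```

Note that collecting four such values means reading across 13 consecutive floats of the array, which is more than a single 4 x 64-bit load covers.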
Comments (2)
Yes, I found out the difference. I used CodeSourcery to look at the actual register contents for all the load instructions. The link I posted doesn't give the complete details for vld4q_f32.
First, vld4_f32: this loads 4 d registers (e.g. d16-d19). Each d register is 64 bits long and holds two float32_t values, so this instruction loads the first 8 values, de-interleaved with a stride of 4, as shown in the figure below.
In the second case, vld4q_f32 loads 8 d registers (e.g. d16-d23) instead of four. A reader of that link has no way to tell that 8 registers will be loaded. When I looked at the disassembled code for vld4q_f32, it was using 8 d registers. This instruction does what I was hoping for, i.e. it loads 4 float32_t values that are 4 positions apart, as shown in the figure below.
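The lane layout described above can be checked with a small C sketch. The names model_f32x2x4, model_f32x4x4, model_vld4_f32 and model_vld4q_f32 are hypothetical and exist only for this illustration. When NEON is available, the functions call the real intrinsics and store the result with vst1_f32 / vst1q_f32; otherwise they fall back to a scalar loop that assumes the de-interleaving behaviour described in this answer.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define MODEL_HAVE_NEON 1
#else
#define MODEL_HAVE_NEON 0
#endif

/* Layout returned by vld4_f32: 4 d registers x 2 lanes = 8 floats read. */
typedef struct { float val[4][2]; } model_f32x2x4;
/* Layout returned by vld4q_f32: 4 q registers x 4 lanes = 16 floats read. */
typedef struct { float val[4][4]; } model_f32x4x4;

model_f32x2x4 model_vld4_f32(const float *src)
{
    model_f32x2x4 r;
    assert(src != NULL);
#if MODEL_HAVE_NEON
    float32x2x4_t v = vld4_f32(src);
    for (int reg = 0; reg < 4; ++reg)
        vst1_f32(r.val[reg], v.val[reg]);
#else
    /* Scalar fallback: register `reg` receives src[reg], src[reg + 4]. */
    for (int reg = 0; reg < 4; ++reg)
        for (int lane = 0; lane < 2; ++lane)
            r.val[reg][lane] = src[4 * lane + reg];
#endif
    return r;
}

model_f32x4x4 model_vld4q_f32(const float *src)
{
    model_f32x4x4 r;
    assert(src != NULL);
#if MODEL_HAVE_NEON
    float32x4x4_t v = vld4q_f32(src);
    for (int reg = 0; reg < 4; ++reg)
        vst1q_f32(r.val[reg], v.val[reg]);
#else
    /* Scalar fallback: register `reg` receives src[reg], src[reg + 4],
     * src[reg + 8], src[reg + 12]. */
    for (int reg = 0; reg < 4; ++reg)
        for (int lane = 0; lane < 4; ++lane)
            r.val[reg][lane] = src[4 * lane + reg];
#endif
    return r;
}
```

With an input array holding 0, 1, 2, ..., 15, val[0] of the q variant holds 0, 4, 8, 12, which is the "every 4th value in a single register" result. The compiler tells the two intrinsics apart by their return types: float32x4x4_t is made of four 128-bit q registers, and on 32-bit ARM each q register is a pair of d registers, which is where the 8 d registers in the disassembly come from.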
I have disassembled the two intrinsics, in case it helps someone: