How to merge elements of 2 rows using NEON SIMD?
I have a matrix
A = a1 a2 a3 a4
    b1 b2 b3 b4
    c1 c2 c3 c4
    d1 d2 d3 d4
I have two rows available:
float32x2_t a = a1 a2
float32x2_t b = b1 b2
From these, how can I get:
float32x4_t result = b1 a1 b2 a2
Is there a single NEON SIMD instruction that can merge these two rows?
If not, how can I achieve this with intrinsics in as few steps as possible?
I thought of using the zip/unzip intrinsics, but the datatype the zip function returns, float32x2x2_t, is not suitable for me; I need a float32x4_t.
float32x2x2_t vzip_f32 (float32x2_t, float32x2_t)
Comments (1)
This is difficult. There is no single instruction that can do this, and the best solution depends on whether your data is in memory or already in registers.
You need at least two operations to do the conversion. The first is a vector transpose (vtrn.32), which permutes a = a1 a2 and b = b1 b2 into a = a1 b1 and b = a2 b2.
Then you have to swap the two elements within each vector, either by reversing each vector on its own or by treating the two vectors as one quad vector and reversing within each 64-bit half (vrev64.32).
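The two register-only steps can be sketched with intrinsics as below. This is a minimal illustration, not the answer's original listing: the helper name `merge_rows_trn_rev` is hypothetical, and the scalar branch is an added assumption so the file also builds on machines without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: given a = {a1, a2} and b = {b1, b2},
   writes {b1, a1, b2, a2} to out. */
static void merge_rows_trn_rev(const float a[2], const float b[2], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x2_t va = vld1_f32(a);
    float32x2_t vb = vld1_f32(b);
    /* Step 1: transpose. val[0] = {a1, b1}, val[1] = {a2, b2}. */
    float32x2x2_t t = vtrn_f32(va, vb);
    /* View the pair as one quad vector: {a1, b1, a2, b2}. */
    float32x4_t q = vcombine_f32(t.val[0], t.val[1]);
    /* Step 2: swap the two floats inside each 64-bit half. */
    vst1q_f32(out, vrev64q_f32(q));
#else
    /* Scalar model of the same two steps for non-NEON builds. */
    float t[4] = { a[0], b[0], a[1], b[1] };  /* after the transpose */
    out[0] = t[1]; out[1] = t[0];             /* reverse the low half  */
    out[2] = t[3]; out[3] = t[2];             /* reverse the high half */
#endif
}
```

Whether the compiler emits exactly one vtrn and one vrev64 for this depends on the toolchain and optimisation level, so it is worth checking the generated assembly.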
If you load your data from memory you can skip the first vtrn.32 instruction, because NEON can do the transpose while it loads the data with the vld2.32 instruction. A short assembler routine for this needs only the vld2.32 load followed by the reverse.
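The load-time variant might look like this with intrinsics instead of assembler. It is a sketch under assumptions: the two rows are taken to be contiguous in memory as a1 a2 b1 b2, the name `merge_rows_from_memory` is made up for illustration, and the scalar branch exists only so the code compiles without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: rows holds {a1, a2, b1, b2} contiguously;
   writes {b1, a1, b2, a2} to out. */
static void merge_rows_from_memory(const float rows[4], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* vld2 de-interleaves while loading:
       val[0] = {a1, b1}, val[1] = {a2, b2}. */
    float32x2x2_t t = vld2_f32(rows);
    float32x4_t q = vcombine_f32(t.val[0], t.val[1]);
    /* Reverse the two floats inside each 64-bit half. */
    vst1q_f32(out, vrev64q_f32(q));
#else
    /* Scalar equivalent of the load plus reverse. */
    out[0] = rows[2]; out[1] = rows[0];
    out[2] = rows[3]; out[3] = rows[1];
#endif
}
```

The de-interleaving load replaces the explicit transpose, so only the reverse remains as a separate permute.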
Btw, a little note: the instructions vtrn.32, vzip.32 and vuzp.32 produce identical results, but only when you're working with 32-bit elements in double (64-bit) vectors, which hold just two elements each.
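To see why the equivalence holds only for two-lane vectors, here is a plain C model of the three permutes. This is not NEON code: the functions `model_trn`, `model_zip`, `model_uzp` and `permutes_agree` are hypothetical names, and the lane orderings are written from my understanding of the instruction semantics.

```c
#include <string.h>

/* Scalar models of transpose, zip and unzip on n-lane vectors (n even).
   Each writes 2*n floats: the first result vector, then the second. */
static void model_trn(const float *a, const float *b, int n, float *out)
{
    for (int i = 0; i < n; i += 2) {
        out[i]     = a[i];     out[i + 1]     = b[i];      /* even lanes */
        out[n + i] = a[i + 1]; out[n + i + 1] = b[i + 1];  /* odd lanes  */
    }
}

static void model_zip(const float *a, const float *b, int n, float *out)
{
    for (int i = 0; i < n; i++) {  /* interleave a and b */
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

static void model_uzp(const float *a, const float *b, int n, float *out)
{
    for (int i = 0; i < n / 2; i++) {
        out[i]             = a[2 * i];      /* even lanes of a */
        out[n / 2 + i]     = b[2 * i];      /* even lanes of b */
        out[n + i]         = a[2 * i + 1];  /* odd lanes of a  */
        out[n + n / 2 + i] = b[2 * i + 1];  /* odd lanes of b  */
    }
}

/* Returns 1 when all three models give the same result for n lanes
   (n must be 2 or 4 in this sketch). */
static int permutes_agree(int n)
{
    float a[4], b[4], t[8], z[8], u[8];
    for (int i = 0; i < n; i++) {
        a[i] = (float)(i + 1);
        b[i] = (float)(i + 101);
    }
    model_trn(a, b, n, t);
    model_zip(a, b, n, z);
    model_uzp(a, b, n, u);
    size_t bytes = 2 * (size_t)n * sizeof(float);
    return memcmp(t, z, bytes) == 0 && memcmp(t, u, bytes) == 0;
}
```

In this model `permutes_agree(2)` is true and `permutes_agree(4)` is false, which matches the restriction to two-element vectors.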
And with NEON intrinsics? Well, simply put, you're screwed. As you've already found out, you can't directly cast from one vector type to another, and you can't directly mix quad and double vectors.
The best I came up with using intrinsics does not use the vld2.32 trick, for readability.
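One intrinsics route that fits this description is to zip with the arguments swapped and then join the two halves with `vcombine_f32`, which yields the float32x4_t the question asks for. The sketch below is an illustration rather than the answer's original code; the helper name `merge_rows_zip` and the scalar fallback are assumptions added here.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: given a = {a1, a2} and b = {b1, b2},
   writes {b1, a1, b2, a2} to out. */
static void merge_rows_zip(const float a[2], const float b[2], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x2_t va = vld1_f32(a);
    float32x2_t vb = vld1_f32(b);
    /* Zip with b first: val[0] = {b1, a1}, val[1] = {b2, a2}. */
    float32x2x2_t z = vzip_f32(vb, va);
    /* Join the two double vectors into one quad vector. */
    float32x4_t result = vcombine_f32(z.val[0], z.val[1]);
    vst1q_f32(out, result);
#else
    /* Scalar equivalent for builds without NEON. */
    out[0] = b[0]; out[1] = a[0];
    out[2] = b[1]; out[3] = a[1];
#endif
}
```

Passing `b` before `a` to the zip puts the elements in the requested order directly, so no separate reverse is needed.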
If you're using GCC this will work, but the code generated by GCC will be horrible and slow. NEON intrinsic support is still very young. You'll probably get better performance with straightforward C code here.