How do I merge elements from 2 rows using NEON SIMD?

Posted on 2024-09-11 05:58:51


I have a matrix:

A = a1 a2 a3 a4
    b1 b2 b3 b4
    c1 c2 c3 c4
    d1 d2 d3 d4

I have two of its rows (the first two elements of each):

float32x2_t a = a1 a2
float32x2_t b = b1 b2

From these, how can I get:

float32x4_t result = b1 a1 b2 a2

Is there a single NEON SIMD instruction that can merge these two rows?
Or how can I achieve this with intrinsics in as few steps as possible?

I thought of using the zip/unzip intrinsics, but the type the zip function returns, float32x2x2_t, is not suitable for me; I need a float32x4_t.

float32x2x2_t vzip_f32 (float32x2_t, float32x2_t)


司马昭之心 2024-09-18 05:58:51


This is difficult. No single instruction can do this, and the best solution depends on whether your data is in memory or already in registers.

You need at least two operations to do the conversion. First, a vector transpose (vtrn), which permutes your arguments like this:

a = a1 a2
b = b1 b2

vtrn.32  a, b

a = a1 b1 
b = a2 b2

Then you have to swap the two elements within each vector, either by reversing each vector on its own or by treating the two vectors as one quad vector and reversing within each 64-bit half:

temp = {a, b} 
temp = a1 b1 a2 b2

vrev64.32 temp, temp

temp = b1 a1 b2 a2    <-- this is what you want.
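
The two steps above can be sketched with intrinsics. This is only a minimal sketch, not a tuned implementation: the function name `merge_rows_trn_rev` is hypothetical, and the scalar branch is an assumption added so that the snippet also builds on targets without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: given a = {a1, a2} and b = {b1, b2},
 * writes {b1, a1, b2, a2} to out[0..3]. */
void merge_rows_trn_rev(const float a[2], const float b[2], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x2_t va = vld1_f32(a);
    float32x2_t vb = vld1_f32(b);

    /* Step 1: transpose -> val[0] = {a1, b1}, val[1] = {a2, b2} */
    float32x2x2_t t = vtrn_f32(va, vb);

    /* View both halves as one quad vector: {a1, b1, a2, b2} */
    float32x4_t q = vcombine_f32(t.val[0], t.val[1]);

    /* Step 2: swap the two 32-bit lanes inside each 64-bit half */
    q = vrev64q_f32(q);

    vst1q_f32(out, q);
#else
    /* Scalar model of the same two steps (assumed fallback, no NEON). */
    float t[4] = { a[0], b[0], a[1], b[1] };   /* after the transpose */
    out[0] = t[1]; out[1] = t[0];              /* reverse low half    */
    out[2] = t[3]; out[3] = t[2];              /* reverse high half   */
#endif
}
```

Calling it with a = {1, 2} and b = {3, 4} should leave 3 1 4 2 in `out`.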

If you load your data from memory, you can skip the vtrn.32 instruction, because NEON can perform the same permutation while loading the data with the vld2.32 instruction. Here is a small assembly function that does just that (it expects a and b stored contiguously at the address in r0):

.globl asmtest

asmtest:
        vld2.32     {d0-d1}, [r0]   @ load two vectors and transpose (de-interleave)
        vrev64.32   q0, q0          @ swap the 32-bit elements within d0 and d1
        vst1.32     {d0-d1}, [r0]   @ store result
        mov pc, lr                  @ return from subroutine
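
For comparison, the same load-and-reverse idea might look like this with intrinsics. It is a sketch under assumptions: `merge_rows_in_place` is a hypothetical name, the four floats are assumed to sit contiguously in memory as a1 a2 b1 b2, and the scalar branch exists only so the snippet builds without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: rows[0..3] holds {a1, a2, b1, b2} on entry
 * and {b1, a1, b2, a2} on return. */
void merge_rows_in_place(float rows[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* De-interleaving load: val[0] = {a1, b1}, val[1] = {a2, b2} */
    float32x2x2_t t = vld2_f32(rows);

    /* Join the halves, then swap the lanes inside each 64-bit half */
    float32x4_t q = vrev64q_f32(vcombine_f32(t.val[0], t.val[1]));

    vst1q_f32(rows, q);
#else
    /* Scalar model of the same permutation (assumed fallback, no NEON). */
    float a1 = rows[0], a2 = rows[1], b1 = rows[2], b2 = rows[3];
    rows[0] = b1; rows[1] = a1; rows[2] = b2; rows[3] = a2;
#endif
}
```

Whether the compiler emits the same three instructions as the assembly version depends on the toolchain and optimization level, so it is worth checking the generated code.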

By the way, a small note: the instructions vtrn.32, vzip.32 and vuzp.32 are identical, but only when they operate on 32-bit elements in 64-bit D registers (two elements per vector).
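
To see why the three permutes coincide for two-element vectors but not for wider ones, here is a small scalar model. It does not use NEON at all; the function names and the `MAX_LANES` limit are hypothetical, and the models simply follow the usual definitions of transpose, interleave and de-interleave.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LANES 4

/* Scalar models of the permutes on n-lane vectors (n even, n <= MAX_LANES).
 * lo and hi receive the two result vectors. */
static void model_trn(int n, const int *a, const int *b, int *lo, int *hi)
{
    for (int i = 0; i < n; i += 2) {
        lo[i] = a[i];     lo[i + 1] = b[i];
        hi[i] = a[i + 1]; hi[i + 1] = b[i + 1];
    }
}

static void model_zip(int n, const int *a, const int *b, int *lo, int *hi)
{
    int z[2 * MAX_LANES];
    for (int i = 0; i < n; i++) {       /* interleave a and b */
        z[2 * i] = a[i];
        z[2 * i + 1] = b[i];
    }
    memcpy(lo, z, (size_t)n * sizeof(int));
    memcpy(hi, z + n, (size_t)n * sizeof(int));
}

static void model_uzp(int n, const int *a, const int *b, int *lo, int *hi)
{
    int c[2 * MAX_LANES];
    memcpy(c, a, (size_t)n * sizeof(int));
    memcpy(c + n, b, (size_t)n * sizeof(int));
    for (int i = 0; i < n; i++) {       /* split into even and odd lanes */
        lo[i] = c[2 * i];
        hi[i] = c[2 * i + 1];
    }
}

/* Returns 1 if all three models give the same result for n lanes, else 0. */
int permutes_agree(int n)
{
    int a[MAX_LANES], b[MAX_LANES];
    int r[3][2][MAX_LANES];

    for (int i = 0; i < n; i++) { a[i] = i; b[i] = n + i; }

    model_trn(n, a, b, r[0][0], r[0][1]);
    model_zip(n, a, b, r[1][0], r[1][1]);
    model_uzp(n, a, b, r[2][0], r[2][1]);

    for (int k = 1; k < 3; k++)
        for (int h = 0; h < 2; h++)
            if (memcmp(r[0][h], r[k][h], (size_t)n * sizeof(int)) != 0)
                return 0;
    return 1;
}
```

In this model `permutes_agree(2)` returns 1 and `permutes_agree(4)` returns 0, which matches the note: the equivalence holds for two 32-bit lanes, not for four.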

And with NEON intrinsics? Well, simply put, you're screwed. As you've already found out, you can't directly cast from one type to another, and you can't directly mix quad and double vectors.

This is the best I came up with using intrinsics (for readability, it does not use the vld2.32 trick):

#include <stdio.h>
#include <arm_neon.h>

int main (void)
{
  const float32_t data[4] =
  {
    1, 2, 3, 4
  };

  float32_t     output[4];

  /* load test vectors: a = {1, 2}, b = {3, 4} */
  float32x2_t   a = vld1_f32 (data + 0);
  float32x2_t   b = vld1_f32 (data + 2);

  /* interleave b with a, then join the two halves into a float32x4_t */
  float32x2x2_t temp   = vzip_f32 (b,a);
  float32x4_t   result = vcombine_f32 (temp.val[0], temp.val[1]);

  /* store for printing */
  vst1q_f32 (output, result);

  /* print out the original and the merged result */
  printf ("%f %f %f %f\n", data[0],   data[1],   data[2],   data[3]);
  printf ("%f %f %f %f\n", output[0], output[1], output[2], output[3]);

  return 0;
}

If you're using GCC this will work, but the code generated by GCC will be horrible and slow. NEON intrinsic support is still very young. You'll probably get better performance with straightforward C code here.
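
For reference, the straightforward C version mentioned above could be as simple as the following sketch. The name `merge_rows_scalar` is hypothetical, and whether it actually beats the intrinsics version depends on the compiler and target.

```c
/* Hypothetical scalar helper: given a = {a1, a2} and b = {b1, b2},
 * writes {b1, a1, b2, a2} to out[0..3]. */
void merge_rows_scalar(const float a[2], const float b[2], float out[4])
{
    for (int i = 0; i < 2; i++) {
        out[2 * i]     = b[i];   /* even positions come from b */
        out[2 * i + 1] = a[i];   /* odd positions come from a  */
    }
}
```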
