Neon: swap the 4 scalars in a float32x4

Posted on 2025-02-06 15:26:30


I used the following code to swap 4 scalars in float32x4_t vector.
{1,2,3,4} -> {4,3,2,1}

float32x4_t Vec = {1,2,3,4};
float32x4_t Rev = vrev64q_f32 (Vec); //{2,1,4,3}
float32x2_t High = vget_high_f32 (Rev); //{4,3}
float32x2_t Low = vget_low_f32 (Rev); //{2,1}
float32x4_t Swap = vcombine_f32 (High, Low); //{4,3,2,1}

Can you suggest faster code?

Thank you,
Zvika


扎心 2025-02-13 15:26:30


That is possibly as good as it gets.

Reverse engineered from the compiler output (for aarch64, gcc/clang -O3), the code would be

vec = vrev64q_f32(vec);
return vextq_f32(vec,vec,2);
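
To see why this pair of intrinsics reverses the vector, it may help to wrap it in a small function and follow the lanes step by step. This is only a sketch: the name `reverse4_ext`, the array-based interface and the scalar fallback used when NEON is not available are assumptions added for illustration, not part of the compiler output discussed here.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: reverse lane order, {a,b,c,d} -> {d,c,b,a}. */
static void reverse4_ext(const float in[4], float out[4])
{
    float32x4_t v = vld1q_f32(in);  /* {a,b,c,d} */
    v = vrev64q_f32(v);             /* {b,a,d,c}: swap lanes inside each 64-bit half */
    v = vextq_f32(v, v, 2);         /* {d,c,b,a}: take lanes 2..5 of v:v, i.e. swap the halves */
    vst1q_f32(out, v);
}
#else
/* Scalar stand-in (an assumption) so the sketch also builds without NEON. */
static void reverse4_ext(const float in[4], float out[4])
{
    const float a = in[0], b = in[1], c = in[2], d = in[3];
    out[0] = d;
    out[1] = c;
    out[2] = b;
    out[3] = a;
}
#endif
```

Called with {1,2,3,4} as input, `reverse4_ext` should fill the output array with {4,3,2,1}.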

On armv7 (gcc 11.2) your original version compiles to

    vrev64.32       q0, q0
    vswp    d0, d1

whereas the more compact version above compiles to

    vrev64.32       q0, q0
    vext.32 q0, q0, q0, #2

If you prefer the vswp approach (only on armv7), keep your code as is, since there are no intrinsics for swaps.

On armv7 you could also use

float32x2_t lo = vrev64_f32(vget_high_f32(vec));
float32x2_t hi = vrev64_f32(vget_low_f32(vec));
return vcombine_f32(lo, hi);

When inlined, and when the result can be produced in a different register, this can compile to just two instructions with no dependency between them. Permutations on Cortex-A7 are typically 1 cycle / 64 bits, with 4 cycle latency, so this could be twice as fast as the other approaches.
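
The reasoning behind "twice as fast" can be read roughly like this, taking the figures above at face value: in the vrev64q + vext (or vswp) versions the second instruction has to wait for the result of the first, so the two latencies add up, while the two 64-bit reversals here each read a different half of the input and can be issued back to back. A minimal sketch of this variant as a callable function follows. The name `reverse4_split`, the array-based interface and the scalar fallback are assumptions for illustration, and whether the compiler really emits two independent instructions depends on inlining and register allocation.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: reverse lane order using two independent 64-bit reversals. */
static void reverse4_split(const float in[4], float out[4])
{
    float32x4_t v = vld1q_f32(in);                   /* {a,b,c,d} */
    float32x2_t lo = vrev64_f32(vget_high_f32(v));   /* {d,c}: new low half  */
    float32x2_t hi = vrev64_f32(vget_low_f32(v));    /* {b,a}: new high half */
    vst1q_f32(out, vcombine_f32(lo, hi));            /* {d,c,b,a} */
}
#else
/* Scalar stand-in (an assumption) so the sketch also builds without NEON. */
static void reverse4_split(const float in[4], float out[4])
{
    const float a = in[0], b = in[1], c = in[2], d = in[3];
    out[0] = d;
    out[1] = c;
    out[2] = b;
    out[3] = a;
}
#endif
```

Neither `lo` nor `hi` depends on the other, which is the property the latency argument relies on. Measuring on the actual target is the only way to confirm the gain.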
