Neon: swap 4 scalars in float32x4
I used the following code to reverse the order of the 4 scalars in a float32x4_t vector:
{1,2,3,4} -> {4,3,2,1}
float32x4_t Vec = {1,2,3,4};
float32x4_t Rev = vrev64q_f32 (Vec); //{2,1,4,3}
float32x2_t High = vget_high_f32 (Rev); //{4,3}
float32x2_t Low = vget_low_f32 (Rev); //{2,1}
float32x4_t Swap = vcombine_f32 (High, Low); //{4,3,2,1}
Can you suggest faster code?
Thank you,
Zvika
Comments (1)
That is possibly as good as it gets.
The reverse engineered code (for aarch64, gcc/clang -O3) would be
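A sketch of what such a two-step reversal can look like, assuming the idiom meant here is vrev64q_f32 followed by vextq_f32 with an offset of 2 lanes. The function name reverse4_rev_ext and the scalar fallback for non-NEON builds are illustrative additions, not part of the original answer:

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: stores in[3], in[2], in[1], in[0] into out.
 * in and out are assumed not to overlap. */
void reverse4_rev_ext(const float in[4], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t v = vld1q_f32(in);   /* {a,b,c,d} */
    v = vrev64q_f32(v);              /* swap lanes within each 64-bit half: {b,a,d,c} */
    v = vextq_f32(v, v, 2);          /* rotate by two lanes: {d,c,b,a} */
    vst1q_f32(out, v);
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    for (size_t i = 0; i < 4; ++i) {
        out[i] = in[3 - i];
    }
#endif
}
```

Note that the second step consumes the result of the first, so the two permutes form a dependency chain.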
On armv7 (gcc 11.2) your original version compiles to
whereas the other, more compact version compiles to
If you prefer the vswp approach (only on armv7), keep your code as is, since there are no intrinsics for swaps.
On armv7 you could also use
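One way to get two independent 64-bit permutes is sketched below, under the assumption that the alternative meant is to reverse each 64-bit half with vrev64_f32 and recombine the halves in swapped order. The name reverse4_halves and the non-NEON fallback are illustrative, not from the original answer:

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: stores in[3], in[2], in[1], in[0] into out.
 * in and out are assumed not to overlap. */
void reverse4_halves(const float in[4], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t v = vld1q_f32(in);                    /* {a,b,c,d} */
    float32x2_t lo = vrev64_f32(vget_high_f32(v));    /* {c,d} -> {d,c} */
    float32x2_t hi = vrev64_f32(vget_low_f32(v));     /* {a,b} -> {b,a} */
    vst1q_f32(out, vcombine_f32(lo, hi));             /* {d,c,b,a} */
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    for (size_t i = 0; i < 4; ++i) {
        out[i] = in[3 - i];
    }
#endif
}
```

Here neither vrev64_f32 call depends on the other, since each reads a different half of the input.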
When inlined, and when the result can be produced in another register, this can compile to just two instructions with no dependency between them. Permutations on Cortex-A7 are typically 1 cycle per 64 bits with a 4-cycle latency, so this could be twice as fast as the other approaches.