Transposing an 8x8 float matrix using NEON intrinsics
I have a program that needs to run a transpose operation on 8x8 float32 matrices many times. I want to transpose these using NEON SIMD intrinsics. I know that the array will always contain 8x8 float elements. I have a baseline non-intrinsic solution below:
void transpose(float *matrix, float *matrixT) {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            matrixT[i*8+j] = matrix[j*8+i];
        }
    }
}
I also created an intrinsic solution that transposes each 4x4 quadrant of the 8x8 matrix, and swaps the positions of the second and third quadrants. This solution looks like this:
#include <arm_neon.h>

void transpose_4x4(float *matrix, float *matrixT, int store_index) {
    float32x4_t r0, r1, r2, r3, c0, c1, c2, c3;

    r0 = vld1q_f32(matrix);
    r1 = vld1q_f32(matrix + 8);
    r2 = vld1q_f32(matrix + 16);
    r3 = vld1q_f32(matrix + 24);

    c0 = vzip1q_f32(r0, r1);
    c1 = vzip2q_f32(r0, r1);
    c2 = vzip1q_f32(r2, r3);
    c3 = vzip2q_f32(r2, r3);

    r0 = vcombine_f32(vget_low_f32(c0), vget_low_f32(c2));
    r1 = vcombine_f32(vget_high_f32(c0), vget_high_f32(c2));
    r2 = vcombine_f32(vget_low_f32(c1), vget_low_f32(c3));
    r3 = vcombine_f32(vget_high_f32(c1), vget_high_f32(c3));

    vst1q_f32(matrixT + store_index, r0);
    vst1q_f32(matrixT + store_index + 8, r1);
    vst1q_f32(matrixT + store_index + 16, r2);
    vst1q_f32(matrixT + store_index + 24, r3);
}

void transpose(float *matrix, float *matrixT) {
    // Transpose top-left 4x4 quadrant and store the result in the top-left 4x4 quadrant
    transpose_4x4(matrix, matrixT, 0);
    // Transpose top-right 4x4 quadrant and store the result in the bottom-left 4x4 quadrant
    transpose_4x4(matrix + 4, matrixT, 32);
    // Transpose bottom-left 4x4 quadrant and store the result in the top-right 4x4 quadrant
    transpose_4x4(matrix + 32, matrixT, 4);
    // Transpose bottom-right 4x4 quadrant and store the result in the bottom-right 4x4 quadrant
    transpose_4x4(matrix + 36, matrixT, 36);
}
This solution, however, results in slower performance than the baseline non-intrinsic solution. I am struggling to see whether there is a faster solution that can transpose my 8x8 matrix. Any help would be greatly appreciated!
Edit: both solutions are compiled using the -O1 flag.
2 Answers
First off, you shouldn't expect a huge performance boost to start with:
to sum it up, just a little bit of bandwidth saved by vectorizing - that's all
As for the 4x4 transpose, you don't even need a separate function, but just a macro:
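For illustration, something along these lines would do (the macro name and the vst1q_f32_x4 store are my own guesses; it assumes a contiguous, row-major 4x4 block):

#include <arm_neon.h>

// vld4q_f32 de-interleaves by 4 as it loads, which for a contiguous 4x4 block
// is exactly the transpose; vst1q_f32_x4 writes the four lanes back contiguously.
#define TRANSPOSE_4X4(pDst, pSrc) vst1q_f32_x4((pDst), vld4q_f32((pSrc)))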
will do the job, since NEON does the 4x4 transpose on the fly when you load the data with vld4. But you should ask yourself at this point whether your approach - transposing all the matrices prior to the actual computation - is the right one if a 4x4 transpose costs virtually nothing. This step could end up being a pure waste of computation and bandwidth. Optimization shouldn't be limited to the final step; it should be considered from the design phase.
8x8 transpose is a different animal though:
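For illustration, here is roughly how such a routine can look with AArch64 intrinsics - the split into an 8x4-column-block helper and all the names are my own, but the operation count comes out exactly as summarised below:

#include <arm_neon.h>

// Transposes one 8x4 column block of a row-major 8x8 matrix (source rows are
// 8 floats apart) into four 8-wide rows of the destination.
static inline void transpose_8x4_block(const float *src, float *dst) {
    // 8 loads: one 4-float half-row per source row
    float32x4_t r0 = vld1q_f32(src);
    float32x4_t r1 = vld1q_f32(src + 8);
    float32x4_t r2 = vld1q_f32(src + 16);
    float32x4_t r3 = vld1q_f32(src + 24);
    float32x4_t r4 = vld1q_f32(src + 32);
    float32x4_t r5 = vld1q_f32(src + 40);
    float32x4_t r6 = vld1q_f32(src + 48);
    float32x4_t r7 = vld1q_f32(src + 56);

    // first pass: 32-bit trn of adjacent row pairs (8 trn)
    float32x4_t t0 = vtrn1q_f32(r0, r1), t1 = vtrn2q_f32(r0, r1);
    float32x4_t t2 = vtrn1q_f32(r2, r3), t3 = vtrn2q_f32(r2, r3);
    float32x4_t t4 = vtrn1q_f32(r4, r5), t5 = vtrn2q_f32(r4, r5);
    float32x4_t t6 = vtrn1q_f32(r6, r7), t7 = vtrn2q_f32(r6, r7);

    // second pass: 64-bit trn combines the pairs (8 trn), then 8 stores
    float64x2_t d0 = vreinterpretq_f64_f32(t0), d1 = vreinterpretq_f64_f32(t1);
    float64x2_t d2 = vreinterpretq_f64_f32(t2), d3 = vreinterpretq_f64_f32(t3);
    float64x2_t d4 = vreinterpretq_f64_f32(t4), d5 = vreinterpretq_f64_f32(t5);
    float64x2_t d6 = vreinterpretq_f64_f32(t6), d7 = vreinterpretq_f64_f32(t7);

    vst1q_f32(dst,      vreinterpretq_f32_f64(vtrn1q_f64(d0, d2)));  // col 0, rows 0-3
    vst1q_f32(dst + 4,  vreinterpretq_f32_f64(vtrn1q_f64(d4, d6)));  // col 0, rows 4-7
    vst1q_f32(dst + 8,  vreinterpretq_f32_f64(vtrn1q_f64(d1, d3)));  // col 1, rows 0-3
    vst1q_f32(dst + 12, vreinterpretq_f32_f64(vtrn1q_f64(d5, d7)));  // col 1, rows 4-7
    vst1q_f32(dst + 16, vreinterpretq_f32_f64(vtrn2q_f64(d0, d2)));  // col 2, rows 0-3
    vst1q_f32(dst + 20, vreinterpretq_f32_f64(vtrn2q_f64(d4, d6)));  // col 2, rows 4-7
    vst1q_f32(dst + 24, vreinterpretq_f32_f64(vtrn2q_f64(d1, d3)));  // col 3, rows 0-3
    vst1q_f32(dst + 28, vreinterpretq_f32_f64(vtrn2q_f64(d5, d7)));  // col 3, rows 4-7
}

void transpose_8x8_neon(const float *src, float *dst) {
    transpose_8x4_block(src,     dst);       // columns 0-3 -> output rows 0-3
    transpose_8x4_block(src + 4, dst + 32);  // columns 4-7 -> output rows 4-7
}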
It boils down to: 16 load + 32 trn + 16 store vs. 64 load + 64 store
Now we can clearly see it really isn't worth it. The neon routine above might be a little faster, but I doubt it will make a difference in the end.
No, you can't optimize it any further. Nobody can. Just make sure the pointers are 64-byte aligned, test it, and decide for yourself.
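For example, one way to get 64-byte aligned storage for the 8x8 matrices (C11 alignas; any equivalent mechanism such as aligned_alloc works just as well):

#include <stdalign.h>

// 8x8 float = 256 bytes; both buffers start on a 64-byte (cache-line) boundary
alignas(64) static float matrix[64];
alignas(64) static float matrixT[64];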
The hand-optimized assembly version is most probably as short as it can get, but not meaningfully faster than the pure C version below, the one I'd settle with:
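For illustration, a version of this shape (the pDst/pSrc names are taken from the PS below; this is only a sketch, not the original listing):

void transpose_8x8_c(float *pDst, const float *pSrc) {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            // row i of the source becomes column i of the destination
            pDst[j * 8 + i] = pSrc[i * 8 + j];
        }
    }
}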
or
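an unrolled variant in the same spirit, which reads one source row into temporaries and scatters it down a destination column (again just a sketch):

void transpose_8x8_c_unrolled(float *pDst, const float *pSrc) {
    for (int i = 0; i < 8; i++) {
        // load source row i into registers
        float a = pSrc[0], b = pSrc[1], c = pSrc[2], d = pSrc[3];
        float e = pSrc[4], f = pSrc[5], g = pSrc[6], h = pSrc[7];
        pSrc += 8;
        // scatter it down destination column i
        pDst[0]  = a; pDst[8]  = b; pDst[16] = c; pDst[24] = d;
        pDst[32] = e; pDst[40] = f; pDst[48] = g; pDst[56] = h;
        pDst += 1;
    }
}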
PS: It could bring some gain in performance/power consumption if you declared pDst and pSrc as uint32_t *, because the compiler would then generate pure integer machine code, which has the most versatile addressing modes, and use only w registers instead of s ones. Just typecast float * to uint32_t *.
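A sketch of that suggestion (function name assumed; it simply treats the matrix as raw 32-bit words so no FP registers are needed):

#include <stdint.h>

void transpose_8x8_u32(uint32_t *pDst, const uint32_t *pSrc) {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            pDst[j * 8 + i] = pSrc[i * 8 + j];  // plain 32-bit copies, w registers only
        }
    }
}

// called as: transpose_8x8_u32((uint32_t *)matrixT, (const uint32_t *)matrix);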
PS2: Clang already utilizes w registers instead of s ones, while GCC is being GCC.... When will GNU-shills finally admit the fact that GCC is an extremely bad choice for ARM? (godbolt)
PS3: Below is the non-neon version in assembly (zero latency) since I was very disappointed (even shocked) in both Clang and GCC above:
It's arguably the best version you will ever get if you still insist on doing a pure 8x8 transpose. It might be a little slower than the NEON assembly version, but it consumes considerably less power.
It's possible to optimise the 8x8 NEON code presented in the other answer: an 8x8 transpose can be thought of not only as a recursive application of [A B; C D]' == [A' C'; B' D'], but also as a repeated application of zip or unzip. For an 8x8 matrix the algorithm has to be applied 3 times, and by reading the data with vld4 two of those passes are already done.
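For illustration, a sketch of the vld4 route (function name and layout assumptions are mine): the de-interleaving loads take care of two of the three permute passes, and a single uzp pass per register pair finishes the transpose:

#include <arm_neon.h>

void transpose_8x8_vld4(const float *src, float *dst) {
    // each vld4 covers two source rows (16 floats), de-interleaved by 4:
    // a.val[j] = { m[0][j], m[0][j+4], m[1][j], m[1][j+4] }, and so on
    float32x4x4_t a = vld4q_f32(src);       // rows 0,1
    float32x4x4_t b = vld4q_f32(src + 16);  // rows 2,3
    float32x4x4_t c = vld4q_f32(src + 32);  // rows 4,5
    float32x4x4_t d = vld4q_f32(src + 48);  // rows 6,7

    for (int j = 0; j < 4; j++) {
        // uzp1 gathers column j, uzp2 gathers column j+4
        vst1q_f32(dst + j * 8,           vuzp1q_f32(a.val[j], b.val[j]));  // rows 0-3 of column j
        vst1q_f32(dst + j * 8 + 4,       vuzp1q_f32(c.val[j], d.val[j]));  // rows 4-7 of column j
        vst1q_f32(dst + (j + 4) * 8,     vuzp2q_f32(a.val[j], b.val[j]));  // rows 0-3 of column j+4
        vst1q_f32(dst + (j + 4) * 8 + 4, vuzp2q_f32(c.val[j], d.val[j]));  // rows 4-7 of column j+4
    }
}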
One should also be able to perform the transpose by starting with vld1q_f32_x4, then uzpq, and finishing with vst4q_f32.
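A sketch of that last variant too (names mine): with the register pairing chosen here the middle pass comes out as vzip1q/vzip2q rather than uzpq, but it is the same three-pass idea, with vst4q_f32 supplying the final two interleaving steps:

#include <arm_neon.h>

void transpose_8x8_st4(const float *src, float *dst) {
    // 16 plain loads: row i is split into a low half L[i] (cols 0-3) and a high half H[i] (cols 4-7)
    float32x4x4_t q0 = vld1q_f32_x4(src);       // rows 0,1
    float32x4x4_t q1 = vld1q_f32_x4(src + 16);  // rows 2,3
    float32x4x4_t q2 = vld1q_f32_x4(src + 32);  // rows 4,5
    float32x4x4_t q3 = vld1q_f32_x4(src + 48);  // rows 6,7

    float32x4_t L0 = q0.val[0], H0 = q0.val[1], L1 = q0.val[2], H1 = q0.val[3];
    float32x4_t L2 = q1.val[0], H2 = q1.val[1], L3 = q1.val[2], H3 = q1.val[3];
    float32x4_t L4 = q2.val[0], H4 = q2.val[1], L5 = q2.val[2], H5 = q2.val[3];
    float32x4_t L6 = q3.val[0], H6 = q3.val[1], L7 = q3.val[2], H7 = q3.val[3];

    float32x4x4_t o;

    // one zip pass pairs row i with row i+4; vst4 then interleaves by 4, finishing the transpose
    o.val[0] = vzip1q_f32(L0, L4); o.val[1] = vzip1q_f32(L1, L5);
    o.val[2] = vzip1q_f32(L2, L6); o.val[3] = vzip1q_f32(L3, L7);
    vst4q_f32(dst, o);        // output rows 0,1 (source columns 0,1)

    o.val[0] = vzip2q_f32(L0, L4); o.val[1] = vzip2q_f32(L1, L5);
    o.val[2] = vzip2q_f32(L2, L6); o.val[3] = vzip2q_f32(L3, L7);
    vst4q_f32(dst + 16, o);   // output rows 2,3

    o.val[0] = vzip1q_f32(H0, H4); o.val[1] = vzip1q_f32(H1, H5);
    o.val[2] = vzip1q_f32(H2, H6); o.val[3] = vzip1q_f32(H3, H7);
    vst4q_f32(dst + 32, o);   // output rows 4,5

    o.val[0] = vzip2q_f32(H0, H4); o.val[1] = vzip2q_f32(H1, H5);
    o.val[2] = vzip2q_f32(H2, H6); o.val[3] = vzip2q_f32(H3, H7);
    vst4q_f32(dst + 48, o);   // output rows 6,7
}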