NEON: vectorizing the sum of products of unsigned bytes: (a[i]-int1) * (b[i]-int2)
I need to improve a loop because it is called thousands of times by my application. I suppose I need to do it with NEON, but I don't know where to begin.
Assumptions / pre-conditions:
w is always 320 (a multiple of 16/32).
pa and pb are 16-byte aligned.
ma and mb are positive.
int whileInstruction (const unsigned char *pa,const unsigned char *pb,int ma,int mb,int w)
{
    int sum=0;
    do {
        sum += ((*pa++)-ma)*((*pb++)-mb);
    } while(--w);
    return sum;
}
This attempt at vectorizing it is not working well, and isn't safe (missing clobbers), but demonstrates what I'm trying to do:
int whileInstruction (const unsigned char *pa,const unsigned char *pb,int ma,int mb,int w)
{
    asm volatile("lsr %2, %2, #3 \n"
                 ".loop: \n"
                 "# load 8 elements: \n"
                 "vld4.8 {d0-d3}, [%1]! \n"
                 "vld4.8 {d4-d7}, [%2]! \n"
                 "# do the operation: \n"
                 "vaddl.u8 q7, d0, r7 \n"
                 "vaddl.u8 q8, d1, d8 \n"
                 "vmlal.u8 q7, q7, q8 \n"
                 "# Sum the vector and save in sum (this is wrong):\n"
                 "vaddl.u8 q7, d0, r7 \n"
                 "subs %2, %2, #1 \n" // Decrement iteration count
                 "bne .loop \n" // Repeat until iteration count is zero
                 :
                 : "r"(pa), "r"(pb), "r"(w),"r"(ma),"r"(mb),"r"(sum)
                 : "r4", "r5", "r6","r7","r8","r9"
    );
    return sum;
}
Comments (2)
Here is a simple NEON implementation. I have tested this against the scalar code to make sure that it works. Note that for best performance both pa and pb should be 16-byte aligned.
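For illustration, here is a minimal sketch of how a NEON intrinsics version of the loop from the question might look. It is not necessarily the code this answer refers to. The names sum_products_scalar and sum_products_neon are hypothetical, and the sketch assumes that w is a positive multiple of 16 and that ma and mb lie in 0..255, so each difference fits in a signed 16-bit lane. On targets without NEON it falls back to the scalar loop so that it still compiles.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar reference: same arithmetic as the loop in the question. */
int sum_products_scalar(const unsigned char *pa, const unsigned char *pb,
                        int ma, int mb, int w)
{
    int sum = 0;
    for (int i = 0; i < w; i++)
        sum += (pa[i] - ma) * (pb[i] - mb);
    return sum;
}

/* Hypothetical name. Assumes w is a positive multiple of 16 and
   0 <= ma, mb <= 255, so every difference fits in a signed 16-bit lane. */
int sum_products_neon(const unsigned char *pa, const unsigned char *pb,
                      int ma, int mb, int w)
{
    assert(w > 0 && w % 16 == 0);
    assert(ma >= 0 && ma <= 255 && mb >= 0 && mb <= 255);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int16x8_t vma = vdupq_n_s16((int16_t)ma);
    const int16x8_t vmb = vdupq_n_s16((int16_t)mb);
    int32x4_t vsum = vdupq_n_s32(0);

    for (int i = 0; i < w; i += 16) {
        /* Load 16 bytes from each input (vld1q_u8 does not require alignment). */
        uint8x16_t va = vld1q_u8(pa + i);
        uint8x16_t vb = vld1q_u8(pb + i);

        /* Widen u8 -> 16 bit, then subtract the offsets as signed values. */
        int16x8_t a_lo = vsubq_s16(vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(va))), vma);
        int16x8_t a_hi = vsubq_s16(vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(va))), vma);
        int16x8_t b_lo = vsubq_s16(vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(vb))), vmb);
        int16x8_t b_hi = vsubq_s16(vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(vb))), vmb);

        /* Widening multiply-accumulate: 16 x 16 -> 32 bit, four lanes at a time. */
        vsum = vmlal_s16(vsum, vget_low_s16(a_lo), vget_low_s16(b_lo));
        vsum = vmlal_s16(vsum, vget_high_s16(a_lo), vget_high_s16(b_lo));
        vsum = vmlal_s16(vsum, vget_low_s16(a_hi), vget_low_s16(b_hi));
        vsum = vmlal_s16(vsum, vget_high_s16(a_hi), vget_high_s16(b_hi));
    }

    /* Horizontal add of the four 32-bit partial sums. */
    int32x2_t s2 = vadd_s32(vget_low_s32(vsum), vget_high_s32(vsum));
    s2 = vpadd_s32(s2, s2);
    return vget_lane_s32(s2, 0);
#else
    /* No NEON available: fall back so the sketch still compiles. */
    return sum_products_scalar(pa, pb, ma, mb, w);
#endif
}
```

Widening to 16 bits before the subtraction is what keeps negative differences representable, and accumulating in 32-bit lanes matches the int accumulator of the scalar loop. Comparing the two functions on the same buffers, as the answer describes doing, is a cheap way to check a variant like this.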
Well, here is another solution to my problem, based on the perfect solution by Paul R. In the case where w is equal to 8, which is what usually happens, it is possible to use this function:
Maybe it is possible to improve it.
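As an illustration of the w == 8 case, a minimal sketch (not this answer's original function) might look like the following. The name sum_products_8 is hypothetical, and the sketch assumes ma and mb lie in 0..255. With exactly 8 elements no loop is needed: one 8-byte load per input, one widening multiply, one widening multiply-accumulate, and a horizontal add. A scalar fallback keeps it compilable on targets without NEON.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical name; processes exactly 8 elements.
   Assumes 0 <= ma, mb <= 255 so each difference fits in 16 bits. */
int sum_products_8(const unsigned char *pa, const unsigned char *pb,
                   int ma, int mb)
{
    assert(ma >= 0 && ma <= 255 && mb >= 0 && mb <= 255);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Load 8 bytes, widen to 16 bit, subtract the offsets. */
    int16x8_t a = vsubq_s16(vreinterpretq_s16_u16(vmovl_u8(vld1_u8(pa))),
                            vdupq_n_s16((int16_t)ma));
    int16x8_t b = vsubq_s16(vreinterpretq_s16_u16(vmovl_u8(vld1_u8(pb))),
                            vdupq_n_s16((int16_t)mb));

    /* Products of the low four lanes, then accumulate the high four. */
    int32x4_t prod = vmull_s16(vget_low_s16(a), vget_low_s16(b));
    prod = vmlal_s16(prod, vget_high_s16(a), vget_high_s16(b));

    /* Horizontal add of the four 32-bit lanes. */
    int32x2_t s2 = vadd_s32(vget_low_s32(prod), vget_high_s32(prod));
    s2 = vpadd_s32(s2, s2);
    return vget_lane_s32(s2, 0);
#else
    /* Scalar fallback with the same arithmetic. */
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (pa[i] - ma) * (pb[i] - mb);
    return sum;
#endif
}
```

For example, with a = {0,1,...,7}, b = {7,6,...,0}, ma = 3 and mb = 4, each term is (i-3)*(3-i), so the result is -44.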