VFP Unit Matrix Multiply problem on the iPhone
I'm trying to write a Matrix3x3 multiply using the Vector Floating Point unit on the iPhone, but I'm running into some problems. This is my first attempt at writing any ARM assembly, so there may be a fairly simple solution that I'm not seeing.
I currently have a small application running on a maths library that I've written. I'm investigating the benefits that the Vector Floating Point unit would provide, so I've taken my matrix multiply and converted it to asm. Previously the application ran without a problem, but now my objects all disappear at random. This seems to be caused by the results of my matrix multiply becoming NaN at some point.
Here's the code:
IMatrix3x3 operator*(IMatrix3x3 & _A, IMatrix3x3 & _B)
{
IMatrix3x3 C;
//C++ code for the simulator
#if TARGET_IPHONE_SIMULATOR == true
C.A0 = _A.A0 * _B.A0 + _A.A1 * _B.B0 + _A.A2 * _B.C0;
C.A1 = _A.A0 * _B.A1 + _A.A1 * _B.B1 + _A.A2 * _B.C1;
C.A2 = _A.A0 * _B.A2 + _A.A1 * _B.B2 + _A.A2 * _B.C2;
C.B0 = _A.B0 * _B.A0 + _A.B1 * _B.B0 + _A.B2 * _B.C0;
C.B1 = _A.B0 * _B.A1 + _A.B1 * _B.B1 + _A.B2 * _B.C1;
C.B2 = _A.B0 * _B.A2 + _A.B1 * _B.B2 + _A.B2 * _B.C2;
C.C0 = _A.C0 * _B.A0 + _A.C1 * _B.B0 + _A.C2 * _B.C0;
C.C1 = _A.C0 * _B.A1 + _A.C1 * _B.B1 + _A.C2 * _B.C1;
C.C2 = _A.C0 * _B.A2 + _A.C1 * _B.B2 + _A.C2 * _B.C2;
//VPU ARM asm for the device
#else
//create a pointer to the Matrices
IMatrix3x3 * pA = &_A;
IMatrix3x3 * pB = &_B;
IMatrix3x3 * pC = &C;
//asm code
asm volatile(
//turn on a vector depth of 3
"fmrx r0, fpscr \n\t"
"bic r0, r0, #0x00370000 \n\t"
"orr r0, r0, #0x00020000 \n\t"
"fmxr fpscr, r0 \n\t"
//load matrix B into the vector bank
"fldmias %1, {s8-s16} \n\t"
//load the first row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calculate C.A0, C.A1 and C.A2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//load the second row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calculate C.B0, C.B1 and C.B2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//load the third row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calculate C.C0, C.C1 and C.C2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//set the vector depth back to 1
"fmrx r0, fpscr \n\t"
"bic r0, r0, #0x00370000 \n\t"
"orr r0, r0, #0x00000000 \n\t"
"fmxr fpscr, r0 \n\t"
//pass the inputs and set the clobber list
: "+r"(pA), "+r"(pB), "+r" (pC) :
:"cc", "memory","s0", "s1", "s2", "s8", "s9", "s10", "s11", "s12", "s13", "s14", "s15", "s16", "s17", "s18", "s19"
);
#endif
return C;
}
As far as I can see, that makes sense. While debugging I noticed that if I assign _A = C after the asm and before the return, _A will not necessarily be equal to C, which has only increased my confusion. I had thought it was possibly due to the pointers I'm giving to the VFP being incremented by lines such as "fldmias %0!, {s0-s2} \n\t"; however, my understanding of asm is not good enough to properly understand the problem, nor to see an alternative approach to that line of code.
Anyway, I was hoping someone with a greater understanding than me would be able to see a solution, and any help would be greatly appreciated, thank you :-)
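For readers following the FPSCR setup at the top of the asm block, here is a minimal C++ sketch of the same bit manipulation. It relies on the VFP layout in which LEN occupies bits 16-18 and stores the vector length minus one, and STRIDE occupies bits 20-21. The helper name `with_vector_length` is hypothetical and only models the value written back by `fmxr`; it does not touch any real register.

```cpp
#include <cassert>
#include <cstdint>

// Bits cleared by the bic in the asm: LEN (bits 16-18) and STRIDE (bits 20-21).
const std::uint32_t kLenStrideMask = 0x00370000u;

// Hypothetical helper (not part of any library): returns the FPSCR value with
// STRIDE cleared and LEN set for a vector of `length` elements.
// The LEN field stores length - 1, so a length of 3 is encoded as 2.
inline std::uint32_t with_vector_length(std::uint32_t fpscr, unsigned length) {
    assert(length >= 1 && length <= 8);
    return (fpscr & ~kLenStrideMask) |
           (static_cast<std::uint32_t>(length - 1) << 16);
}
```

Under this model, `with_vector_length(fpscr, 3)` corresponds to the `bic`/`orr #0x00020000` pair, and `with_vector_length(fpscr, 1)` to the reset at the end of the block.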
Edit: I've found that pC seems to be NULL when the asm code is hit, despite being set with pC = &C. I'm assuming this is due to the compiler rearranging the code in a manner that breaks it? I've tried the various methods I've seen for stopping this from happening (like adding everything relevant to the input list, though this shouldn't even be necessary since I'm listing "memory" in the clobber list) and I'm still getting the same problems.
Edit #2: Right, the memory issue seems to have been caused by me not including "r0" in the clobber list; however, fixing that (if it is indeed fixed) doesn't seem to have fixed the underlying problem. I noticed that multiplying a rotation matrix by the identity matrix doesn't work correctly and instead gives 0.88 as the last entry in the matrix instead of 1:
|  0.88  0.48  0 |   | 1  0  0 |   |  0.88  0.48  0    |
| -0.48  0.88  0 | * | 0  1  0 | = | -0.48  0.88  0    |
|  0     0     1 |   | 0  0  1 |   |  0     0     0.88 |
I figured then that my logic must be wrong somewhere, so I stepped through the assembly. Everything seems fine up until the last "fmacs s17, s14, s2 \n\t", where:
s0 = 0 s14 = 0 s17 = 0
s1 = 0 s15 = 0 s18 = 0
s2 = 1 s16 = 1 s19 = 0
so surely the fmacs is performing the operation:
s17 = s17 + s14 * s2 = 0 + 0 * 1 = 0
s18 = s18 + s15 * s2 = 0 + 0 * 1 = 0
s19 = s19 + s16 * s2 = 0 + 1 * 1 = 1
However, the result gives s19 = 0.88, which has left me even more confused. Am I misunderstanding how fmacs works? (P.S. Sorry for what has now become a really long question :-P)
Comments (1)
Solved the problem! I was unaware that the vector banks are "circular".
The register banks s0-s7, s8-s15, s16-s23 and s24-s31 can each hold up to 8 values (the first of these, s0-s7, is the scalar bank), and a vector is selected simply by naming its first register, for example s16 with a length of 4. However, in my case I had been using s14 with a length of 3, assuming this would give me s14, s15 and s16. Instead, because each bank is circular, it wraps back to s8; in other words, I was using s14, s15 and s8.
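The wraparound described above can be sketched in a few lines of C++. This is only a model of the register selection for single-precision short vectors with a stride of 1; the function name `vector_registers` is hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of how a VFP short vector picks its registers
// (stride 1, single precision): the index advances inside the 8-register
// bank of the base register and wraps to the start of that same bank.
inline std::vector<int> vector_registers(int base, int length) {
    assert(base >= 0 && base < 32 && length >= 1 && length <= 8);
    const int bank_start = base & ~7;  // 0, 8, 16 or 24
    std::vector<int> regs;
    for (int i = 0; i < length; ++i) {
        regs.push_back(bank_start + ((base - bank_start + i) % 8));
    }
    return regs;
}
```

With this model, a base of s14 and a length of 3 selects registers 14, 15 and 8, while s8, s11 and s17 with a length of 3 stay inside their banks, which matches why only the third `fmacs` of each row misbehaved.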
It took me a long time to see that, so hopefully anyone else with a similar problem will find this :-)
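As a rough check of this explanation, the final `fmacs s17, s14, s2` from the question can be replayed on a simulated register file. The sketch below is a simplified model, not real VFP behaviour in every detail: `fmacs_model` and `lane` are hypothetical names, and it assumes s8 held 0.88 (the first element of matrix B) at that point, which the question does not state explicitly.

```cpp
#include <array>
#include <cassert>

using RegFile = std::array<float, 32>;

// Index of element i of a short vector starting at `base`,
// wrapping around inside the base register's 8-register bank.
inline int lane(int base, int i) { return (base & ~7) | ((base + i) & 7); }

// Hypothetical model of `fmacs sd, sn, sm` with vector length `len`, where
// sd and sn are vectors in banks 1-3 and sm is a scalar in bank 0 (s0-s7).
inline void fmacs_model(RegFile& s, int d, int n, int m, int len) {
    assert(d >= 8 && n >= 8 && m >= 0 && m < 8);
    for (int i = 0; i < len; ++i) {
        s[lane(d, i)] += s[lane(n, i)] * s[m];
    }
}

// Rebuilds the register state from the question just before the final fmacs.
inline float last_entry_after_final_fmacs() {
    RegFile s{};                   // every register starts at 0
    s[2] = 1.0f;                   // s0-s2  = 0, 0, 1
    s[8] = 0.88f;                  // assumption: first element of matrix B
    s[16] = 1.0f;                  // s14-s16 = 0, 0, 1
    fmacs_model(s, 17, 14, 2, 3);  // fmacs s17, s14, s2
    return s[19];
}
```

In this model the third lane reads s8 rather than s16, so s19 ends up as 0.88 instead of 1, the same value reported in the question. Keeping each 3-element row inside a single bank (for example, not starting a length-3 vector at s14) avoids the wrap.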