Will the following ARM instruction sequence stall?
While programming the ARM11MP VFPU, I looked over the docs and am concerned that the following sequence will stall badly when computing a 4-component dot product (as part of a 4x4 matrix multiply):
fmuls s0, s0, s4
fmacs s0, s1, s5
fmacs s0, s2, s6
fmacs s0, s3, s7
Does the accumulate step generate stalls here? If so, I will have to rearrange things substantially, since I only have 32 single-precision registers to work with and this already takes 9 as it is. Also, I could set up the vector registers to do this in 1 instruction, but I wonder whether the 3 instruction cycles would be worth it, since I'd have to unset vector mode almost immediately to store back to memory, unless I spilled over into the ARM registers. Posting from home without my real SO account here...
Comments (1)
I'm not in any way familiar with ARM, so you should take this with a grain of salt. This answer is just based on about 20 mins of searching around for documentation on my phone. There could be some things I'm missing, so this may not be correct.
In any case, I believe that yes, this should cause pipeline stalls. The VFP coprocessor has an 8-stage pipeline, and each instruction here depends on the result of the previous one, but because of "forwarding" the number of stalled cycles should be reduced to 7 per instruction. Still, with the 4 instructions you have, you would be stalled for about 28 cycles, which isn't very good. This also doesn't account for the time required to load the registers, which could exacerbate the problem.
You can probably improve performance by interleaving the "fld" instructions with the fmacs instructions.
Check out the following for more info:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0360f/CACBBDCE.html
The result of an "fld" instruction should be available within 4 cycles, which means that if you can interleave the loads with the fmacs instructions so that they occupy some of the stall slots, you could reduce the total number of stalled cycles to 17.
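As a rough illustration of that kind of interleaving, here is a minimal sketch, not a tuned or cycle-counted schedule. It assumes, hypothetically, that r0 points at the four "a" components, r1 at the four "b" components and r2 at the result, and it uses flds/fsts, the single-precision forms of the load and store. The number of stall cycles it actually leaves on this core has not been measured.

```asm
    @ Hypothetical register use: r0 -> a[0..3], r1 -> b[0..3], r2 -> result
    flds    s0, [r0]           @ a0
    flds    s4, [r1]           @ b0
    flds    s1, [r0, #4]       @ a1
    flds    s5, [r1, #4]       @ b1
    fmuls   s0, s0, s4         @ s0  = a0 * b0
    flds    s2, [r0, #8]       @ a2, loaded while the multiply is in flight
    flds    s6, [r1, #8]       @ b2
    fmacs   s0, s1, s5         @ s0 += a1 * b1
    flds    s3, [r0, #12]      @ a3, loaded while the accumulate is in flight
    flds    s7, [r1, #12]      @ b3
    fmacs   s0, s2, s6         @ s0 += a2 * b2
    fmacs   s0, s3, s7         @ s0 += a3 * b3 (nothing left to hide this wait)
    fsts    s0, [r2]           @ store the dot product
```

Each pair of loads sits between two dependent arithmetic instructions, so some of the cycles spent waiting on s0 should be used for work that has to happen anyway. The last fmacs still waits for the full latency of the one before it.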
Assuming you are doing this in a loop, you could probably further reduce stalling by starting work on the "next" loop iteration while the current iteration is still executing (i.e. loop unrolling). Also, depending on how your data is stored, once you are unrolling the loop you can probably improve things even more by using fldm instead of fld instructions.
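One way to picture overlapping independent work: in a 4x4 multiply, the four dot products that make up one output row do not depend on each other, so their multiply-accumulates can be interleaved so that no instruction needs the result of the one just before it. The sketch below is an illustration under stated assumptions, not the asker's code. It assumes row-major matrices, hypothetical pointers r0 (one row of A), r1 (start of B) and r2 (one output row), and the FPSCR vector length left at its default scalar setting.

```asm
    @ Hypothetical register use: r0 -> A row, r1 -> B (row-major), r2 -> C row
    fldmias r0, {s0-s3}        @ a0..a3
    fldmias r1!, {s4-s7}       @ B row 0, r1 advances by 16 bytes
    fldmias r1!, {s12-s15}     @ B row 1
    fmuls   s8,  s0, s4        @ c0  = a0 * b00
    fmuls   s9,  s0, s5        @ c1  = a0 * b01
    fmuls   s10, s0, s6        @ c2  = a0 * b02
    fmuls   s11, s0, s7        @ c3  = a0 * b03
    fldmias r1!, {s4-s7}       @ B row 2 (s4-s7 are free again)
    fmacs   s8,  s1, s12       @ c0 += a1 * b10
    fmacs   s9,  s1, s13       @ c1 += a1 * b11
    fmacs   s10, s1, s14       @ c2 += a1 * b12
    fmacs   s11, s1, s15       @ c3 += a1 * b13
    fldmias r1!, {s12-s15}     @ B row 3 (s12-s15 are free again)
    fmacs   s8,  s2, s4        @ c0 += a2 * b20
    fmacs   s9,  s2, s5        @ c1 += a2 * b21
    fmacs   s10, s2, s6        @ c2 += a2 * b22
    fmacs   s11, s2, s7        @ c3 += a2 * b23
    fmacs   s8,  s3, s12       @ c0 += a3 * b30
    fmacs   s9,  s3, s13       @ c1 += a3 * b31
    fmacs   s10, s3, s14       @ c2 += a3 * b32
    fmacs   s11, s3, s15       @ c3 += a3 * b33
    fstmias r2, {s8-s11}       @ store c0..c3; r1 now points 64 bytes past B
```

Each accumulator is only touched every fourth arithmetic instruction, which should shorten the wait on a dependent result, though it may not remove it entirely. The cost is 16 single-precision registers instead of 8.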
In any case, optimizing pipeline behavior by hand is difficult. Is there a reason you can't let the compiler do the instruction scheduling for you?