How to use the multiply-accumulate intrinsics on the ARM Cortex-A8?
How do I use the multiply-accumulate intrinsics provided by GCC?

float32x4_t vmlaq_f32 (float32x4_t, float32x4_t, float32x4_t);

Can anyone explain the three parameters I have to pass to this function? I mean, which are the source and destination registers, and what does the function return?
Help!!!
Comments (3)
Simply put, the vmla instruction does the following:
And all this compiles into a single assembler instruction :-)
Among other things, you can use this NEON intrinsic in typical 4x4 matrix multiplications for 3D graphics, like this:
This saves a couple of cycles because you don't have to add the results after the multiplication. The addition is used so often that multiply-accumulates have become mainstream these days (even x86 has added them in some recent SSE instruction sets).
Also worth mentioning: multiply-accumulate operations like this are very common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a fast-path inside the Cortex-A8 NEON core. This fast-path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VMUL or VMLA instruction. I could go into detail, but in a nutshell such an instruction series runs four times faster than a VMUL / VADD / VMUL / VADD series.
Take a look at my simple matrix multiply: I did exactly that. Due to this fast-path, it will run roughly four times faster than an implementation written using VMUL and VADD instead of VMLA.
Googled for vmlaq_f32, which turned up the reference for the RVCT compiler tools. Here's what it says:

...

AND

...

IOW, the return value from the function will be a vector containing 4 32-bit floats, and each element of the vector is calculated by multiplying the corresponding elements of b and c, and adding the contents of a.

HTH
This sequence won't work, though. The problem is that the x component accumulates only x modulated by the matrix rows; it can be expressed as:
...
The correct sequence would be:
...
NEON and SSE don't have built-in selection of the fields (this would require 8 bits in the instruction encoding, per vector register). GLSL/HLSL, for example, do have this kind of facility, and so do most GPUs.
Alternative way to achieve this would be:
... // and of course, the matrix would be transposed for this to yield the same result
The mul, madd, madd, madd sequence is usually preferred, as it does not require a write mask on the destination register's fields.
Otherwise the code looks good. =)