How do I use the multiply-accumulate intrinsics on ARM Cortex-A8?

Posted on 2024-09-09 15:58:39

How do I use the multiply-accumulate intrinsics provided by GCC?

float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);

Can anyone explain the three parameters I have to pass to this function? I mean which are the source and destination registers, and what the function returns.

Help!!!

Comments (3)

日记撕了你也走了 2024-09-16 15:58:39

Simply put, the vmla instruction does the following:

typedef struct
{
  float val[4];
} float32x4_t;


/* Scalar model of VMLA: a is the accumulator, b and c are multiplied lane by lane. */
float32x4_t vmla (float32x4_t a, float32x4_t b, float32x4_t c)
{
  float32x4_t result;

  for (int i=0; i<4; i++)
  {
    result.val[i] =  b.val[i]*c.val[i]+a.val[i];
  }

  return result;
}

And all this compiles into a single assembler instruction :-)
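To connect the scalar model to the real intrinsic, here is a minimal sketch of one vmlaq_f32 call with explicit loads and stores. The helper name mla4 is hypothetical, and the stand-in definitions in the #else branch are an assumption added only so the sketch also builds on non-ARM hosts; on a NEON target the real arm_neon.h intrinsics are used.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#else
/* Stand-ins (assumption: only for building on hosts without NEON). */
typedef struct { float val[4]; } float32x4_t;

static float32x4_t vld1q_f32(const float *p)
{
  float32x4_t r;
  for (int i = 0; i < 4; i++) r.val[i] = p[i];
  return r;
}

static void vst1q_f32(float *p, float32x4_t v)
{
  for (int i = 0; i < 4; i++) p[i] = v.val[i];
}

static float32x4_t vmlaq_f32(float32x4_t a, float32x4_t b, float32x4_t c)
{
  float32x4_t r;
  for (int i = 0; i < 4; i++) r.val[i] = a.val[i] + b.val[i] * c.val[i];
  return r;
}
#endif

/* Hypothetical helper: out[i] = acc[i] + x[i] * y[i] for i = 0..3. */
static void mla4(const float *acc, const float *x, const float *y, float *out)
{
  assert(acc != NULL && x != NULL && y != NULL && out != NULL);

  float32x4_t a = vld1q_f32(acc);  /* first argument: the accumulator   */
  float32x4_t b = vld1q_f32(x);    /* second argument: first factor     */
  float32x4_t c = vld1q_f32(y);    /* third argument: second factor     */

  /* The sum is the return value; the variable a is left unchanged. */
  vst1q_f32(out, vmlaq_f32(a, b, c));
}
```

With GCC on an ARMv7 target such as the Cortex-A8, NEON usually has to be enabled with -mfpu=neon. The intrinsic has no separate destination parameter: the compiler picks the registers, and the result is whatever the call returns.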

You can use this NEON intrinsic, among other things, in typical 4x4 matrix multiplications for 3D graphics, like this:

float32x4_t transform (float32x4_t * matrix, float32x4_t vector)
{
  /* in a perfect world this code would compile into just four instructions */
  float32x4_t result;

  result = vml (matrix[0], vector);
  result = vmla (result, matrix[1], vector);
  result = vmla (result, matrix[2], vector);
  result = vmla (result, matrix[3], vector);

  return result;
}

This saves a couple of cycles because you don't have to add the results after the multiplication. The addition is used so often that multiply-accumulate instructions have become mainstream these days (even x86 has added them in its FMA3/FMA4 instruction-set extensions).

Also worth mentioning: multiply-accumulate operations like this are very common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a fast path inside the Cortex-A8 NEON core. This fast path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VMUL or VMLA instruction (written vml and vmla in the code above). I could go into detail, but in a nutshell such an instruction series runs four times faster than a VMUL / VADD / VMUL / VADD series.

Take a look at my simple matrix multiply: I did exactly that. Due to this fast path it will run roughly four times faster than an implementation written using VMUL and VADD instead of VMLA.
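The chained-accumulator pattern described above can be sketched as a dot product, where each vmlaq_f32 takes the previous result as its first argument. This is an illustration only: the name dot_f32 is hypothetical, the #else stand-ins are an assumption for non-ARM hosts, and whether the compiler actually emits back-to-back VMLA instructions, or how fast they run, is not verified here.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#else
/* Stand-ins (assumption: only for building on hosts without NEON). */
typedef struct { float val[4]; } float32x4_t;

static float32x4_t vdupq_n_f32(float s)
{
  float32x4_t r;
  for (int i = 0; i < 4; i++) r.val[i] = s;
  return r;
}

static float32x4_t vld1q_f32(const float *p)
{
  float32x4_t r;
  for (int i = 0; i < 4; i++) r.val[i] = p[i];
  return r;
}

static void vst1q_f32(float *p, float32x4_t v)
{
  for (int i = 0; i < 4; i++) p[i] = v.val[i];
}

static float32x4_t vmlaq_f32(float32x4_t a, float32x4_t b, float32x4_t c)
{
  float32x4_t r;
  for (int i = 0; i < 4; i++) r.val[i] = a.val[i] + b.val[i] * c.val[i];
  return r;
}
#endif

/* Hypothetical helper: dot product of two arrays whose length is a multiple of 4. */
static float dot_f32(const float *x, const float *y, size_t n)
{
  assert(x != NULL && y != NULL);
  assert(n % 4 == 0);

  float32x4_t acc = vdupq_n_f32(0.0f);

  /* Each iteration feeds the previous result back in as the accumulator. */
  for (size_t i = 0; i < n; i += 4)
  {
    acc = vmlaq_f32(acc, vld1q_f32(x + i), vld1q_f32(y + i));
  }

  /* Horizontal sum of the four partial sums. */
  float lanes[4];
  vst1q_f32(lanes, acc);
  return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Note that the four lanes are summed in a different order than a plain scalar loop would use, so results can differ slightly for values that are not exactly representable.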

書生途 2024-09-16 15:58:39

I googled vmlaq_f32 and found the reference for the RVCT compiler tools. Here's what it says:

Vector multiply accumulate: vmla -> Vr[i] := Va[i] + Vb[i] * Vc[i]
...
float32x4_t vmlaq_f32 (float32x4_t a, float32x4_t b, float32x4_t c);

AND

The following types are defined to represent vectors. NEON vector data types are named according to the following pattern:
<type><size>x<number of lanes>_t
For example, int16x4_t is a vector containing four lanes each containing a signed 16-bit integer. Table E.1 lists the vector data types.

In other words, the function returns a vector of four 32-bit floats, where each element is calculated by multiplying the corresponding elements of b and c and adding the corresponding element of a.

HTH

︶葆Ⅱㄣ 2024-09-16 15:58:39
result = vml (matrix[0], vector);
result = vmla (result, matrix[1], vector);
result = vmla (result, matrix[2], vector);
result = vmla (result, matrix[3], vector);

This sequence won't work, though. The problem is that the x component only accumulates x modulated by the matrix rows, which can be expressed as:

result.x = vector.x * (matrix[0][0] + matrix[1][0] + matrix[2][0] + matrix[3][0]);

...

The correct sequence would be:

result = vml (matrix[0], vector.xxxx);
result = vmla(result, matrix[1], vector.yyyy);

...

NEON and SSE don't have general built-in selection (swizzling) of the fields (this would require 8 bits in the instruction encoding per vector register). GLSL and HLSL, for example, do have this kind of facility, and so do most GPUs.
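For the specific case of broadcasting one component, the vector.xxxx / vector.yyyy sequence above can be sketched with the vector-by-scalar intrinsics vmulq_n_f32 and vmlaq_n_f32 (arm_neon.h also offers by-lane forms such as vmlaq_lane_f32). This is a sketch under assumptions, not the answerer's code: the function name and the memory layout, where m[4*k .. 4*k+3] holds the four coefficients that multiply component k of the vector, are hypothetical, and the #else stand-ins exist only for non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#else
/* Stand-ins (assumption: only for building on hosts without NEON). */
typedef struct { float val[4]; } float32x4_t;

static float32x4_t vld1q_f32(const float *p)
{
  float32x4_t r;
  for (int i = 0; i < 4; i++) r.val[i] = p[i];
  return r;
}

static void vst1q_f32(float *p, float32x4_t v)
{
  for (int i = 0; i < 4; i++) p[i] = v.val[i];
}

/* r[i] = a[i] * s */
static float32x4_t vmulq_n_f32(float32x4_t a, float s)
{
  float32x4_t r;
  for (int i = 0; i < 4; i++) r.val[i] = a.val[i] * s;
  return r;
}

/* r[i] = a[i] + b[i] * s */
static float32x4_t vmlaq_n_f32(float32x4_t a, float32x4_t b, float s)
{
  float32x4_t r;
  for (int i = 0; i < 4; i++) r.val[i] = a.val[i] + b.val[i] * s;
  return r;
}
#endif

/* Hypothetical helper: out[i] = sum over k of m[4*k + i] * v[k]. */
static void transform4(const float *m, const float *v, float *out)
{
  assert(m != NULL && v != NULL && out != NULL);

  float32x4_t r = vmulq_n_f32(vld1q_f32(m + 0), v[0]);  /* matrix[0] * vector.xxxx  */
  r = vmlaq_n_f32(r, vld1q_f32(m + 4), v[1]);           /* += matrix[1] * vector.yyyy */
  r = vmlaq_n_f32(r, vld1q_f32(m + 8), v[2]);           /* += matrix[2] * vector.zzzz */
  r = vmlaq_n_f32(r, vld1q_f32(m + 12), v[3]);          /* += matrix[3] * vector.wwww */

  vst1q_f32(out, r);
}
```

Each step still passes the previous result as the accumulator, so this keeps the mul, madd, madd, madd shape of the original sequence while multiplying every matrix column by a single vector component.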

Alternative way to achieve this would be:

result.x = dp4(vector, matrix[0]);
result.y = dp4(vector, matrix[1]);

... // and of course, the matrix would have to be transposed for this to yield the same result
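The dot-product alternative can be sketched in plain scalar C as follows. The vec4 type and both function names are hypothetical (dp4 mirrors the shader-style name used above), and rows[i] is assumed to hold row i of the matrix, i.e. the transposed layout relative to the mul/madd version.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical four-component vector type for illustration. */
typedef struct { float x, y, z, w; } vec4;

/* Four-component dot product. */
static float dp4(vec4 a, vec4 b)
{
  return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Each output component is the dot product of the vector with one matrix row. */
static vec4 transform_rows(const vec4 *rows, vec4 v)
{
  assert(rows != NULL);

  vec4 r;
  r.x = dp4(v, rows[0]);
  r.y = dp4(v, rows[1]);
  r.z = dp4(v, rows[2]);
  r.w = dp4(v, rows[3]);
  return r;
}
```

Here every result component is written separately, which is the per-field write that the mul/madd sequence avoids.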

The mul, madd, madd, madd sequence is usually preferred because it does not require a write mask for the target register fields.

Otherwise the code looks good. =)
