Cblas with 4x4 matrices using iOS 4's Accelerate framework

Posted on 2024-09-27 23:03:02

I’ve been looking into the Accelerate framework that was made available in iOS 4. Specifically, I made some attempts to use the Cblas routines in my linear algebra library in C. Now I can’t get the use of these functions to give me any performance gain over very basic routines. Specifically, the case of 4x4 matrix multiplication. Wherever I couldn’t make use of affine or homogeneous properties of the matrices, I’ve been using this routine (abridged):

float *mat4SetMat4Mult(const float *m0, const float *m1, float *target) {
    target[0] = m0[0] * m1[0] + m0[4] * m1[1] + m0[8] * m1[2] + m0[12] * m1[3];
    target[1] = ...etc...
    ...
    target[15] = m0[3] * m1[12] + m0[7] * m1[13] + m0[11] * m1[14] + m0[15] * m1[15];
    return target;
}
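The elided lines follow the same pattern as the first and last entries shown. Assuming column-major storage with element (row r, column c) at index c*4 + r (consistent with those two lines), the whole routine collapses to this loop form (the `Loop` suffix on the name is mine, not from the original code):

```c
/* Loop form of the unrolled routine above: target = m0 * m1,
   column-major, element (r, c) stored at index c*4 + r. */
float *mat4SetMat4MultLoop(const float *m0, const float *m1, float *target) {
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            target[c * 4 + r] = m0[r]      * m1[c * 4]
                              + m0[4 + r]  * m1[c * 4 + 1]
                              + m0[8 + r]  * m1[c * 4 + 2]
                              + m0[12 + r] * m1[c * 4 + 3];
    return target;
}
```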

The equivalent function call for Cblas is:

cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
   4, 4, 4, 1.f, m0, 4, m1, 4, 0.f, target, 4);

Comparing the two by running them through a large number of precomputed matrices filled with random numbers (each function gets exactly the same input every time), the Cblas routine performs about 4x slower when timed with the C clock() function.
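A harness along these lines reproduces the described measurement for the plain C routine; on the device, the cblas_sgemm call would be timed by the same loop. The names and pool/repetition counts here are illustrative, not taken from the original code:

```c
#include <stdlib.h>
#include <time.h>

enum { N_MATS = 1024 };

/* Same computation as the routine in the question, written as a loop
   for brevity (column-major, element (r, c) at index c*4 + r). */
static void mat4Mult(const float *m0, const float *m1, float *target) {
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            target[c * 4 + r] = m0[r]      * m1[c * 4]
                              + m0[4 + r]  * m1[c * 4 + 1]
                              + m0[8 + r]  * m1[c * 4 + 2]
                              + m0[12 + r] * m1[c * 4 + 3];
}

/* Runs `reps` passes over a fixed pool of random matrices and returns
   the elapsed CPU seconds; srand(42) makes every run (and every routine
   under test) see identical inputs. To benchmark Accelerate instead,
   swap the mat4Mult call for the cblas_sgemm call above. */
static double benchMat4(int reps) {
    static float in[N_MATS][16];
    float out[16];
    srand(42);
    for (int i = 0; i < N_MATS; ++i)
        for (int j = 0; j < 16; ++j)
            in[i][j] = (float)rand() / (float)RAND_MAX;

    clock_t start = clock();
    for (int rep = 0; rep < reps; ++rep)
        for (int i = 0; i + 1 < N_MATS; i += 2)
            mat4Mult(in[i], in[i + 1], out);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```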

This doesn’t seem right to me, and I’m left with the feeling that I’m doing something wrong somewhere. Do I have to enable the device’s NEON unit and SIMD functionality somehow? Or shouldn’t I expect better performance with such small matrices?

Very much appreciated,

Bastiaan

Comments (2)

两仪 2024-10-04 23:03:02

The Apple WWDC 2010 presentations say that Accelerate should still give a speedup even for a 3x3 matrix operation, so I would have assumed you should see a slight improvement for 4x4. But something you need to consider is that Accelerate and NEON are designed to greatly speed up integer operations, but not necessarily floating-point operations. You didn't mention your CPU, and it seems that Accelerate will use either NEON or VFP for floating-point operations depending on your CPU. If it uses NEON instructions for 32-bit float operations then it should run fast, but if it uses VFP for 32-bit float or 64-bit double operations, then it will run very slowly (since VFP is not actually SIMD). So you should make sure that you are using 32-bit float operations with Accelerate, and make sure it will use NEON instead of VFP.

And another issue is that even if it does use NEON, there is no guarantee that your C compiler will generate faster NEON code than your simple C function does without NEON instructions, because C compilers such as GCC often generate terrible SIMD code, potentially running slower than standard code. That's why it's always important to test the speed of the generated code, and possibly to inspect the generated assembly by hand to see if your compiler produced bad code.

完美的未来在梦里 2024-10-04 23:03:02

The BLAS and LAPACK libraries are designed for use with what I would consider "medium to large matrices" (from tens to tens of thousands on a side). They will deliver correct results for smaller matrices, but the performance will not be as good as it could be.

There are several reasons for this:

  • In order to deliver top performance, 3x3 and 4x4 matrix operations must be inlined, not in a library; the overhead of making a function call is simply too large to overcome when there is so little work to be done.
  • An entirely different set of interfaces is necessary to deliver top performance. The BLAS interface for matrix multiply takes variables to specify the sizes and leading dimensions of the matrices involved in the computation, not to mention whether or not to transpose the matrices and the storage layout. All those parameters make the library powerful, and don't hurt performance for large matrices. However, by the time it has finished determining that you are doing a 4x4 computation, a function dedicated to doing 4x4 matrix operations and nothing else is already finished.

What this means for you: if you would like to have dedicated small matrix operations provided, please go to bugreport.apple.com and file a bug requesting this feature.
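The interface-overhead point above can be sketched by contrasting an sgemm-style routine (runtime sizes, leading dimensions, alpha/beta) with a function that only ever does 4x4. This is a rough illustration with made-up names, not Accelerate's actual implementation:

```c
/* General column-major multiply with an sgemm-style interface:
   C = alpha*A*B + beta*C, with runtime sizes and leading dimensions. */
static void sgemmLike(int m, int n, int k, float alpha,
                      const float *A, int lda,
                      const float *B, int ldb,
                      float beta, float *C, int ldc) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[p * lda + i] * B[j * ldb + p];
            /* Real BLAS implementations special-case beta == 0 so that
               an uninitialized C is never read; mirrored here. */
            C[j * ldc + i] = (beta == 0.0f)
                           ? alpha * acc
                           : alpha * acc + beta * C[j * ldc + i];
        }
}

/* Dedicated 4x4 version: every size is a compile-time constant, so the
   compiler can unroll freely and there is no parameter dispatch at all. */
static void mat4MultFixed(const float *A, const float *B, float *C) {
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            C[j * 4 + i] = A[i]      * B[j * 4]
                         + A[4 + i]  * B[j * 4 + 1]
                         + A[8 + i]  * B[j * 4 + 2]
                         + A[12 + i] * B[j * 4 + 3];
}
```

Both compute the same product for m = n = k = 4, alpha = 1, beta = 0; the general version simply pays for its flexibility on every call, which dominates when the matrices are this small.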
