LAPACK/BLAS vs. simple "for" loops

Posted 2024-10-18 11:20:14


I want to migrate a piece of code that involves a number of vector and matrix calculations to C or C++, the objective being to speed up the code as much as possible.

Are linear algebra calculations with for loops in C code as fast as calculations using LAPACK/BLAS, or is there some unique speedup from using those libraries?

In other words, could simple C code (using for loops and the like) perform linear algebra calculations as fast as code that utilizes LAPACK/BLAS?


Comments (4)

云淡风轻 2024-10-25 11:20:14


Vendor-provided LAPACK / BLAS libraries (Intel's IPP/MKL have been mentioned, but there's also AMD's ACML, and other CPU vendors like IBM/Power or Oracle/SPARC provide equivalents as well) are often highly optimized for specific CPU abilities that'll significantly boost performance on large datasets.

Often, though, you've got very specific small data to operate on (say, 4x4 matrices or 4D dot products, i.e. operations used in 3D geometry processing), and for that sort of thing BLAS/LAPACK are overkill, because those subroutines first run tests on the properties of the data set to decide which code paths to take. In those situations, simple C/C++ source code, perhaps using SSE2...4 intrinsics and/or compiler-generated vectorization, may beat BLAS/LAPACK.
That's why, for example, Intel has two libraries - MKL for large linear algebra datasets, and IPP for small (graphics vectors) data sets.

In that sense,

  • what exactly is your data set ?
  • What matrix/vector sizes ?
  • What linear algebra operations ?

Also, regarding "simple for loops": Give the compiler the chance to vectorize for you. I.e. something like:

/* assumes DIM_OF_MY_VECTOR is a multiple of 4 */
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4) {
    /* elementwise products, four at a time */
    vecmul[i] = src1[i] * src2[i];
    vecmul[i+1] = src1[i+1] * src2[i+1];
    vecmul[i+2] = src1[i+2] * src2[i+2];
    vecmul[i+3] = src1[i+3] * src2[i+3];
}
/* reduction, also four at a time */
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4)
    dotprod += vecmul[i] + vecmul[i+1] + vecmul[i+2] + vecmul[i+3];

might be a better feed to a vectorizing compiler than the plain

for (i = 0; i < DIM_OF_MY_VECTOR; i++) dotprod += src1[i]*src2[i];

expression. In that sense, what exactly you mean by "calculations with for loops" makes a significant difference.
If your vector dimensions are large enough though, the BLAS version,

dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);

will be cleaner code and likely faster.


燃情 2024-10-25 11:20:14


Probably not. People have put quite a bit of work into ensuring that LAPACK/BLAS routines are optimized and numerically stable. While the code is often somewhat on the complex side, it's usually that way for a reason.

Depending on your intended target(s), you might want to look at the Intel Math Kernel Library. At least if you're targeting Intel processors, it's probably the fastest you're going to find.

深爱不及久伴 2024-10-25 11:20:14


Numerical analysis is hard. At the very least, you need to be intimately aware of the limitations of floating point arithmetic, and know how to sequence operations so that you balance speed with numerical stability. This is non-trivial.

You need to actually have some clue about the balance between speed and stability you actually need. In more general software development, premature optimization is the root of all evil. In numerical analysis, it is the name of the game. If you don't get the balance right the first time, you will have to re-write more-or-less all of it.

And it gets harder when you try to adapt linear algebra proofs into algorithms. You need to actually understand the algebra, so that you can refactor it into a stable (or stable enough) algorithm.

If I were you, I'd target the LAPACK/BLAS API and shop around for the library that works for your data set.

You have plenty of options: LAPACK/BLAS, GSL and other self-optimizing libraries, vendor libraries.

垂暮老矣 2024-10-25 11:20:14


I don't know these libraries very well. But you should consider that library routines usually run a number of checks on their parameters, have an error-reporting mechanism, and may even set up new variables when you call a function... If the calculations are trivial, maybe you can try doing them yourself, adapted to your needs...
