How does BLAS get such extreme performance?

Posted 2024-08-01 22:25:44

Out of curiosity I decided to benchmark my own matrix multiplication function against the BLAS implementation... I was, to say the least, surprised at the result:

Custom Implementation, 10 trials of
1000x1000 matrix multiplication:

Took: 15.76542 seconds.

BLAS Implementation, 10 trials of
1000x1000 matrix multiplication:

Took: 1.32432 seconds.

This is using single precision floating point numbers.

My Implementation:

template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
{
    if ( ADim2!=BDim1 )
        throw std::runtime_error("Error sizes off");

    memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
    int cc2,cc1,cr1;
    for ( cc2=0 ; cc2<BDim2 ; ++cc2 )
        for ( cc1=0 ; cc1<ADim2 ; ++cc1 )
            for ( cr1=0 ; cr1<ADim1 ; ++cr1 )
                C[cc2*ADim2+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];
}

I have two questions:

  1. Given that a matrix-matrix multiplication, say nxm * mxn, requires n*n*m multiplications, in the case above that is 1000^3 or 1e9 operations. How is it possible on my 2.6 GHz processor for BLAS to do 10*1e9 operations in 1.32 seconds? Even if multiplication were a single operation and nothing else were being done, it should take ~4 seconds.
  2. Why is my implementation so much slower?

Comments (8)

玩世 2024-08-08 22:25:44

Most arguments to the second question -- assembler, splitting into blocks etc. (but not the less-than-N^3 algorithms, they are really overdeveloped) -- play a role. But the slowness of your algorithm is caused essentially by the matrix size and the unfortunate arrangement of the three nested loops. Your matrices are so large that they do not fit into cache at once. You can rearrange the loops so that as much work as possible is done on a row while it is in cache, dramatically reducing cache refreshes (by the way, splitting into small blocks has an analogous effect, best if the loops over the blocks are arranged similarly). A model implementation for square matrices follows. On my computer its time consumption was about 1:10 compared with the standard implementation (like yours).
In other words: never program a matrix multiplication along the "row times column" scheme that we learned in school.
After having rearranged the loops, more improvements are obtained by unrolling loops, assembler code etc.

    // c = a*b for m x m matrices stored as arrays of row pointers.
    // The innermost k loop walks one row of b and one row of c contiguously,
    // so data already in cache is reused instead of being refetched.
    void vector(int m, double ** a, double ** b, double ** c) {
      int i, j, k;
      for (i=0; i<m; i++) {
        double * ci = c[i];                         // row i of the result
        for (k=0; k<m; k++) ci[k] = 0.;
        for (j=0; j<m; j++) {
          double aij = a[i][j];                     // scalar reused over the whole row
          double * bj = b[j];                       // row j of b
          for (k=0; k<m; k++)  ci[k] += aij*bj[k];  // contiguous update
        }
      }
    }

One more remark: this implementation is even better on my computer than replacing everything with the BLAS routine cblas_dgemm (try it on your computer!). But calling dgemm_ of the Fortran library directly is much faster (1:4). I think this routine is in fact not Fortran but assembler code (I do not know what is in the library, I don't have the sources). It is totally unclear to me why cblas_dgemm is not as fast, since to my knowledge it is merely a wrapper for dgemm_.
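
For reference, here is a minimal sketch of the two call styles mentioned above, assuming square m x m column-major matrices and a BLAS build that exposes both the CBLAS wrapper and the Fortran symbol; the hand-written extern declaration is there only because no standard C header declares dgemm_:

    #include <cblas.h>

    /* Fortran BLAS symbol, declared by hand; every argument is passed by reference. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    void multiply_both_ways(int m, const double *a, const double *b, double *c)
    {
        const double one = 1.0, zero = 0.0;

        /* C = 1.0*A*B + 0.0*C via the CBLAS wrapper */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, m, m, one, a, m, b, m, zero, c, m);

        /* the same product, calling the Fortran routine directly */
        dgemm_("N", "N", &m, &m, &m, &one, a, &m, b, &m, &zero, c, &m);
    }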

铜锣湾横着走 2024-08-08 22:25:44

With respect to the original code in the MM multiply, the memory references made by most operations are the main cause of the bad performance. Main memory runs 100-1000 times slower than cache.

Most of the speed-up comes from employing loop-optimization techniques on this triple-loop function in the MM multiply. Two main loop-optimization techniques are used: unrolling and blocking. With respect to unrolling, we unroll the two outermost loops and block for data reuse in cache. Outer-loop unrolling helps optimize data access temporally by reducing the number of memory references to the same data at different times during the entire operation. Blocking the loop indices at a specific size helps with keeping the data in cache. You can choose to optimize for the L2 cache or the L3 cache.

https://en.wikipedia.org/wiki/Loop_nest_optimization
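
To make the blocking idea concrete, here is a hedged sketch (not BLAS code) of a tiled triple loop for square row-major matrices; the tile size BS is an assumption that would be tuned to the cache of the target machine:

    #include <string.h>

    enum { BS = 64 };   /* assumed tile size; tune to the L1/L2 cache */

    /* C = A*B for n x n row-major matrices, processed tile by tile so the
       working set of each tile stays in cache. */
    void matmul_blocked(int n, const float *A, const float *B, float *C)
    {
        memset(C, 0, sizeof(float) * (size_t)n * n);
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < ii + BS && i < n; ++i)
                        for (int k = kk; k < kk + BS && k < n; ++k) {
                            const float a = A[i * n + k];         /* reused across the j loop */
                            for (int j = jj; j < jj + BS && j < n; ++j)
                                C[i * n + j] += a * B[k * n + j]; /* contiguous access */
                        }
    }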

花之痕靓丽 2024-08-08 22:25:44

This is a realistic speed-up. For an example of what can be done with SIMD assembler over C++ code, see some example iPhone matrix functions - these were over 8x faster than the C version, and aren't even "optimized" assembly - there is no pipelining yet and there are unnecessary stack operations.

Also, your code is not "restrict correct" - how does the compiler know that when it modifies C, it isn't modifying A and B?
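
As a hedged illustration (not the poster's exact function), here is the same kind of kernel written with C99 restrict, so the compiler may assume the three arrays never alias; most C++ compilers accept __restrict or __restrict__ as an equivalent extension:

    /* C must already be zeroed. restrict promises the compiler that A, B and C
       do not overlap, so values can stay in registers and the inner loop can be
       vectorized instead of being reloaded after every store to C. */
    void mmult_restrict(int n,
                        const float * restrict A,
                        const float * restrict B,
                        float       * restrict C)
    {
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k) {
                const float a = A[i * n + k];
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];
            }
    }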

内心旳酸楚 2024-08-08 22:25:44

For many reasons.

First, Fortran compilers are highly optimized, and the language allows them to be. C and C++ are very loose in terms of array handling (e.g. the case of pointers referring to the same memory area). This means that the compiler cannot know in advance what to do, and is forced to create generic code. In Fortran your cases are more streamlined, and the compiler has better control of what happens, allowing it to optimize more (e.g. by using registers).

Another thing is that Fortran stores data column-wise, while C stores data row-wise. I haven't checked your code, but be careful of how you perform the product. In C you must scan row-wise: that way you scan your array along contiguous memory, reducing cache misses. Cache misses are the first source of inefficiency.

Third, it depends on the BLAS implementation you are using. Some implementations may be written in assembler and optimized for the specific processor you are using. The netlib version is written in Fortran 77.

Also, you are doing a lot of operations, most of them repeated and redundant. All those multiplications to obtain the indices are detrimental to performance. I don't really know how this is done in BLAS, but there are a lot of tricks to avoid such expensive operations.

For example, you could rework your code this way

template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
{
    if ( ADim2!=BDim1 )
        throw std::runtime_error("Error sizes off");

    memset((void*)C, 0, sizeof(ValT)*ADim1*BDim2);
    int cc2, cc1, cr1, a1, a2, a3;
    for ( cc2=0 ; cc2<BDim2 ; ++cc2 ) {
        a1 = cc2*ADim2;              // column offset into C, hoisted out of the inner loops
        a3 = cc2*BDim1;              // column offset into B
        for ( cc1=0 ; cc1<ADim2 ; ++cc1 ) {
            a2 = cc1*ADim1;          // column offset into A
            ValT b = B[a3+cc1];      // scalar reused across the innermost loop
            for ( cr1=0 ; cr1<ADim1 ; ++cr1 ) {
                C[a1+cr1] += A[a2+cr1]*b;
            }
        }
    }
}

Try it, I am sure you will save something.

On your #1 question, the reason is that matrix multiplication scales as O(n^3) if you use the trivial algorithm. There are algorithms that scale much better.

淤浪 2024-08-08 22:25:44

So first of all BLAS is just an interface of about 50 functions. There are many competing implementations of the interface.

Firstly I will mention things that are largely unrelated:

  • Fortran vs C, makes no difference
  • Advanced matrix algorithms such as Strassen: implementations don't use them, as they don't help in practice

Most implementations break each operation into small-dimension matrix or vector operations in the more or less obvious way. For example, a large 1000x1000 matrix multiplication may be broken into a sequence of 50x50 matrix multiplications.

These fixed-size small-dimension operations (called kernels) are hardcoded in CPU-specific assembly code using several CPU features of their target:

  • SIMD-style instructions
  • Instruction Level Parallelism
  • Cache-awareness

Furthermore these kernels can be executed in parallel with respect to each other using multiple threads (CPU cores), in the typical map-reduce design pattern.
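
To give a feel for what the SIMD-style instructions inside such a kernel look like, here is a hedged, much-simplified sketch using SSE intrinsics; the real kernels are hand-written, CPU-specific assembly with careful register and pipeline scheduling:

    #include <xmmintrin.h>

    /* One SIMD step of the inner update: c[0..3] += a[0..3] * b,
       i.e. four single-precision multiply-adds in a handful of instructions. */
    static inline void saxpy4(float *c, const float *a, float b)
    {
        __m128 va = _mm_loadu_ps(a);               /* load 4 floats from a   */
        __m128 vb = _mm_set1_ps(b);                /* broadcast the scalar b */
        __m128 vc = _mm_loadu_ps(c);               /* current partial sums   */
        vc = _mm_add_ps(vc, _mm_mul_ps(va, vb));   /* vc += va * vb          */
        _mm_storeu_ps(c, vc);                      /* store 4 results        */
    }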

Take a look at ATLAS which is the most commonly used open source BLAS implementation. It has many different competing kernels, and during the ATLAS library build process it runs a competition among them (some are even parameterized, so the same kernel can have different settings). It tries different configurations and then selects the best for the particular target system.

(Tip: That is why, if you are using ATLAS, you are better off building and tuning the library by hand for your particular machine rather than using a prebuilt one.)

但可醉心 2024-08-08 22:25:44

First, there are more efficient algorithms for matrix multiplication than the one you're using.

Second, your CPU can do much more than one instruction at a time.

Your CPU executes 3-4 instructions per cycle, and if the SIMD units are used, each instruction processes 4 floats or 2 doubles. (of course this figure isn't accurate either, as the CPU can typically only process one SIMD instruction per cycle)
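
As a hedged back-of-the-envelope check against the numbers in the question: even at only one 4-wide SIMD operation per cycle, a 2.6 GHz core delivers roughly 2.6e9 * 4, about 10 single-precision GFLOP/s. Ten 1000x1000 products cost about 2e10 floating-point operations (1e9 multiplies plus 1e9 additions each), i.e. about 2 seconds at that rate, and on cores that can overlap a SIMD multiply with a SIMD add, or that use more than one core, the measured 1.32 seconds is comfortably within reach. So question 1 is not a paradox.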

Third, your code is far from optimal:

  • You're using raw pointers, which means that the compiler has to assume they may alias. There are compiler-specific keywords or flags you can specify to tell the compiler that they don't alias. Alternatively, you should use other types than raw pointers, which take care of the problem.
  • You're thrashing the cache by performing a naive traversal of each row/column of the input matrices. You can use blocking to perform as much work as possible on a smaller block of the matrix, which fits in the CPU cache, before moving on to the next block.
  • For purely numerical tasks, Fortran is pretty much unbeatable, and C++ takes a lot of coaxing to get up to a similar speed. It can be done, and there are a few libraries demonstrating it (typically using expression templates), but it's not trivial, and it doesn't just happen.

陌路黄昏 2024-08-08 22:25:44

I don't know specifically about the BLAS implementation, but there are more efficient algorithms for matrix multiplication with better than O(n^3) complexity. A well-known one is the Strassen algorithm.

瑾夏年华 2024-08-08 22:25:44

A good starting point is the great book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. They provide a free download version.

BLAS is divided into three levels:

  • Level 1 defines a set of linear algebra functions that operate on vectors only. These functions benefit from vectorization (e.g. from using SIMD such as SSE).

  • Level 2 functions are matrix-vector operations, e.g. some matrix-vector product. These functions could be implemented in terms of Level 1 functions. However, you can boost the performance of these functions if you can provide a dedicated implementation that makes use of some multiprocessor architecture with shared memory.

  • Level 3 functions are operations like the matrix-matrix product. Again you could implement them in terms of Level 2 functions. But Level 3 functions perform O(N^3) operations on O(N^2) data. So if your platform has a cache hierarchy then you can boost performance if you provide a dedicated implementation that is cache optimized/cache friendly. This is nicely described in the book. The main boost of Level 3 functions comes from cache optimization. This boost significantly exceeds the second boost from parallelism and other hardware optimizations.
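
To make the three levels concrete, here is a hedged sketch with one representative CBLAS call per level (single precision, column-major; the sizes, scalars and arrays are assumed to be set up by the caller):

    #include <cblas.h>

    void blas_levels_example(int n, float alpha,
                             const float *x, float *y,
                             const float *A, const float *B, float *C)
    {
        /* Level 1: vector-vector, y := alpha*x + y */
        cblas_saxpy(n, alpha, x, 1, y, 1);

        /* Level 2: matrix-vector, y := alpha*A*x + 1.0*y */
        cblas_sgemv(CblasColMajor, CblasNoTrans, n, n,
                    alpha, A, n, x, 1, 1.0f, y, 1);

        /* Level 3: matrix-matrix, C := alpha*A*B + 0.0*C */
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, A, n, B, n, 0.0f, C, n);
    }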

By the way, most (or even all) of the high-performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C, with its performance-critical parts in assembler. Only the reference implementation of BLAS is implemented in Fortran. However, all these BLAS implementations provide a Fortran interface so that they can be linked against LAPACK (LAPACK gains all its performance from BLAS).

Optimized compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS the compiler does not matter at all).

IMHO no BLAS implementation uses algorithms like the Coppersmith–Winograd algorithm or the Strassen algorithm. The likely reasons are:

  • Maybe it's not possible to provide a cache-optimized implementation of these algorithms (i.e. you would lose more than you would win)
  • These algorithms are numerically not stable. As BLAS is the computational kernel of LAPACK this is a no-go.
  • Although these algorithms have a nice time complexity on paper, the Big O notation hides a large constant, so it only starts to become viable for extremely large matrices.

Edit/Update:

The new and groundbreaking papers for this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their paper. Actually I implemented several variants of the matrix-matrix product. The simplest variant is entirely written in plain C and has less than 450 lines of code. All the other variants merely optimize the loops:

    // Micro-kernel fragment: MR x NR is the register block of C, accumulated
    // in AB (column-major); kc is the depth of the packed panels that A and B
    // point into. Declarations are omitted in the original fragment.
    for (l=0; l<MR*NR; ++l) {
        AB[l] = 0;                        // clear the accumulator block
    }
    for (l=0; l<kc; ++l) {                // walk the packed panels of A and B
        for (j=0; j<NR; ++j) {
            for (i=0; i<MR; ++i) {
                AB[i+j*MR] += A[i]*B[j];  // rank-1 update of the MR x NR block
            }
        }
        A += MR;                          // next column of the A panel
        B += NR;                          // next row of the B panel
    }

The overall performance of the matrix-matrix product only depends on these loops. About 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:

ulmBLAS: Tutorial on GEMM (Matrix-Matrix Product)

Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can gain such performance. And why it does not matter whether you use row or column major storage!

The final benchmarks are here (we called our project ulmBLAS):

Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen

Another Edit/Update:

I also wrote some tutorials on how BLAS is used for numerical linear algebra problems like solving a system of linear equations:

High Performance LU Factorization

(This LU factorization is for example used by Matlab for solving a system of linear equations.)

I hope to find time to extend the tutorial to describe and demonstrate how to realise a highly scalable parallel implementation of the LU factorization like in PLASMA.

Ok, here you go: Coding a Cache Optimized Parallel LU Factorization

P.S.: I also did some experiments on improving the performance of uBLAS. It is actually pretty simple to boost (yeah, play on words :) ) the performance of uBLAS:

Experiments on uBLAS.

Here is a similar project with BLAZE:

Experiments on BLAZE.
