GotoBLAS2 performance

Posted on 2024-12-23 07:53:34

I've got some code which performs a packed symmetric matrix inversion and multiplication using the LAPACK routines DPPTRF, DPPTRI, and DSPMV. Here is an older topic in which you can see the C++ code I use to invoke the LAPACK routines.
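For readers unfamiliar with the packed format these routines operate on, here is a minimal sketch of the layout, assuming upper-triangle ('U') packing in column-major order as described in the LAPACK documentation. The helper names `packed_index` and `pack_upper` are hypothetical and are not part of LAPACK or of the linked code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 0-based position of element (i, j), with i <= j, inside an upper-packed
// array: column j starts after the 1 + 2 + ... + j entries of earlier columns.
inline std::size_t packed_index(std::size_t i, std::size_t j) {
    return i + j * (j + 1) / 2;
}

// Copy the upper triangle of a dense, row-major, symmetric n x n matrix
// into packed storage of length n * (n + 1) / 2.
std::vector<double> pack_upper(const std::vector<double>& dense, std::size_t n) {
    assert(dense.size() == n * n);
    std::vector<double> ap(n * (n + 1) / 2);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i <= j; ++i) {
            ap[packed_index(i, j)] = dense[i * n + j];
        }
    }
    return ap;
}
```

The packed array holds n(n+1)/2 doubles instead of n*n, which is roughly half the memory of a full dense matrix.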

My code currently assembles a symmetric matrix which is mostly populated along the diagonal.

I am testing different BLAS and LAPACK implementations and I am comparing GotoBLAS2 with the reference LAPACK implementation from netlib.

Here is how I compile the netlib LAPACK code. I select the .f code files from source, and compile them all into a compact static library like this:

$ ls
ddot.f   dpptrf.f  dscal.f  dspr.f   dtpsv.f   lsame.f
dgemm.f  dpptri.f  dspmv.f  dtpmv.f  dtptri.f  xerbla.f
$ gfortran -c *.f
$ ar rcs liblapack_lite.a *.o

I can then link this lib to my C++ application using -llapack_lite -lgfortran.

I then tried using GotoBLAS2. I got it from here. The package contained scripts that compiled a massive 19MB static lib automatically. It works great with my existing code by linking it: -lgoto2_nehalemp-r1.13.

I felt that this went well at first. With GotoBLAS2, on large problem sets (inverting 1000x1000 or larger matrices), I saw about a 6x performance increase. Since GotoBLAS is threaded for my architecture and reference LAPACK is single-threaded, I thought this was reasonable. System monitor also showed >300% CPU usage, which corroborates this.
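When comparing a threaded library against a single-threaded one, the choice of clock matters. On POSIX systems `std::clock` reports CPU time for the whole process, summed across threads, so it can hide a multithreaded speedup. The following is a minimal sketch of a wall-clock timer built on `std::chrono::steady_clock`. The name `time_seconds` is hypothetical, and this is not necessarily how the figures above were measured.

```cpp
#include <chrono>
#include <utility>

// Run the callable once and return the elapsed wall-clock time in seconds.
template <class F>
double time_seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A call such as `time_seconds([&] { /* factorize and invert here */ })` would wrap the section being measured. Repeating the measurement several times and taking the minimum reduces noise from other processes.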

Here's where it gets weird. I thought: what if I optimize the reference implementation?

I recompiled my lapack_lite lib like this: gfortran -c -O3 *.f

My lapack_lite lib now outperforms GotoBLAS2 even on a 3200x3200 matrix inversion, using only one thread. It also consumes ~80MB less RAM.

The subsequent packed matrix-vector multiply does happen faster with GotoBLAS, however.
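For context, the operation that DSPMV computes is y := alpha*A*x + beta*y with A symmetric and stored in packed form. Here is a plain-loop sketch of it, assuming upper packing and unit strides. The function `spmv_upper_ref` is a hypothetical illustration, not the BLAS routine and not tuned in any way.

```cpp
#include <cstddef>
#include <vector>

// y := alpha * A * x + beta * y, where A is a symmetric n x n matrix whose
// upper triangle is packed column by column in ap (length n * (n + 1) / 2).
void spmv_upper_ref(std::size_t n, double alpha, const std::vector<double>& ap,
                    const std::vector<double>& x, double beta,
                    std::vector<double>& y) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] *= beta;
    }
    std::size_t k = 0;  // running index into the packed array
    for (std::size_t j = 0; j < n; ++j) {
        const double axj = alpha * x[j];
        double dot = 0.0;  // accumulates sum over i < j of A(i,j) * x[i]
        for (std::size_t i = 0; i < j; ++i, ++k) {
            y[i] += axj * ap[k];  // stored A(i,j) contributes to row i
            dot += ap[k] * x[i];  // mirrored A(j,i) contributes to row j
        }
        y[j] += axj * ap[k] + alpha * dot;  // diagonal term plus mirrored part
        ++k;
    }
}
```

Each packed element is read once, so the work is O(n^2) and dominated by memory traffic, unlike the O(n^3) factorization and inversion.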

How is this even remotely possible? Did the make configuration of the GotoBLAS package fail to use an optimization switch with gfortran?
