Why is MATLAB so fast in matrix multiplication?
I am making some benchmarks with CUDA, C++, C#, Java, and using MATLAB for verification and matrix generation. When I perform matrix multiplication with MATLAB, 2048x2048 and even bigger matrices are almost instantly multiplied.
              1024x1024    2048x2048    4096x4096
              ---------    ---------    ---------
CUDA C (ms)       43.11       391.05      3407.99
C++ (ms)        6137.10     64369.29    551390.93
C# (ms)        10509.00    300684.00   2527250.00
Java (ms)       9149.90     92562.28    838357.94
MATLAB (ms)       75.01       423.10      3133.90
Only CUDA is competitive, but I thought that at least C++ would be somewhat close and not 60 times slower. I also don't know what to think about the C# results. The algorithm is just the same as in C++ and Java, but there's a giant jump from 1024 to 2048.
How is MATLAB performing matrix multiplication so fast?
C++ Code:
float temp = 0;
timer.start();
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp = temp + matice1[j][m] * matice2[m][k];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();
12 Answers
This kind of question is recurring and deserves, for once on Stack Overflow, a clearer answer than "MATLAB uses highly optimized libraries" or "MATLAB uses the MKL".
History:
Matrix multiplication (together with matrix-vector and vector-vector multiplication and many of the matrix decompositions) is among the most important problems in linear algebra. Engineers have been solving these problems with computers since the early days.
I'm not an expert on the history, but apparently back then, everybody just rewrote his FORTRAN version with simple loops. Some standardization then came along, with the identification of "kernels" (basic routines) that most linear algebra problems needed in order to be solved. These basic operations were then standardized in a specification called: Basic Linear Algebra Subprograms (BLAS). Engineers could then call these standard, well-tested BLAS routines in their code, making their work much easier.
BLAS:
BLAS evolved from level 1 (the first version which defined scalar-vector and vector-vector operations) to level 2 (vector-matrix operations) to level 3 (matrix-matrix operations), and provided more and more "kernels" so standardized more and more of the fundamental linear algebra operations. The original FORTRAN 77 implementations are still available on Netlib's website.
Towards better performance:
So over the years (notably between the BLAS level 1 and level 2 releases: early 80s), hardware changed, with the advent of vector operations and cache hierarchies. These evolutions made it possible to increase the performance of the BLAS subroutines substantially. Different vendors then came along with their implementation of BLAS routines which were more and more efficient.
I don't know all the historical implementations (I was not born or a kid back then), but two of the most notable ones came out in the early 2000s: the Intel MKL and GotoBLAS. Your Matlab uses the Intel MKL, which is a very good, optimized BLAS, and that explains the great performance you see.
Technical details on Matrix multiplication:
So why is Matlab (the MKL) so fast at dgemm (double-precision general matrix-matrix multiplication)? In simple terms: because it uses vectorization and good caching of data. In more complex terms: see the article provided by Jonathan Moore.

Basically, when you perform your multiplication in the C++ code you provided, you are not at all cache-friendly. Since I suspect you created an array of pointers to row arrays, your accesses in your inner loop to the k-th column of matice2:
matice2[m][k]
are very slow. Indeed, when you access matice2[0][k], you must get the k-th element of array 0 of your matrix. Then in the next iteration, you must access matice2[1][k], which is the k-th element of another array (array 1). Then in the next iteration you access yet another array, and so on... Since the entire matrix matice2 can't fit in the highest caches (it's 8*1024*1024 bytes large), the program must fetch the desired element from main memory, losing a lot of time.

If you just transposed the matrix, so that accesses would be in contiguous memory addresses, your code would already run much faster, because now the compiler can load entire rows into the cache at the same time. Just try this modified version:
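A minimal sketch of that transposed variant, reusing the question's rozmer and matice names and assuming a scratch matrix tempmat of the same size and type:

timer.start();
// Transpose matice2 into tempmat so the inner loop below walks
// contiguous memory instead of jumping between row arrays.
for (int p = 0; p < rozmer; p++)
{
    for (int q = 0; q < rozmer; q++)
    {
        tempmat[p][q] = matice2[q][p];
    }
}
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        float temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            // Both factors are now read along their rows.
            temp += matice1[j][m] * tempmat[k][m];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();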
So you can see how cache locality alone increases your code's performance quite substantially. Now real dgemm implementations exploit that on a very extensive level: they perform the multiplication on blocks of the matrix defined by the size of the TLB (translation lookaside buffer; long story short: what can effectively be cached), so that they stream to the processor exactly the amount of data it can process. The other aspect is vectorization: they use the processor's vectorized instructions for optimal instruction throughput, which you can't really do from cross-platform C++ code.

Finally, people claiming that it's because of Strassen's or the Coppersmith–Winograd algorithm are wrong; neither algorithm is implementable in practice, because of the hardware considerations mentioned above.
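To make the blocking idea above concrete, here is a toy cache-blocked sketch (assuming contiguous, row-major n x n float arrays A, B, and a zero-initialized C, with a block size BS that would be tuned to the cache/TLB; real dgemm kernels are far more elaborate):

#include <algorithm>

void blocked_multiply(const float* A, const float* B, float* C, int n)
{
    const int BS = 64; // block edge, chosen so three BS x BS tiles fit in cache
    for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int mm = 0; mm < n; mm += BS)
                // Multiply one pair of tiles while they stay cache-resident.
                for (int j = jj; j < std::min(jj + BS, n); j++)
                    for (int k = kk; k < std::min(kk + BS, n); k++)
                    {
                        float temp = 0;
                        for (int m = mm; m < std::min(mm + BS, n); m++)
                            temp += A[j * n + m] * B[m * n + k];
                        C[j * n + k] += temp; // accumulate partial block products
                    }
}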
Here are my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:

MATLAB uses highly optimized libraries for matrix multiplication, which is why the plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.

Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:

Update using R2018b on a WIN64 machine with 16 physical cores and a Tesla V100:

(NB: at some point (I forget when exactly) gpuArray switched from MAGMA to cuBLAS - MAGMA is still used for some gpuArray operations though)

Update using R2022a on a WIN64 machine with 32 physical cores and an A100 GPU:
This is why. MATLAB doesn't perform a naive matrix multiplication by looping over every single element the way you did in your C++ code.

Of course, I'm assuming that you just used C=A*B instead of writing a multiplication function yourself.
Matlab incorporated LAPACK some time ago, so I assume their matrix multiplication uses something at least that fast. LAPACK source code and documentation are readily available.
You might also look at Goto and Van De Geijn's paper "Anatomy of High-Performance Matrix Multiplication" at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.1785&rep=rep1&type=pdf
The answer is that the LAPACK and BLAS libraries make MATLAB blindingly fast at matrix operations, not any proprietary code by the folks at MATLAB.

Use the LAPACK and/or BLAS libraries in your C++ code for matrix operations and you should get similar performance to MATLAB. These libraries should be freely available on any modern system, and parts were developed over decades in academia. Note that there are multiple implementations, including some closed source such as the Intel MKL.
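For instance, a minimal sketch calling dgemm through the standard CBLAS interface (assuming an implementation such as OpenBLAS or the MKL is installed and linked, e.g. with -lopenblas):

#include <cblas.h>
#include <vector>

int main()
{
    const int n = 2048;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, with all matrices row-major and n x n.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A.data(), n,
                B.data(), n,
                0.0, C.data(), n);
    return 0;
}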
A discussion of how BLAS gets high performance is available here.

BTW, in my experience it's a serious pain to call the LAPACK libraries directly from C (but worth it). You need to read the documentation VERY precisely.
When doing the matrix multiplication, you use the naive multiplication method, which takes O(n^3) time.

There exist matrix multiplication algorithms which take O(n^2.4). This means that at n=2000 your algorithm requires ~100 times as much computation as the best algorithm (2000^3 / 2000^2.4 = 2000^0.6 ≈ 95).

You should really check the Wikipedia page for matrix multiplication for further information on efficient ways to implement it.
Depending on your version of Matlab, I believe it might be using your GPU already.

Another thing: Matlab keeps track of many properties of your matrix, whether it's diagonal, Hermitian, and so forth, and specializes its algorithms based on them. Maybe it's specializing based on the zero matrix you are passing it, or something like that? Maybe it is caching repeated function calls, which messes up your timings? Perhaps it optimizes away repeated unused matrix products?

To guard against such things happening, use a matrix of random numbers, and make sure you force execution by printing the result to screen or disk or some such.
The general answer to "Why is matlab faster at doing xxx than other programs" is that matlab has a lot of built-in, optimized functions.

The other programs that are used often don't have these functions, so people apply their own creative solutions, which are surprisingly slower than professionally optimized code.

This can be interpreted in two ways:

1) The common/theoretical way: Matlab is not significantly faster, you are just doing the benchmark wrong.

2) The realistic way: for this stuff, Matlab is faster in practice because languages such as C++ are too easily used in ineffective ways.
MATLAB uses a highly optimized implementation of LAPACK from Intel known as the Intel Math Kernel Library (Intel MKL), specifically the dgemm function. This library takes advantage of processor features including SIMD instructions and multi-core processors. They don't document which specific algorithm they use. If you were to call the Intel MKL from C++, you should see similar performance.

I am not sure what library MATLAB uses for GPU multiplication, but probably something like NVIDIA cuBLAS.
Because MATLAB is a programming language that was first developed for numerical linear algebra (matrix manipulations), and it has libraries especially developed for matrix multiplication. And now MATLAB can additionally use GPUs (graphics processing units) for this.

And if we look at your computation results (the table in the question), then we can see that MATLAB is not the only one that is so fast in matrix multiplication: CUDA C (NVIDIA's programming language) has even better results than MATLAB. CUDA C also has libraries especially developed for matrix multiplication, and it uses GPUs.

Short history of MATLAB

What is CUDA C

CUDA C also uses libraries especially developed for matrix multiplication, like OpenGL (Open Graphics Library). It also uses the GPU and Direct3D (on MS Windows).

Comparing CPU and GPU Execution Speeds

From the introduction to the CUDA C Programming Guide:
Advanced reading
Basic Linear Algebra Subprograms (BLAS)
Anatomy of High-Performance Matrix Multiplication, by Kazushige Goto and Robert A. Van De Geijn

Some interesting facts
The sharp contrast is not only due to Matlab's amazing optimization (as discussed by many other answers already), but also to the way you formulated your matrix as an object.

It seems like you made your matrix a list of lists? A list of lists contains pointers to lists, which then contain your matrix elements. The locations of the contained lists are assigned arbitrarily. As you are looping over your first index (the row number?), the time of memory access is very significant. In comparison, why don't you try implementing the matrix as a single list/vector using the following method?
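A minimal sketch of that single-block layout (assuming float elements and the question's rozmer as the matrix dimension):

#include <vector>

// One contiguous allocation per rozmer x rozmer matrix;
// element (i, j) lives at flat index i * rozmer + j.
std::vector<float> matice1(rozmer * rozmer);
std::vector<float> matice2(rozmer * rozmer);
std::vector<float> matice3(rozmer * rozmer);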
And then the multiplication itself, with the same triple-loop algorithm but flat indexing:
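(again just a sketch, reusing the question's names)

for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        float temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            // Single allocation: no pointer chasing between row arrays.
            temp += matice1[j * rozmer + m] * matice2[m * rozmer + k];
        }
        matice3[j * rozmer + k] = temp;
    }
}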
The same multiplication algorithm should be used so that the number of flops is the same (n^3 for square matrices of size n).

I'm asking you to time it so that the result is comparable to what you had earlier (on the same machine). With the comparison, you will see exactly how significant memory access time can be!
It's slow in C++ because you are not using multithreading. Essentially, if A = B C, where they are all matrices, the first row of A can be computed independently from the second row, etc. If A, B, and C are all n by n matrices, you can speed up the multiplication by a factor of up to n^2, since each entry

a_{i,j} = sum_{k} b_{i,k} c_{k,j}

can be computed independently.

If you use, say, Eigen [ http://eigen.tuxfamily.org/dox/GettingStarted.html ], multithreading is built in and the number of threads is adjustable.
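A minimal sketch (assuming Eigen is installed and the code is compiled with OpenMP support, e.g. -fopenmp, so the product runs multithreaded):

#include <Eigen/Dense>
#include <iostream>

int main()
{
    Eigen::setNbThreads(8); // the number of threads is adjustable
    const int n = 2048;
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf C = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf A = B * C;     // calls Eigen's blocked, vectorized GEMM
    std::cout << A(0, 0) << "\n";  // print something to force execution
    return 0;
}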