Why is MATLAB so fast in matrix multiplication?
I am making some benchmarks with CUDA, C++, C#, Java, and using MATLAB for verification and matrix generation. When I perform matrix multiplication with MATLAB, 2048x2048 and even bigger matrices are almost instantly multiplied.
              1024x1024    2048x2048    4096x4096
              ---------    ---------    ---------
CUDA C (ms)       43.11       391.05      3407.99
C++ (ms)        6137.10     64369.29    551390.93
C# (ms)        10509.00    300684.00   2527250.00
Java (ms)       9149.90     92562.28    838357.94
MATLAB (ms)       75.01       423.10      3133.90
Only CUDA is competitive, but I thought that at least C++ would be somewhat close and not 60 times slower. I also don't know what to think about the C# results. The algorithm is just the same as in C++ and Java, but there's a giant jump from 1024 to 2048.
How is MATLAB performing matrix multiplication so fast?
C++ Code:
float temp = 0;
timer.start();
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp = temp + matice1[j][m] * matice2[m][k];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();
12 Answers
This kind of question is recurring and deserves, for once on Stack Overflow, a clearer answer than "MATLAB uses highly optimized libraries" or "MATLAB uses the MKL".
History:
Matrix multiplication (together with matrix-vector and vector-vector multiplication and many of the matrix decompositions) is among the most important problems in linear algebra. Engineers have been solving these problems with computers since the early days.
I'm not an expert on the history, but apparently back then, everybody just rewrote his FORTRAN version with simple loops. Some standardization then came along, with the identification of "kernels" (basic routines) that most linear algebra problems needed in order to be solved. These basic operations were then standardized in a specification called: Basic Linear Algebra Subprograms (BLAS). Engineers could then call these standard, well-tested BLAS routines in their code, making their work much easier.
BLAS:
BLAS evolved from level 1 (the first version which defined scalar-vector and vector-vector operations) to level 2 (vector-matrix operations) to level 3 (matrix-matrix operations), and provided more and more "kernels" so standardized more and more of the fundamental linear algebra operations. The original FORTRAN 77 implementations are still available on Netlib's website.
Towards better performance:
So over the years (notably between the BLAS level 1 and level 2 releases: early 80s), hardware changed, with the advent of vector operations and cache hierarchies. These evolutions made it possible to increase the performance of the BLAS subroutines substantially. Different vendors then came along with their implementation of BLAS routines which were more and more efficient.
I don't know all the historical implementations (I was not born or a kid back then), but two of the most notable ones came out in the early 2000s: the Intel MKL and GotoBLAS. Your Matlab uses the Intel MKL, which is a very good, optimized BLAS, and that explains the great performance you see.
Technical details on Matrix multiplication:
So why is Matlab (the MKL) so fast at dgemm (double-precision general matrix-matrix multiplication)? In simple terms: because it uses vectorization and good caching of data. In more complex terms: see the article provided by Jonathan Moore.

Basically, when you perform your multiplication in the C++ code you provided, you are not at all cache-friendly. Since I suspect you created an array of pointers to row arrays, your accesses in your inner loop to the k-th column of matice2:
matice2[m][k]
are very slow. Indeed, when you access matice2[0][k], you must get the k-th element of array 0 of your matrix. Then in the next iteration, you must access matice2[1][k], which is the k-th element of another array (array 1). Then in the next iteration you access yet another array, and so on... Since the entire matrix matice2 can't fit in the highest caches (it's 8*1024*1024 bytes large), the program must fetch the desired element from main memory, losing a lot of time.

If you just transposed the matrix, so that accesses would be in contiguous memory addresses, your code would already run much faster, because now the compiler can load entire rows into the cache at the same time. Just try this modified version:
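A minimal sketch of that transposed variant, reusing the question's rozmer and matice names and assuming a scratch matrix tempmat of the same size and type:

timer.start();
// Transpose matice2 into tempmat so the inner loop below walks
// contiguous memory instead of jumping between row arrays.
for (int p = 0; p < rozmer; p++)
{
    for (int q = 0; q < rozmer; q++)
    {
        tempmat[p][q] = matice2[q][p];
    }
}
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        float temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            // Both factors are now read along their rows.
            temp += matice1[j][m] * tempmat[k][m];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();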
So you can see how cache locality alone increases your code's performance quite substantially. Now real dgemm implementations exploit that on a very extensive level: they perform the multiplication on blocks of the matrix defined by the size of the TLB (translation lookaside buffer; long story short: what can effectively be cached), so that they stream to the processor exactly the amount of data it can process. The other aspect is vectorization: they use the processor's vectorized instructions for optimal instruction throughput, which you can't really do from cross-platform C++ code.

Finally, people claiming that it's because of Strassen's or the Coppersmith–Winograd algorithm are wrong; neither algorithm is implementable in practice, because of the hardware considerations mentioned above.
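To make the blocking idea above concrete, here is a toy cache-blocked sketch (assuming contiguous, row-major n x n float arrays A, B, and a zero-initialized C, with a block size BS that would be tuned to the cache/TLB; real dgemm kernels are far more elaborate):

#include <algorithm>

void blocked_multiply(const float* A, const float* B, float* C, int n)
{
    const int BS = 64; // block edge, chosen so three BS x BS tiles fit in cache
    for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int mm = 0; mm < n; mm += BS)
                // Multiply one pair of tiles while they stay cache-resident.
                for (int j = jj; j < std::min(jj + BS, n); j++)
                    for (int k = kk; k < std::min(kk + BS, n); k++)
                    {
                        float temp = 0;
                        for (int m = mm; m < std::min(mm + BS, n); m++)
                            temp += A[j * n + m] * B[m * n + k];
                        C[j * n + k] += temp; // accumulate partial block products
                    }
}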
Here are my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:

MATLAB uses highly optimized libraries for matrix multiplication, which is why the plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.

Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:

Update using R2018b on a WIN64 machine with 16 physical cores and a Tesla V100:

(NB: at some point (I forget when exactly) gpuArray switched from MAGMA to cuBLAS - MAGMA is still used for some gpuArray operations though)

Update using R2022a on a WIN64 machine with 32 physical cores and an A100 GPU:
This is why. MATLAB doesn't perform a naive matrix multiplication by looping over every single element the way you did in your C++ code.

Of course, I'm assuming that you just used C=A*B instead of writing a multiplication function yourself.
Matlab incorporated LAPACK some time ago, so I assume their matrix multiplication uses something at least that fast. LAPACK source code and documentation are readily available.
You might also look at Goto and Van De Geijn's paper "Anatomy of High-Performance Matrix Multiplication" at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.1785&rep=rep1&type=pdf
The answer is that the LAPACK and BLAS libraries make MATLAB blindingly fast at matrix operations, not any proprietary code by the folks at MATLAB.

Use the LAPACK and/or BLAS libraries in your C++ code for matrix operations and you should get similar performance to MATLAB. These libraries should be freely available on any modern system, and parts were developed over decades in academia. Note that there are multiple implementations, including some closed source such as the Intel MKL.
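For instance, a minimal sketch calling dgemm through the standard CBLAS interface (assuming an implementation such as OpenBLAS or the MKL is installed and linked, e.g. with -lopenblas):

#include <cblas.h>
#include <vector>

int main()
{
    const int n = 2048;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, with all matrices row-major and n x n.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A.data(), n,
                B.data(), n,
                0.0, C.data(), n);
    return 0;
}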
A discussion of how BLAS gets high performance is available here.

BTW, in my experience it's a serious pain to call the LAPACK libraries directly from C (but worth it). You need to read the documentation VERY precisely.
When doing the matrix multiplication, you use the naive multiplication method, which takes O(n^3) time.

There exist matrix multiplication algorithms which take O(n^2.4). This means that at n=2000 your algorithm requires ~100 times as much computation as the best algorithm (2000^3 / 2000^2.4 = 2000^0.6 ≈ 95).

You should really check the Wikipedia page for matrix multiplication for further information on efficient ways to implement it.
Depending on your version of Matlab, I believe it might be using your GPU already.

Another thing: Matlab keeps track of many properties of your matrix, whether it's diagonal, Hermitian, and so forth, and specializes its algorithms based on them. Maybe it's specializing based on the zero matrix you are passing it, or something like that? Maybe it is caching repeated function calls, which messes up your timings? Perhaps it optimizes away repeated unused matrix products?

To guard against such things happening, use a matrix of random numbers, and make sure you force execution by printing the result to screen or disk or some such.
The general answer to "Why is matlab faster at doing xxx than other programs" is that matlab has a lot of built-in, optimized functions.

The other programs that are used often don't have these functions, so people apply their own creative solutions, which are surprisingly slower than professionally optimized code.

This can be interpreted in two ways:

1) The common/theoretical way: Matlab is not significantly faster, you are just doing the benchmark wrong.

2) The realistic way: for this stuff, Matlab is faster in practice because languages such as C++ are too easily used in ineffective ways.
MATLAB uses a highly optimized implementation of LAPACK from Intel known as the Intel Math Kernel Library (Intel MKL), specifically the dgemm function. This library takes advantage of processor features including SIMD instructions and multi-core processors. They don't document which specific algorithm they use. If you were to call the Intel MKL from C++, you should see similar performance.

I am not sure what library MATLAB uses for GPU multiplication, but probably something like NVIDIA cuBLAS.
Because MATLAB is a programming language that was first developed for numerical linear algebra (matrix manipulations), and it has libraries especially developed for matrix multiplication. And now MATLAB can additionally use GPUs (graphics processing units) for this.

And if we look at your computation results (the table in the question), then we can see that MATLAB is not the only one that is so fast in matrix multiplication: CUDA C (NVIDIA's programming language) has even better results than MATLAB. CUDA C also has libraries especially developed for matrix multiplication, and it uses GPUs.

Short history of MATLAB

What is CUDA C

CUDA C also uses libraries especially developed for matrix multiplication, like OpenGL (Open Graphics Library). It also uses the GPU and Direct3D (on MS Windows).

Comparing CPU and GPU Execution Speeds

From the introduction to the CUDA C Programming Guide:
Advanced reading
Basic Linear Algebra Subprograms (BLAS)
Anatomy of High-Performance Matrix Multiplication, by Kazushige Goto and Robert A. Van De Geijn

Some interesting facts
The sharp contrast is not only due to Matlab's amazing optimization (as discussed by many other answers already), but also to the way you formulated your matrix as an object.

It seems like you made your matrix a list of lists? A list of lists contains pointers to lists, which then contain your matrix elements. The locations of the contained lists are assigned arbitrarily. As you are looping over your first index (the row number?), the time of memory access is very significant. In comparison, why don't you try implementing the matrix as a single list/vector using the following method?
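A minimal sketch of that single-block layout (assuming float elements and the question's rozmer as the matrix dimension):

#include <vector>

// One contiguous allocation per rozmer x rozmer matrix;
// element (i, j) lives at flat index i * rozmer + j.
std::vector<float> matice1(rozmer * rozmer);
std::vector<float> matice2(rozmer * rozmer);
std::vector<float> matice3(rozmer * rozmer);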
And then the multiplication itself, with the same triple-loop algorithm but flat indexing:
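(again just a sketch, reusing the question's names)

for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        float temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            // Single allocation: no pointer chasing between row arrays.
            temp += matice1[j * rozmer + m] * matice2[m * rozmer + k];
        }
        matice3[j * rozmer + k] = temp;
    }
}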
The same multiplication algorithm should be used so that the number of flops is the same (n^3 for square matrices of size n).

I'm asking you to time it so that the result is comparable to what you had earlier (on the same machine). With the comparison, you will see exactly how significant memory access time can be!
It's slow in C++ because you are not using multithreading. Essentially, if A = B C, where they are all matrices, the first row of A can be computed independently from the second row, etc. If A, B, and C are all n by n matrices, you can speed up the multiplication by a factor of up to n^2, since each entry

a_{i,j} = sum_{k} b_{i,k} c_{k,j}

can be computed independently.

If you use, say, Eigen [ http://eigen.tuxfamily.org/dox/GettingStarted.html ], multithreading is built in and the number of threads is adjustable.
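A minimal sketch (assuming Eigen is installed and the code is compiled with OpenMP support, e.g. -fopenmp, so the product runs multithreaded):

#include <Eigen/Dense>
#include <iostream>

int main()
{
    Eigen::setNbThreads(8); // the number of threads is adjustable
    const int n = 2048;
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf C = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf A = B * C;     // calls Eigen's blocked, vectorized GEMM
    std::cout << A(0, 0) << "\n";  // print something to force execution
    return 0;
}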