MATLAB matrix multiplication vs. a for loop over each column

Posted 2024-10-15 22:03:27

When multiplying two matrices, I tried the following two options:

1)

res = X*A;

2)

res = zeros(size(X,1), size(A,2));   % preallocate the result
for i = 1:size(A,2)
    res(:,i) = X*A(:,i);             % column i of the product
end

I preallocated memory for res in both cases, and surprisingly, option 2 turned out to be faster.

Can someone explain why?

Edit: I tried the following benchmark:

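K = 10000;                 % number of trials
clear t1 t2
t1 = zeros(K,1);
t2 = zeros(K,1);

% Option 1: time one matrix-matrix product per trial
for k = 1:K
    clear res
    x = rand(100,100);
    a = rand(100,100);
    tic
    res = x*a;
    t1(k) = toc;
end

% Option 2: time the column-by-column loop per trial
for k = 1:K
    clear res2
    res2 = zeros(100,100);   % preallocate outside the timed region
    x = rand(100,100);
    a = rand(100,100);
    tic
    for i = 1:100
        res2(:,i) = x*a(:,i);
    end
    t2(k) = toc;
end

fprintf('mean t1 = %g s, mean t2 = %g s\n', mean(t1), mean(t2));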
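Before comparing speed, it is worth confirming that the two options compute the same thing. This is a minimal sketch; the sizes (100×80 times 80×60) are arbitrary choices for illustration, and the comparison uses a tolerance because the two forms may round differently.

```matlab
% Sketch: confirm that the two options compute the same product.
% Sizes are arbitrary, chosen only for illustration.
X = rand(100, 80);
A = rand(80, 60);

% Option 1: a single matrix-matrix product
res1 = X*A;

% Option 2: one matrix-vector product per column, into a preallocated result
res2 = zeros(size(X,1), size(A,2));
for i = 1:size(A,2)
    res2(:,i) = X*A(:,i);
end

% The results may differ by floating-point rounding, so compare with a tolerance
maxdiff = max(abs(res1(:) - res2(:)));
fprintf('max abs difference: %g\n', maxdiff);
```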


Comments (4)

地狱即天堂 2024-10-22 22:03:27

I ran both versions in a loop 1000 times. On average (but not always), the first, vectorized version was 3-4 times faster. I cleared the result variables and preallocated before starting the timer.

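x = rand(100,100);
a = rand(100,100);

K = 1000;                   % number of timed repetitions
clear t1 t2
t1 = zeros(K,1);
t2 = zeros(K,1);

% Vectorized: one matrix-matrix product per repetition
for k = 1:K
    clear res
    tic
    res = x*a;
    t1(k) = toc;
end

% Column loop: one matrix-vector product per column
for k = 1:K
    clear res2
    res2 = zeros(100,100);  % preallocate before starting the timer
    tic
    for i = 1:100
        res2(:,i) = x*a(:,i);
    end
    t2(k) = toc;
end

fprintf('mean(t2) / mean(t1) = %g\n', mean(t2) / mean(t1));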

So never draw a timing conclusion from a single run.
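One way to act on this is to repeat each measurement many times and summarize with the median, which is less sensitive to occasional outliers than a single run or the mean. This is a minimal sketch, not the answer's own code: it interleaves the two methods inside one loop, and the matrix size `n` and repetition count `K` are assumed values.

```matlab
% Sketch: repeat both measurements many times and compare medians.
n = 100;          % matrix size (assumed, matches the question)
K = 200;          % number of repetitions (arbitrary)
x = rand(n);
a = rand(n);
t1 = zeros(K,1);
t2 = zeros(K,1);
res2 = zeros(n);  % preallocated once, outside the timed regions
for k = 1:K
    % Method 1: single matrix-matrix product
    tic;
    res1 = x*a;
    t1(k) = toc;

    % Method 2: one matrix-vector product per column
    tic;
    for i = 1:n
        res2(:,i) = x*a(:,i);
    end
    t2(k) = toc;
end
fprintf('median X*A:          %g s\n', median(t1));
fprintf('median column loop:  %g s\n', median(t2));
fprintf('ratio (loop / X*A):  %g\n', median(t2) / median(t1));
```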

唐婉 2024-10-22 22:03:27

I believe I can chime in on the variation in timings between the two methods, as well as why people are getting different relative speeds.

Before Matlab R2008a (or a release near it), for loops carried a major performance penalty because the interpreter (the layer between the human-readable script and the lower-level implementation of the code) had to re-interpret the loop body on every iteration.

Since that release, the interpreter has gotten progressively better, so in a modern version of Matlab it can look at your code, say "Ah ha! I know what he is doing, let me optimize it just a bit", and avoid the penalty it would otherwise pay for re-interpreting the code.

I would expect the two ways of performing the matrix multiplication to take the same amount of time. If the for loop implementation runs faster, it is because of some detail of the interpreter's optimizations that we mere mortals are not privy to.
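As a rough check on the expectation that both forms take the same time, here is an operation count, assuming X is m×n and A is n×p (the symbols m, n, p are introduced here for illustration), counted in multiply-add pairs:

$$
\mathrm{cost}(XA) = m\,n\,p, \qquad
\mathrm{cost}\bigl(XA_{:,1},\,\dots,\,XA_{:,p}\bigr) = \sum_{i=1}^{p} m\,n = m\,n\,p .
$$

For the 100×100 matrices in the question that is $10^6$ multiply-adds either way, so any timing gap presumably comes from how the work is executed (memory access patterns, per-call overhead, how the underlying library organizes the computation) rather than from the amount of arithmetic.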

One broad lesson to take from this is that not all versions are equal. I work on a couple of bleeding-edge cases using two Matlab add-ons, SimBiology and the Parallel Computing Toolbox, both of which (especially when used together) vary between versions in execution speed and, from time to time, in stability. As such, I keep the three most recent releases of Matlab installed, test that I get the same answers from each version, and occasionally roll back to an earlier version if I find issues with some features. This is probably overkill for most people, but it gives you an idea of how much versions can differ.

Hope this helps.

Edits:

To clarify, code vectorization is still important. But given a script like:

x_slow = zeros(1,1e5);
x_fast = zeros(1,1e5);

tic;
for i=1:1e5
    x_slow(i) = log(i);
end
time_slow = toc; % took 0.0132 seconds on my machine

tic;
x_fast = log(1:1e5);
time_fast = toc; % took 0.0055 seconds on my machine
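Whatever the speed difference, the loop and the vectorized call should produce the same values; only the way the work is dispatched differs. This is a minimal sketch checking that, with a tolerance in case the vectorized `log` rounds slightly differently. It also drops the preallocation of `x_fast`, which is not needed because `log` returns a new array.

```matlab
% Sketch: the element-by-element loop and the vectorized call should agree.
N = 1e5;
x_slow = zeros(1, N);          % preallocated before the loop fills it
for i = 1:N
    x_slow(i) = log(i);
end
x_fast = log(1:N);             % no preallocation needed: log returns a new array
maxdiff = max(abs(x_slow - x_fast));
fprintf('max abs difference: %g\n', maxdiff);
```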

The gap between time_slow and time_fast has narrowed over the past several versions thanks to improvements in the interpreter. I believe the example I saw compared 2000a with 2008b, but I'm going from memory.

Oli and Yuk have pointed out something else that might be going on: there is often a difference between time_1 and time_2 in:

tic; x = log(1:1e5); time_1 = toc
tic; x = log(1:1e5); time_2 = toc

So timing a million evaluations rather than a single one is valuable, because the result of a single run depends on where x is in memory (in cache or not).
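A minimal sketch of that effect: time the same statement several times back to back and treat the first timing as a warm-up. The number of runs is an arbitrary choice, and the first run may (not must) include one-off costs such as memory allocation or cold caches.

```matlab
% Sketch: repeated back-to-back timings of the same statement.
% The first timing may include one-off costs, so it is reported separately.
nRuns = 20;                    % number of back-to-back timings (arbitrary)
t = zeros(nRuns, 1);
for r = 1:nRuns
    tic;
    x = log(1:1e5);
    t(r) = toc;
end
fprintf('first run:            %g s\n', t(1));
fprintf('median of later runs: %g s\n', median(t(2:end)));
```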

Hope this helps again.

洒一地阳光 2024-10-22 22:03:27

This may well be an effect of caching: a is already in the cache by the time you run the second version, so that version has an advantage. Try creating an independent set of inputs for each version to make the comparison fair. Also, it's probably better to time, e.g., 1 million iterations rather than one, to average out typical variations due to outside effects.
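A minimal sketch of that suggestion, assuming 100×100 inputs and 200 trials (both arbitrary): each method gets its own freshly generated inputs, and the order alternates between trials so that neither method consistently runs right after the other has touched memory.

```matlab
% Sketch: independent inputs per method, alternating order between trials.
n = 100;                 % matrix size (assumed, matches the question)
K = 200;                 % number of trials (arbitrary)
t1 = zeros(K,1);
t2 = zeros(K,1);
for k = 1:K
    x1 = rand(n); a1 = rand(n);   % inputs for X*A
    x2 = rand(n); a2 = rand(n);   % independent inputs for the column loop
    res2 = zeros(n);              % preallocated outside the timed region
    for step = 1:2
        % odd trials run X*A first, even trials run the column loop first
        runMatrix = (step == 1) == (mod(k, 2) == 1);
        if runMatrix
            tic; res1 = x1*a1; t1(k) = toc;
        else
            tic;
            for i = 1:n
                res2(:,i) = x2*a2(:,i);
            end
            t2(k) = toc;
        end
    end
end
fprintf('median X*A: %g s, median column loop: %g s\n', median(t1), median(t2));
```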

与往事干杯 2024-10-22 22:03:27

It looks to me like you are not multiplying the matrices properly: you need to sum all the products of the ith row of X and the jth column of A. That might be the reason.
Look here to see how it's done.
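For reference, here is a minimal sketch of matrix multiplication written out from that definition, res(i,j) = sum over k of X(i,k)*A(k,j), checked against X*A. The small sizes are arbitrary. It also compares one column against X*A(:,j), the matrix-vector form used in option 2 of the question.

```matlab
% Sketch: matrix multiplication from its definition, compared with X*A.
X = rand(4, 3);
A = rand(3, 5);
[m, n] = size(X);
p = size(A, 2);
res = zeros(m, p);
for i = 1:m
    for j = 1:p
        s = 0;
        for k = 1:n
            s = s + X(i,k)*A(k,j);   % row i of X times column j of A
        end
        res(i,j) = s;
    end
end
colj = X*A(:,2);                     % matrix-vector product for column 2
fprintf('max abs difference vs X*A: %g\n', max(max(abs(res - X*A))));
```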
