Parallel vs. serial results

Published on 2025-02-09 07:02:22

There is something weird in the test results below.
I ran a parallel loop and a serial loop and compared them with each other.
I ran the test in one of four configurations: parallel only, serial only, parallel loop first then serial loop, and lastly serial loop first then parallel loop. I coded them as p, s, pfsl, and sfpl respectively.

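The results are in the table below. Times are in milliseconds. For pfsl and sfpl, the first value is for the method that ran first and the second value is for the method that ran after it.

   Test | until | First method (ms) | Second method (ms) |
------- |------:|------------------:|-------------------:|
    p#1 |     4 |             9,472 |                    |
    s#1 |     4 |            13,459 |                    |
    s#2 |     4 |            11,323 |                    |
    p#2 |     4 |             8,854 |                    |
    p#3 |     4 |             9,253 |                    |
    s#3 |     4 |            10,669 |                    |
 pfsl#1 |     5 |             1,421 |              8,299 |
 sfpl#1 |     5 |             1,708 |              6,280 |
 sfpl#1 |     5 |             1,657 |              6,334 |
 pfsl#2 |     5 |             1,400 |              8,191 |
 pfsl#3 |     5 |             1,443 |              8,488 |
 sfpl#3 |     5 |             1,784 |              6,475 |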

Could someone explain this? And why does the second method always take longer?

You can run the same test here (just export the project as it is and run the main method).

Minimal reproducible code:

The serial method

for(int i = somenum; i >= until; i-- ){

    foreach (var nue in nuelist)
    {
        foreach (var path in nue.pathlist)
        {
            foreach (var conn in nue.connlist)
            {
                Func(conn,path); 
            }
        }
    }
}

The parallel method

for(int i = somenum; i >= until; i-- ){

    Parallel.ForEach(nuelist,nue =>
    {
        Parallel.ForEach(nue.pathlist,path=>
        {
            Parallel.ForEach(nue.connlist, conn=>
            {
                Func(conn,path);
            });
        });
    });
}

Inside Path class

Nue firstnue;
string name;
List<Conn> Conns;
public void Func(Conn conn,Path path)
{
    List<Conn> list = new(){conn};
    list.AddRange(path.list);
    _ = new Path(list); 
}
public Path(List<Conn> conns)
{
   //other things
   Conns = new();
   Conns = conns;
   Paths.TryAdd(name,this);
   firstnue.pathlist.Add(this);
   /*
   firstnue is another nue that will be 
   in the next iteration of for loop
   */
}
public static ConcurrentDictionary<string,Path> Paths = new();

Inside Nue class

public ConcurrentBag<Path> pathlist;
public Nue()
{
    pathlist = new ConcurrentBag<Path>();
}

Inside Conn class

Nue From;
Nue To;
public Conn(Nue From, Nue To)
{
    this.From = From;
    this.To = To;
}

In the main method

using System.Diagnostics;
Stopwatch watch = new();
watch.Start();
// for serial results, uncomment the lines below
// serial(somenum: n, until: l);
// watch.Stop();
// long s = watch.ElapsedMilliseconds;

// for parallel results, uncomment the lines below
// parallel(somenum: n, until: l);
// watch.Stop();
// long p = watch.ElapsedMilliseconds;

// for sfpl results, uncomment the lines below
// serial(n, l);
// watch.Stop();
// long sf = watch.ElapsedMilliseconds;
// watch.Restart();
// parallel(n, l);
// long pl = watch.ElapsedMilliseconds;

// for pfsl results, uncomment the lines below
// parallel(n, l);
// watch.Stop();
// long pf = watch.ElapsedMilliseconds;
// watch.Restart();
// serial(n, l);
// long sl = watch.ElapsedMilliseconds;

Comments (1)

谜泪 2025-02-16 07:02:23

My results for large matrix multiplication are very different. On a 6-core machine, a correct Parallel.For can give roughly 9 times better performance: about 6x from the six cores, and the rest attributed to hyperthreading:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22621
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.302
  [Host]     : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT


           Method |      Mean |    Error |   StdDev | Ratio |
----------------- |----------:|---------:|---------:|------:|
       SerialMult | 288.63 ms | 5.736 ms | 8.585 ms |  1.00 |
     ParallelMult |  75.92 ms | 1.311 ms | 1.162 ms |  0.26 |
 ParallelMultTemp |  49.63 ms | 0.873 ms | 0.817 ms |  0.17 |
 ParallelMultVect |  33.04 ms | 0.588 ms | 0.521 ms |  0.11 |

The question's code is incomplete and very hard to read. It also doesn't seem to do anything other than add items to lists. It's impossible to say what the result numbers mean or why they are the way they are.

The linked project mentions neural networks though, so a meaningful real benchmark would be matrix multiplications. In an MLP, the feed-forward stage is all about matrix multiplications.

A bare Stopwatch measurement isn't reliable for benchmarking, as execution can be delayed by other programs. Even executing the same method many times and averaging the results isn't enough, as there may be spikes, warm-up and caching effects. That's why almost every benchmark these days uses the BenchmarkDotNet package. BDN will run a benchmark for as long as it has to, until it gathers enough measurements to provide a statistically meaningful result.
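
To make the warm-up point concrete, here is a rough sketch of the minimum a hand-rolled Stopwatch measurement would need: a few untimed warm-up runs, several timed runs, and a median instead of a single reading. The class and method names (`ManualTiming`, `MedianMilliseconds`) and the default counts are assumptions made for illustration, not part of the question or of BenchmarkDotNet, which does considerably more than this.

```csharp
using System;
using System.Diagnostics;

public static class ManualTiming
{
    // Hypothetical helper (not from the original post): runs `action` a few
    // times untimed as warm-up, then times `runs` further executions and
    // returns the median elapsed time in milliseconds.
    public static double MedianMilliseconds(Action action, int warmups = 3, int runs = 11)
    {
        for (int i = 0; i < warmups; i++)
            action(); // untimed: gives the JIT and the caches a chance to settle

        var samples = new double[runs];
        var watch = new Stopwatch();
        for (int i = 0; i < runs; i++)
        {
            watch.Restart();
            action();
            watch.Stop();
            samples[i] = watch.Elapsed.TotalMilliseconds;
        }

        Array.Sort(samples);
        return samples[runs / 2]; // middle element; `runs` is assumed to be odd
    }
}
```

It could be called as `ManualTiming.MedianMilliseconds(() => SomeMethod())`, where `SomeMethod` stands for whatever is being measured. Even so, this still runs everything in one process and in a fixed order, which is one reason to prefer BenchmarkDotNet.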

The multiplication code was borrowed from this article.

The serial multiplication method is:

static double[,] MatrixProduct(double[,] matrixA,
    double[,] matrixB)
{
    int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
    int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
    if (aCols != bRows)
        throw new Exception("xxx");

    double[,] result = new double[aRows, bCols];

    for (int i = 0; i < aRows; ++i) // each row of A
    for (int j = 0; j < bCols; ++j) // each col of B
    for (int k = 0; k < aCols; ++k) // could use k < bRows
        result[i,j] += matrixA[i,k] * matrixB[k,j];
    return result;
}

A naive parallel version simply replaces the outer loop with Parallel.For:

static double[,] MatrixProductP(double[,] matrixA,
    double[,] matrixB)
{
    int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
    int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
    if (aCols != bRows)
        throw new Exception("xxx");

    double[,] result = new double[aRows, bCols];

    Parallel.For(0, aRows, i =>
    {
        // each row of A
        for (int j = 0; j < bCols; ++j) // each col of B
        for (int k = 0; k < aCols; ++k) // could use k < bRows
            result[i, j] += matrixA[i, k] * matrixB[k, j];
    });
    return result;
}

A slight tweak is to use a temporary variable for the inner loop instead of writing directly to the result array:

static double[,] MatrixProductPTemp(double[,] matrixA,
    double[,] matrixB)
{
    int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
    int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
    if (aCols != bRows)
        throw new Exception("xxx");

    double[,] result = new double[aRows, bCols];

    Parallel.For(0, aRows, i =>
    {
        // each row of A
        for (int j = 0; j < bCols; ++j) // each col of B
        {
            double temp=0;
            for (int k = 0; k < aCols; ++k) // could use k < bRows
                temp += matrixA[i, k] * matrixB[k, j];
            result[i, j] = temp;
        }
    });
    return result;
} 

Going even further, the row from matrixA is copied into a local vector at the start of each parallel loop, to reduce even further the chance of cache misses and non-local access:

static double[,] MatrixProductPVect(double[,] matrixA,
    double[,] matrixB)
{
    int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
    int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
    if (aCols != bRows)
        throw new Exception("xxx");

    double[,] result = new double[aRows, bCols];

    Parallel.For(0, aRows, i =>
    {
        double[] row = new double[aCols];
        System.Buffer.BlockCopy(matrixA, i*aCols*sizeof(double), row, 0, aCols*sizeof(double));

        for (int j = 0; j < bCols; ++j) // each col of B
        {
            double temp=0;
            
            for (int k = 0; k < aCols; ++k) // could use k < bRows
                temp += row[k] * matrixB[k, j];
            result[i, j] = temp;
        }
    });
    return result;
}
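
The byte-offset arithmetic in the Buffer.BlockCopy call is easy to get wrong, so here is a small isolated sketch of just the row copy. The names `RowCopyDemo` and `CopyRow` are made up for this illustration. It relies on two facts: a rectangular `double[,]` is stored row by row, and Buffer.BlockCopy takes its offsets and count in bytes rather than in elements.

```csharp
using System;

public static class RowCopyDemo
{
    // Hypothetical helper: copies row `i` of a rectangular double[,]
    // into a new one-dimensional array.
    public static double[] CopyRow(double[,] matrix, int i)
    {
        int cols = matrix.GetLength(1);
        double[] row = new double[cols];
        // Source offset and length are given in bytes, hence the sizeof(double) factors.
        Buffer.BlockCopy(matrix, i * cols * sizeof(double), row, 0, cols * sizeof(double));
        return row;
    }
}
```

For a matrix `{ { 1, 2, 3 }, { 4, 5, 6 } }`, `CopyRow(matrix, 1)` yields the second row, 4, 5, 6.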

To benchmark matrix multiplication, the following .NET 6 code is used. This is the whole Program.cs file:

// See https://aka.ms/new-console-template for more information

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

var summary = BenchmarkRunner.Run(typeof(Program).Assembly);

[MarkdownExporterAttribute.StackOverflow]
public class SerialVsParallel
{
    private const int L = 500;
    private const int N = 1000;
    private const int M = 200;

    private double[,] _matrixA;
    private double[,] _matrixB;
    
    public SerialVsParallel()
    {
        
        _matrixA = new double[L,M];
        _matrixB = new double[M,N];

        var random = new Random(42);
        for (var i = 0; i < L; i++)
        {
            for (var j = 0; j < M; j++)
            {
                _matrixA[i, j] = random.NextDouble();
            }
        }
        for (var i = 0; i < M; i++)
        {
            for (var j = 0; j < N; j++)
            {
                _matrixB[i, j] = random.NextDouble();
            }
        }
    }

    static double[,] MatrixProduct(double[,] matrixA,
        double[,] matrixB)
    {
        int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
        int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
        if (aCols != bRows)
            throw new Exception("xxx");

        double[,] result = new double[aRows, bCols];

        for (int i = 0; i < aRows; ++i) // each row of A
        for (int j = 0; j < bCols; ++j) // each col of B
        for (int k = 0; k < aCols; ++k) // could use k < bRows
            result[i,j] += matrixA[i,k] * matrixB[k,j];
        return result;
    }
    
    static double[,] MatrixProductP(double[,] matrixA,
        double[,] matrixB)
    {
        int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
        int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
        if (aCols != bRows)
            throw new Exception("xxx");

        double[,] result = new double[aRows, bCols];

        Parallel.For(0, aRows, i =>
        {
            // each row of A
            for (int j = 0; j < bCols; ++j) // each col of B
            for (int k = 0; k < aCols; ++k) // could use k < bRows
                result[i, j] += matrixA[i, k] * matrixB[k, j];
        });
        return result;
    }

    static double[,] MatrixProductPTemp(double[,] matrixA,
        double[,] matrixB)
    {
        int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
        int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
        if (aCols != bRows)
            throw new Exception("xxx");

        double[,] result = new double[aRows, bCols];

        Parallel.For(0, aRows, i =>
        {
            // each row of A
            for (int j = 0; j < bCols; ++j) // each col of B
            {
                double temp=0;
                for (int k = 0; k < aCols; ++k) // could use k < bRows
                    temp += matrixA[i, k] * matrixB[k, j];
                result[i, j] = temp;
            }
        });
        return result;
    } 
    
    static double[,] MatrixProductPVect(double[,] matrixA,
        double[,] matrixB)
    {
        int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
        int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
        if (aCols != bRows)
            throw new Exception("xxx");

        double[,] result = new double[aRows, bCols];

        Parallel.For(0, aRows, i =>
        {
            double[] row = new double[aCols];
            System.Buffer.BlockCopy(matrixA, i*aCols*sizeof(double), row, 0, aCols*sizeof(double));
            // each row of A
            for (int j = 0; j < bCols; ++j) // each col of B
            {
                double temp=0;
                
                for (int k = 0; k < aCols; ++k) // could use k < bRows
                    temp += row[k] * matrixB[k, j];
                result[i, j] = temp;
            }
        });
        return result;
    }
    

    [Benchmark(Baseline = true)]
    public double[,] SerialMult() => MatrixProduct(_matrixA,_matrixB);

    [Benchmark]
    public double[,] ParallelMult() => MatrixProductP(_matrixA,_matrixB);
    
    [Benchmark]
    public double[,] ParallelMultTemp() => MatrixProductPTemp(_matrixA,_matrixB);

    [Benchmark]
    public double[,] ParallelMultVect() => MatrixProductPVect(_matrixA,_matrixB);
    
}

Running this produces the following output:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22621
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.302
  [Host]     : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT


           Method |      Mean |    Error |   StdDev | Ratio |
----------------- |----------:|---------:|---------:|------:|
       SerialMult | 288.63 ms | 5.736 ms | 8.585 ms |  1.00 |
     ParallelMult |  75.92 ms | 1.311 ms | 1.162 ms |  0.26 |
 ParallelMultTemp |  49.63 ms | 0.873 ms | 0.817 ms |  0.17 |
 ParallelMultVect |  33.04 ms | 0.588 ms | 0.521 ms |  0.11 |
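
For reference, the speedups implied by the Mean column above work out as follows, rounded to two decimals (the Ratio column is simply the inverse of these values):

```latex
\begin{align*}
\text{ParallelMult:}     \quad & \frac{288.63\ \text{ms}}{75.92\ \text{ms}} \approx 3.80 \\
\text{ParallelMultTemp:} \quad & \frac{288.63\ \text{ms}}{49.63\ \text{ms}} \approx 5.82 \\
\text{ParallelMultVect:} \quad & \frac{288.63\ \text{ms}}{33.04\ \text{ms}} \approx 8.74
\end{align*}
```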

On a 6-core machine, the naive parallel method is about 3.8 times faster than the serial method (75.92 ms vs. 288.63 ms). That's certainly not as fast as it should be. Simply using the temp variable, though, resulted in roughly 5.8x faster execution than the serial method. Copying the row into a local buffer results in about 8.7 times better performance than the serial version.

That's because reading and writing across rows in the inner loop results in a lot of cache misses. Using the temp variable eliminated some of these. Using a copy of the row reduces cache misses even further and allows the processors to take advantage of hyperthreading.
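
The same reasoning suggests one more variation, which is not part of the benchmark above and has not been measured here: transposing matrixB once up front, so that the inner loop reads both operands sequentially instead of striding down a column of matrixB. The sketch below is only an illustration of that idea. The names `TransposedProduct` and `MatrixProductPTransposed` are hypothetical, and whether it beats the row-copy version would have to be checked with BenchmarkDotNet.

```csharp
using System;
using System.Threading.Tasks;

public static class TransposedProduct
{
    // Hypothetical variant (not from the original answer): multiplies A by B
    // using a one-off transposed copy of B.
    public static double[,] MatrixProductPTransposed(double[,] matrixA, double[,] matrixB)
    {
        int aRows = matrixA.GetLength(0), aCols = matrixA.GetLength(1);
        int bRows = matrixB.GetLength(0), bCols = matrixB.GetLength(1);
        if (aCols != bRows)
            throw new ArgumentException("Inner matrix dimensions must match.");

        // Transpose B once so that each column of B becomes a contiguous row of bT.
        double[,] bT = new double[bCols, bRows];
        for (int k = 0; k < bRows; ++k)
            for (int j = 0; j < bCols; ++j)
                bT[j, k] = matrixB[k, j];

        double[,] result = new double[aRows, bCols];
        Parallel.For(0, aRows, i =>
        {
            for (int j = 0; j < bCols; ++j)
            {
                double temp = 0;
                for (int k = 0; k < aCols; ++k)
                    temp += matrixA[i, k] * bT[j, k]; // both operands are read along a row
                result[i, j] = temp;
            }
        });
        return result;
    }
}
```

The transpose costs one extra pass over matrixB and one extra array of the same size, so it would only be expected to pay off when the matrices are large enough for cache behaviour to dominate.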
