C# .NET multithreading task-parallel-library

How do I multiply two matrices using Parallel.ForEach?

Posted 2025-02-01 17:55:03


There is a function that multiplies two matrices in the usual way:

public IMatrix Multiply(IMatrix m1, IMatrix m2)
{
    var resultMatrix = new Matrix(m1.RowCount, m2.ColCount);
    for (long i = 0; i < m1.RowCount; i++)
    {
        for (byte j = 0; j < m2.ColCount; j++)
        {
            long sum = 0;
            for (byte k = 0; k < m1.ColCount; k++)
            {
                sum += m1.GetElement(i, k) * m2.GetElement(k, j);
            }

            resultMatrix.SetElement(i, j, sum);
        }
    }

    return resultMatrix;
}

This function should be rewritten using Parallel.ForEach. I tried it this way:

public IMatrix Multiply(IMatrix m1, IMatrix m2)
{
    // todo: feel free to add your code here

    var resultMatrix = new Matrix(m1.RowCount, m2.ColCount);

    Parallel.ForEach(m1.RowCount, row =>
    {
        for (byte j = 0; j < m2.ColCount; j++)
        {
            long sum = 0;
            for (byte k = 0; k < m1.ColCount; k++)
            {
                sum += m1.GetElement(row, k) * m2.GetElement(k, j);
            }

            resultMatrix.SetElement(row, j, sum);
        }
    });

    return resultMatrix;
}

But there is an error with the type argument in the loop. How can I fix it?


陈甜 2025-02-08 17:55:03


Just use Parallel.For instead of Parallel.ForEach. That should let you keep exactly the same body as the non-parallel version:

Parallel.For(0, m1.RowCount, i =>
{
    ...
});

Note that only fairly large matrices will benefit from parallelization, so if you are working with 4x4 matrices for graphics, this is not the approach to take.
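To make the Parallel.For suggestion above concrete, here is a minimal, self-contained sketch. The question does not show the IMatrix and Matrix types, so this version assumes plain long[,] arrays as a stand-in; the class name ParallelMultiplyDemo is hypothetical.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelMultiplyDemo
{
    // Assumption: long[,] arrays stand in for the IMatrix/Matrix types from the question.
    public static long[,] Multiply(long[,] m1, long[,] m2)
    {
        int rows = m1.GetLength(0);
        int inner = m1.GetLength(1);
        int cols = m2.GetLength(1);
        if (m2.GetLength(0) != inner)
            throw new ArgumentException("Inner dimensions must match.");

        var result = new long[rows, cols];

        // Each iteration owns one row of the result, so no two threads
        // ever write to the same element and no locking is needed.
        Parallel.For(0, rows, i =>
        {
            for (int j = 0; j < cols; j++)
            {
                long sum = 0;
                for (int k = 0; k < inner; k++)
                {
                    sum += m1[i, k] * m2[k, j];
                }

                result[i, j] = sum;
            }
        });

        return result;
    }

    public static void Main()
    {
        var a = new long[,] { { 1, 2 }, { 3, 4 } };
        var b = new long[,] { { 5, 6 }, { 7, 8 } };
        long[,] c = Multiply(a, b);
        Console.WriteLine($"{c[0, 0]} {c[0, 1]} / {c[1, 0]} {c[1, 1]}");
        // prints "19 22 / 43 50"
    }
}
```

The original attempt fails to compile because Parallel.ForEach expects a collection (an IEnumerable<T>) as its first argument, not a count. If Parallel.ForEach is a hard requirement, passing Enumerable.Range(0, rows) as the source should work with the same body. The sketch also uses int loop counters: by default a byte counter wraps around after 255, so the byte loops in the question would never terminate for matrices with more than 255 columns.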

One problem with multiplying matrices is that the innermost loop walks down a column of one of the matrices, reading one value from each row. This access pattern may be difficult for your processor to cache, causing lots of cache misses. A fairly easy optimization is to copy an entire column into a temporary array and do all the computations that need this column before reading the next one. This makes all memory accesses linear and easy to cache. It does more work overall, but the better cache utilization easily makes it a win. There are even more cache-efficient methods, but their complexity also tends to increase.
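The column-copy idea can be sketched roughly as follows. As before, long[,] arrays are assumed in place of the IMatrix type, and the class name ColumnCopyDemo is hypothetical. The parallel loop here runs over the columns of the result so that each iteration copies its column exactly once.

```csharp
using System;
using System.Threading.Tasks;

public static class ColumnCopyDemo
{
    // Assumption: long[,] arrays stand in for the IMatrix/Matrix types from the question.
    public static long[,] Multiply(long[,] m1, long[,] m2)
    {
        int rows = m1.GetLength(0);
        int inner = m1.GetLength(1);
        int cols = m2.GetLength(1);
        if (m2.GetLength(0) != inner)
            throw new ArgumentException("Inner dimensions must match.");

        var result = new long[rows, cols];

        // One iteration per result column; different iterations write
        // to different columns, so the writes never overlap.
        Parallel.For(0, cols, j =>
        {
            // Copy column j of m2 into a contiguous buffer once.
            var column = new long[inner];
            for (int k = 0; k < inner; k++)
            {
                column[k] = m2[k, j];
            }

            // Reuse the buffer for every row of m1: both operands of the
            // inner loop are now read sequentially.
            for (int i = 0; i < rows; i++)
            {
                long sum = 0;
                for (int k = 0; k < inner; k++)
                {
                    sum += m1[i, k] * column[k];
                }

                result[i, j] = sum;
            }
        });

        return result;
    }
}
```

This version allocates one temporary array per column. Whether the copy pays off depends on matrix size and hardware, so it is worth measuring against the straightforward loop.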

Another optimization would be to use SIMD, but this might require platform-specific code for best performance and will likely involve more work. You might, however, be able to find libraries that are already optimized.
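As one possible illustration of the SIMD idea, the inner dot product can be written with the portable System.Numerics.Vector<T> type rather than platform-specific intrinsics. This is only a sketch: it assumes the row and the column are already available as contiguous long[] arrays (for example after a column copy), the class name SimdDotDemo is hypothetical, and the actual speed-up depends on the hardware and element type.

```csharp
using System;
using System.Numerics;

public static class SimdDotDemo
{
    // Dot product of two equally long arrays, processed Vector<long>.Count
    // elements at a time, with a scalar loop for the leftover tail.
    public static long Dot(long[] row, long[] column)
    {
        if (row.Length != column.Length)
            throw new ArgumentException("Lengths must match.");

        int width = Vector<long>.Count;
        Vector<long> acc = Vector<long>.Zero;
        int k = 0;

        // Vectorized part: multiply element-wise and accumulate per lane.
        for (; k <= row.Length - width; k += width)
        {
            acc += new Vector<long>(row, k) * new Vector<long>(column, k);
        }

        // Add up the lanes of the accumulator.
        long sum = Vector.Dot(acc, Vector<long>.One);

        // Scalar tail for the elements that did not fill a whole vector.
        for (; k < row.Length; k++)
        {
            sum += row[k] * column[k];
        }

        return sum;
    }

    public static void Main()
    {
        long d = Dot(new long[] { 1, 2, 3, 4, 5 }, new long[] { 6, 7, 8, 9, 10 });
        Console.WriteLine(d); // prints "130"
    }
}
```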

But perhaps most importantly, profile your code. It is quite easy for simple things to consume a lot of time. For example, you are using an interface, so you may have a virtual method call for each memory access that cannot be inlined, potentially causing a severe performance penalty compared to direct array access.
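A rough way to check the interface-call concern is to time the same loop through an interface and through a plain array. The sketch below uses Stopwatch for a quick comparison; the IVectorLike interface and the ArrayVector class are hypothetical stand-ins for the IMatrix type in the question, and a real profiler or a benchmarking library would give more trustworthy numbers than a single timed run.

```csharp
using System;
using System.Diagnostics;

// Hypothetical interface, standing in for element access via IMatrix.GetElement.
public interface IVectorLike
{
    long GetElement(int index);
}

public sealed class ArrayVector : IVectorLike
{
    private readonly long[] _data;
    public ArrayVector(long[] data) => _data = data;
    public long GetElement(int index) => _data[index];
}

public static class AccessCostDemo
{
    // Every element read goes through an interface method call.
    public static long SumViaInterface(IVectorLike v, int count)
    {
        long sum = 0;
        for (int i = 0; i < count; i++) sum += v.GetElement(i);
        return sum;
    }

    // Every element read is a direct array access.
    public static long SumDirect(long[] data)
    {
        long sum = 0;
        for (int i = 0; i < data.Length; i++) sum += data[i];
        return sum;
    }

    public static void Main()
    {
        var data = new long[10_000_000];
        for (int i = 0; i < data.Length; i++) data[i] = i;
        IVectorLike wrapped = new ArrayVector(data);

        var sw = Stopwatch.StartNew();
        long viaInterface = SumViaInterface(wrapped, data.Length);
        sw.Stop();
        Console.WriteLine($"interface: {sw.ElapsedMilliseconds} ms (sum {viaInterface})");

        sw.Restart();
        long direct = SumDirect(data);
        sw.Stop();
        Console.WriteLine($"direct:    {sw.ElapsedMilliseconds} ms (sum {direct})");
    }
}
```

The timings vary from run to run and include JIT warm-up, so treat them as a hint about where to look, not as a measurement.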
