How can I improve the efficiency of the standard matrix multiplication algorithm?

Posted 2024-11-27 03:09:20

How can I improve the efficiency of the standard matrix multiplication algorithm?

The main operation involved in this approach is: C[i][j] += A[i][p] * B[p][j]
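
Spelled out, this is the familiar triple loop. Here is a minimal C sketch (the function name, the dimensions m, n, k, and the row-major layout are assumptions for illustration):

    #include <stddef.h>

    /* Standard triple-loop multiply: C += A * B, with A m x k, B k x n,
     * and C m x n, all stored row-major. */
    void matmul_naive(size_t m, size_t n, size_t k,
                      const double *A, const double *B, double *C)
    {
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++)
                for (size_t p = 0; p < k; p++)
                    C[i*n + j] += A[i*k + p] * B[p*n + j];
    }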

What can be done to improve the efficiency of the algorithm?

Comments (5)

扭转时空 2024-12-04 03:09:20

You might want to have a look at using a BLAS (Basic Linear Algebra Subprograms) library; specifically, Intel offers MKL, AMD offers ACML, and there's also the open-source Goto BLAS.

The (dense) matrix-matrix multiply kernel will be a ?GEMM call, where the ? indicates the floating-point type. For example, DGEMM calls the double-precision routine.
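
As a concrete illustration, a DGEMM call through the CBLAS interface might look like the following; the row-major layout and the plain C = A*B setup (alpha = 1, beta = 0) are assumptions for the example:

    #include <cblas.h>   /* link against an optimised BLAS, e.g. MKL */

    /* C = alpha*A*B + beta*C with alpha = 1 and beta = 0, where A is
     * m x k, B is k x n and C is m x n, all stored row-major. */
    void gemm_example(int m, int n, int k,
                      const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0,       /* alpha */
                    A, k,      /* leading dimension of A */
                    B, n,      /* leading dimension of B */
                    0.0,       /* beta */
                    C, n);     /* leading dimension of C */
    }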

Unless you're extremely confident you know what you're doing with low-level optimisations, these libraries will probably offer better performance than something you can code by hand.

If you do want to have a go at coding this yourself then you may want to consider the following:

  1. Use "vector" instructions. SSE, SSE2..4 instructions are widely supported, and some newer CPUs also support AVX instructions.
  2. Nested loop unrolling to maximise the ratio of floating-point operations to load/store operations.
  3. Block-wise algorithms to ensure effective cache use (a sketch follows this list).
  4. Multi-threading.
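
A minimal sketch of point 3, assuming square N x N row-major matrices; the tile size BLOCK is machine-dependent, so the value below is only a placeholder. Hoisting the reused element of A out of the inner loop also improves the ratio of floating-point operations to loads, in the spirit of point 2:

    #include <stddef.h>

    #define BLOCK 64   /* tile edge; tune for your cache sizes (placeholder value) */

    /* Cache-blocked C += A * B for N x N row-major matrices. Working on
     * BLOCK x BLOCK tiles keeps the active parts of A, B and C in cache. */
    void matmul_blocked(size_t N, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < N; ii += BLOCK)
            for (size_t pp = 0; pp < N; pp += BLOCK)
                for (size_t jj = 0; jj < N; jj += BLOCK) {
                    size_t imax = ii + BLOCK < N ? ii + BLOCK : N;
                    size_t pmax = pp + BLOCK < N ? pp + BLOCK : N;
                    size_t jmax = jj + BLOCK < N ? jj + BLOCK : N;
                    for (size_t i = ii; i < imax; i++)
                        for (size_t p = pp; p < pmax; p++) {
                            double a = A[i*N + p];   /* reused across the whole j loop */
                            for (size_t j = jj; j < jmax; j++)
                                C[i*N + j] += a * B[p*N + j];
                        }
                }
    }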

This reference might give you an idea of the current state of things:

K. Goto and R. van de Geijn, "High-performance implementation of the level-3 BLAS", ACM Transactions on Mathematical Software, 2008.

Hope this helps.

分分钟 2024-12-04 03:09:20
  1. Cache blocking - making sure you're properly using and reusing values in the cache
  2. Better algorithm - the "by-definition" way to multiply matrices is not optimal; take a look at Strassen's algorithm
  3. Parallelization - if your machine has more than one core and/or processor, you can divide and conquer (see the sketch after this list)
  4. SIMD - take advantage of SSE vector instructions in modern CPU architectures
  5. GPGPU - modern GPUs are optimized to do just this sort of thing. Look into CUDA and OpenCL.
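
As a minimal sketch of point 3, the outermost loop of the standard algorithm splits cleanly across cores with OpenMP (compile with -fopenmp or your compiler's equivalent); square N x N row-major matrices are an assumption here, and no locking is needed because each thread writes its own rows of C:

    /* Parallel C += A * B for N x N row-major matrices. Rows of C are
     * independent, so the outer loop is divided among the threads. */
    void matmul_parallel(long N, const double *A, const double *B, double *C)
    {
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            for (long p = 0; p < N; p++) {
                double a = A[i*N + p];
                for (long j = 0; j < N; j++)
                    C[i*N + j] += a * B[p*N + j];
            }
    }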

Note that using these methods does not guarantee better performance. A lot of tuning is required to get a significant speedup. A lot of money goes into figuring out how to multiply matrices quickly, so there is no shortage of journal articles on the subject.

哭泣的笑容 2024-12-04 03:09:20

I would suggest reading Chapter 1 of Golub and Van Loan's Matrix Computations, which addresses this exact question.

腻橙味 2024-12-04 03:09:20

If the question concerns multiple matrix multiplications - M1 x M2 x ... x Mn - then there is another optimization technique based on dynamic programming, which is sort of another ball game. Note that this doesn't apply to improving the efficiency of multiplying two matrices; however, if you are multiplying three or more matrices in a pairwise fashion, then you can optimize at a higher level by choosing the order (parenthesization) in which the pairwise products are performed. Just thought I'd throw this answer on the heap to round out the information.
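
For illustration, here is a minimal C sketch of that technique, the classic matrix-chain ordering dynamic program; the chain of four matrices in main() is a made-up example:

    #include <stdio.h>
    #include <limits.h>

    #define MAXN 32

    /* Given n matrices where matrix i is dims[i-1] x dims[i], return the
     * minimum number of scalar multiplications to compute M1 x M2 x ... x Mn. */
    long matrix_chain_cost(const long dims[], int n)
    {
        static long cost[MAXN][MAXN];        /* cost[i][j]: best cost for Mi..Mj */
        for (int i = 1; i <= n; i++)
            cost[i][i] = 0;
        for (int len = 2; len <= n; len++)   /* subchain length */
            for (int i = 1; i + len - 1 <= n; i++) {
                int j = i + len - 1;
                cost[i][j] = LONG_MAX;
                for (int s = i; s < j; s++) {   /* split: (Mi..Ms)(Ms+1..Mj) */
                    long c = cost[i][s] + cost[s+1][j]
                           + dims[i-1] * dims[s] * dims[j];
                    if (c < cost[i][j])
                        cost[i][j] = c;
                }
            }
        return cost[1][n];
    }

    int main(void)
    {
        long dims[] = {10, 30, 5, 60, 8};    /* M1: 10x30, M2: 30x5, M3: 5x60, M4: 60x8 */
        printf("minimum scalar multiplications: %ld\n", matrix_chain_cost(dims, 4));
        return 0;
    }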

过期情话 2024-12-04 03:09:20

Well, there's Strassen's algorithm, which, depending on the size of your matrices, is ever so slightly faster than the standard algorithm that you listed. There are of course even faster algorithms, but they aren't so simple to implement.

The standard algorithm is O(N^3),
Strassen's algorithm is O(N^2.807),
and Coppersmith-Winograd is O(N^2.376).
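
To see where that exponent comes from: split each matrix into four half-size blocks. The textbook 2x2 block product takes 8 block multiplications, while Strassen's scheme gets by with the 7 products below (the standard textbook form):

    M1 = (A11 + A22)(B11 + B22)
    M2 = (A21 + A22) B11
    M3 = A11 (B12 - B22)
    M4 = A22 (B21 - B11)
    M5 = (A11 + A12) B22
    M6 = (A21 - A11)(B11 + B12)
    M7 = (A12 - A22)(B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

Applied recursively, this gives T(N) = 7 T(N/2) + O(N^2), which solves to O(N^log2(7)) = O(N^2.807).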
