Is it worth using a multithreaded BLAS implementation along with multiprocessing in Python?
Suppose I have a 16-core machine and an embarrassingly parallel program. I use lots of numpy dot products and additions of numpy arrays, and if I did not use multiprocessing it would be a no-brainer: make sure numpy is built against a version of BLAS that uses multithreading. However, I am using multiprocessing, and all cores are working hard at all times. In this case, is there any benefit to be had from using a multithreaded BLAS?
Most of the operations are BLAS Level 1, some are Level 2.
Comments (2)
You might need to be a little careful about the assumption that your code is actually using multithreaded BLAS calls. Relatively few numpy operators actually use the underlying BLAS, and relatively few BLAS calls are actually multithreaded. `numpy.dot` uses either BLAS `dot`, `gemv` or `gemm`, depending on the operation, but of those, only `gemm` is usually multithreaded, because there is rarely any performance benefit in threading the O(N) and O(N^2) BLAS calls. If you are limiting yourself to Level 1 and Level 2 BLAS operations, I doubt you are actually using any multithreaded BLAS calls, even if you are using a numpy implementation built with a multithreaded BLAS, like ATLAS or MKL.
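To make the dispatch concrete, here is a minimal sketch of the three shapes of `numpy.dot` call and the BLAS level each one corresponds to. The array size `n` and the variable names are illustrative assumptions, and whether a given call is threaded still depends on the BLAS library your numpy build links against.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512  # illustrative size, not taken from the question

x = rng.standard_normal(n)
y = rng.standard_normal(n)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# vector . vector: Level 1 BLAS (dot), O(N) work
s = np.dot(x, y)

# matrix . vector: Level 2 BLAS (gemv), O(N^2) work
v = np.dot(A, x)

# matrix . matrix: Level 3 BLAS (gemm), O(N^3) work;
# this is the case that is usually multithreaded
C = np.dot(A, B)

# Elementwise addition runs in NumPy's own loops, not in BLAS
z = x + y

# Print which BLAS/LAPACK libraries this NumPy build was linked against
np.show_config()
```

If your workload looks like the first two calls plus additions, the threading capability of the BLAS library is unlikely to matter much either way.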
If you are already using multiprocessing and all cores are at maximum load, there will be very little benefit, if any, from adding threads that will just wait around for a processor.
Depending on your algorithm and what you are doing, one approach may be more beneficial than the other, but that depends heavily on the workload.
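If you want to rule out oversubscription when combining the two, one common approach is to pin the BLAS thread count to one per worker process. The sketch below assumes the linked BLAS honors one of the `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS` or `MKL_NUM_THREADS` environment variables, which depends on the library and how it was built. The `work` and `run_all` functions are hypothetical stand-ins for the real task.

```python
import os

# Assumption: these must be set before NumPy is imported so the BLAS
# library sees them when it loads. Which variable applies depends on
# the BLAS that NumPy is linked against.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import multiprocessing as mp

import numpy as np


def work(seed, n=256):
    """Hypothetical task: one matrix-vector product plus a vector addition."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    x = rng.standard_normal(n)
    y = np.dot(A, x) + x  # Level 2 BLAS call followed by a NumPy addition
    return float(y.sum())


def run_all(seeds, processes=None):
    """Run the tasks in a process pool (defaults to one process per core)."""
    with mp.Pool(processes=processes) as pool:
        return pool.map(work, seeds)


if __name__ == "__main__":
    results = run_all(range(32))
    print(len(results))
```

Timing the same script with and without the environment variables is a simple way to check whether BLAS threading makes any difference for your particular mix of operations.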