加速 Numpy 数组上的循环

发布于 2024-11-06 20:05:32 字数 359 浏览 1 评论 0原文

在我的代码中,我有一个 for 循环,它对多维 numpy 数组进行索引,并使用每次迭代时获得的子数组进行一些操作。看起来像这样

for sub in Arr:
  #do stuff using sub

现在使用 sub 完成的内容已完全矢量化,因此它应该是高效的。另一方面,这个循环迭代大约 ~10^5 次,是瓶颈。你认为我会通过将这部分卸载到 C 来获得改进吗?我有点不愿意这样做,因为 do stuff using sub 使用广播、切片、智能索引技巧,这些技巧写起来很乏味我也欢迎关于在将计算卸载到 C 时如何处理广播、切片、智能索引的想法和建议。

In my code I have for loop that indexes over a multidimensional numpy array and does some operation using the sub-array that is obtained at each iteration. It looks like this

for sub in Arr:
  #do stuff using sub

Now the stuff that is done using sub is fully vectorized, so it should be efficient. On the other hand this loop iterates about ~10^5 times and is the bottleneck. Do you think I will get an improvement by offloading this part to C. I am somewhat reluctant to do so because the do stuff using sub uses broadcasting, slicing, smart-indexing tricks that would be tedious to write in plain C. I would also welcome thoughts and suggestions about how to deal with broadcasting, slicing, smart-indexing when offloading computation to C.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

热风软妹 2024-11-13 20:05:33

你可以看看scipy.weave。您可以使用 scipy.weave.blitz 将表达式透明地转换为 C++ 代码并运行它。它将自动处理切片并消除临时变量,但您声称 for 循环的主体不会创建临时变量,因此您的情况可能会有所不同。

但是,如果您想用更高效的东西替换整个 for 循环,那么您可以使用 scipy.inline 。缺点是您必须编写 C++ 代码。这应该不会太难,因为您可以使用与 numpy 数组表达式非常接近的 Blitz++ 语法。直接支持切片,但不支持广播。

有两种解决方法:

  1. 是使用 numpy-C api 并使用多维迭代器。他们透明地处理广播。但是,您正在调用 Numpy 运行时,因此可能会产生一些开销。另一种选择,可能也是更简单的选择是使用通常的矩阵表示法进行广播。广播操作可以写成全为向量的外积。好处是 Blitz++ 实际上不会在内存中创建这个临时广播数组,它会弄清楚如何将其包装到等效循环中。

  2. 对于第二个选项,请查看http://www.oonumerics。 org/blitz/docs/blitz_3.html#SEC88 用于索引占位符。只要你的矩阵维数小于 11 就可以了。此链接显示了如何使用它们来形成外部产品 http://www.oonumerics .org/blitz/docs/blitz_3.html#SEC99(搜索外部产品以转到文档的相关部分)。

San you can take a look at scipy.weave. You can use scipy.weave.blitz to transparently translate your expression into C++ code and run it. It will handle slicing automatically and get rid of temporaries, but you claim that the body of your for loop does not create temporaries so your milage may vary.

However if you want to replace your entire for loop with something more efficient then you could make use of scipy.inline. The drawback is that you have to write C++ code. This should not be too hard because you can use Blitz++ syntax which is very close to numpy array expressions. Slicing is directly supported, broadcasting however is not.

There are two work arounds:

  1. is to use the numpy-C api and use multi-dimensional iterators. They transparently handle broadcasting. However you are invoking the Numpy runtime so there might be some overhead. The other option, and possibly the simpler option is to use the usual matrix notation for broadcasting. Broadcast operations can be written as outer-products with vector of all ones. The good thing is that Blitz++ will not actually create this temporary broadcasted arrays in memory, it will figure out how to wrap it into an equivalent loop.

  2. For the second option take a look at http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC88 for index place holders. As long as your matrix has less than 11 dimensions you are fine. This link shows how they can be used to form outer products http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC99 (search for outer products to go to the relevant part of the document).

把时间冻结 2024-11-13 20:05:33

除了使用 Cython 之外,您还可以使用 Fortran 编写瓶颈部分。然后使用f2py将其编译为Python .pyd文件。

Besides using Cython, you can write the bottle-neck part(s) in Fortran. Then use f2py to compile it to Python .pyd file.

触ぅ动初心 2024-11-13 20:05:32

如果您无法“矢量化”整个操作并且循环确实是瓶颈,那么我强烈建议使用 Cython。我最近一直在尝试它,它使用起来很简单,并且与 numpy 有一个不错的界面。对于像 langevin 积分器这样的东西,我发现比 numpy 中的一个不错的实现有 115 倍的加速。请参阅此处的文档:

http://docs.cython.org/src/tutorial/numpy.html

我还建议您查看以下论文

您可能会看到令人满意的加速效果输入输入数组和循环计数器,但如果您想充分利用 cython 的潜力,那么您将必须对等效的广播进行硬编码。

If you can't 'vectorize' the entire operation and looping is indeed the bottleneck, then I highly recommend using Cython. I've been dabbling around with it recently and it is straightforward to work with and has a decent interface with numpy. For something like a langevin integrator I saw a 115x speedup over a decent implementation in numpy. See the documentation here:

http://docs.cython.org/src/tutorial/numpy.html

and I also recommend looking at the following paper

You may see satisfactory speedups by just typing the input array and the loop counter, but if you want to leverage the full potential of cython, then you are going to have to hardcode the equivalent broadcasting.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文