加速 Numpy 数组上的循环
在我的代码中,我有一个 for 循环,它对多维 numpy 数组进行索引,并使用每次迭代时获得的子数组进行一些操作。看起来像这样
for sub in Arr:
#do stuff using sub
现在使用 sub
完成的内容已完全矢量化,因此它应该是高效的。另一方面,这个循环迭代大约 ~10^5
次,是瓶颈。你认为我会通过将这部分卸载到 C 来获得改进吗?我有点不愿意这样做,因为 do stuff using sub
使用广播、切片、智能索引技巧,这些技巧写起来很乏味我也欢迎关于在将计算卸载到 C 时如何处理广播、切片、智能索引的想法和建议。
In my code I have for loop that indexes over a multidimensional numpy array and does some operation using the sub-array that is obtained at each iteration. It looks like this
for sub in Arr:
#do stuff using sub
Now the stuff that is done using sub
is fully vectorized, so it should be efficient. On the other hand this loop iterates about ~10^5
times and is the bottleneck. Do you think I will get an improvement by offloading this part to C. I am somewhat reluctant to do so because the do stuff using sub
uses broadcasting, slicing, smart-indexing tricks that would be tedious to write in plain C. I would also welcome thoughts and suggestions about how to deal with broadcasting, slicing, smart-indexing when offloading computation to C.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
你可以看看
scipy.weave
。您可以使用 scipy.weave.blitz 将表达式透明地转换为 C++ 代码并运行它。它将自动处理切片并消除临时变量,但您声称for
循环的主体不会创建临时变量,因此您的情况可能会有所不同。但是,如果您想用更高效的东西替换整个 for 循环,那么您可以使用 scipy.inline 。缺点是您必须编写 C++ 代码。这应该不会太难,因为您可以使用与 numpy 数组表达式非常接近的 Blitz++ 语法。直接支持切片,但不支持广播。
有两种解决方法:
是使用 numpy-C api 并使用多维迭代器。他们透明地处理广播。但是,您正在调用 Numpy 运行时,因此可能会产生一些开销。另一种选择,可能也是更简单的选择是使用通常的矩阵表示法进行广播。广播操作可以写成全为向量的外积。好处是 Blitz++ 实际上不会在内存中创建这个临时广播数组,它会弄清楚如何将其包装到等效循环中。
对于第二个选项,请查看http://www.oonumerics。 org/blitz/docs/blitz_3.html#SEC88 用于索引占位符。只要你的矩阵维数小于 11 就可以了。此链接显示了如何使用它们来形成外部产品 http://www.oonumerics .org/blitz/docs/blitz_3.html#SEC99(搜索外部产品以转到文档的相关部分)。
San you can take a look at
scipy.weave
. You can usescipy.weave.blitz
to transparently translate your expression intoC++
code and run it. It will handle slicing automatically and get rid of temporaries, but you claim that the body of yourfor
loop does not create temporaries so your milage may vary.However if you want to replace your entire for loop with something more efficient then you could make use of
scipy.inline
. The drawback is that you have to writeC++
code. This should not be too hard because you can useBlitz++
syntax which is very close to numpy array expressions. Slicing is directly supported, broadcasting however is not.There are two work arounds:
is to use the numpy-C api and use multi-dimensional iterators. They transparently handle broadcasting. However you are invoking the Numpy runtime so there might be some overhead. The other option, and possibly the simpler option is to use the usual matrix notation for broadcasting. Broadcast operations can be written as outer-products with vector of all ones. The good thing is that
Blitz++
will not actually create this temporary broadcasted arrays in memory, it will figure out how to wrap it into an equivalent loop.For the second option take a look at http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC88 for index place holders. As long as your matrix has less than 11 dimensions you are fine. This link shows how they can be used to form outer products http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC99 (search for outer products to go to the relevant part of the document).
除了使用 Cython 之外,您还可以使用 Fortran 编写瓶颈部分。然后使用f2py将其编译为Python .pyd文件。
Besides using Cython, you can write the bottle-neck part(s) in Fortran. Then use f2py to compile it to Python .pyd file.
如果您无法“矢量化”整个操作并且循环确实是瓶颈,那么我强烈建议使用 Cython。我最近一直在尝试它,它使用起来很简单,并且与 numpy 有一个不错的界面。对于像 langevin 积分器这样的东西,我发现比 numpy 中的一个不错的实现有 115 倍的加速。请参阅此处的文档:
http://docs.cython.org/src/tutorial/numpy.html
我还建议您查看以下论文,
您可能会看到令人满意的加速效果输入输入数组和循环计数器,但如果您想充分利用 cython 的潜力,那么您将必须对等效的广播进行硬编码。
If you can't 'vectorize' the entire operation and looping is indeed the bottleneck, then I highly recommend using Cython. I've been dabbling around with it recently and it is straightforward to work with and has a decent interface with numpy. For something like a langevin integrator I saw a 115x speedup over a decent implementation in numpy. See the documentation here:
http://docs.cython.org/src/tutorial/numpy.html
and I also recommend looking at the following paper
You may see satisfactory speedups by just typing the input array and the loop counter, but if you want to leverage the full potential of cython, then you are going to have to hardcode the equivalent broadcasting.