Rewrite MATLAB code in Python and make it run fast
Here is a nested loop where the indices of the inner loop depend on the index of the outer loop, with the following parameters:
```matlab
f = rand(1,70299)
nech=24*30*24
N=length(f);
xh=(1:nech)/24;
```
In MATLAB:
```matlab
sf2(1:nech)=0.;
sf2vel(1:nech)=0.;
count(1:nech)=0.;
for i=1:nech
    for j=1:N-i-1
        sf2(i)=sf2(i)+(f(j+i)-f(j))^2;
        count(i)=count(i)+1;
    end
    sf2(i)=sf2(i)/count(i);
end
```
In Python:
```python
import numpy as np

def structFunPython(f,N,nech):
    sf2 = np.zeros(N)
    count = np.zeros(N)
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[j]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2
```
With Cython:
```cython
import cython
cimport numpy as np
import numpy as np

def structFun(np.ndarray f, N, nech):
    cdef np.ndarray sf2 = np.zeros(N), count = np.zeros(N)
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[j]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2
```
Execution times:
Matlab: 7.8377 sec
Python: 3651.35 sec
Cython: 3336.21 sec
I have a hard time believing Python and Cython (especially Cython) are that slow for the same computation, so I think I must have made an error in my Python/Cython loops, but I can't see where.
4 Answers
Disclaimer: as @norok2 pointed out in a comment, my approach allocates at least `N*(N-1)/2` doubles due to the use of `pdist`. For your `N = 70299` this means about 18.5 GB of doubles in an array. Other index arrays will have a similar size. So unless some of your use cases have smaller `N`, the vectorised approach in this answer is only feasible if you have a lot of memory.

As others have noted, just translating code from one language to another will not lead to optimal code in both languages. And using Cython alone is no guarantee of speed-up, in the same way that using NumPy alone is no guarantee of speed-up.

norok2's answer nicely shows you how you can use `numba` or something similar to compile your numerical code. This can give you something quite similar to what you have in MATLAB in terms of performance, since MATLAB has a just-in-time (JIT) compiler of its own. There's also wiggle room to optimize your code for compilation, because multiple implementations can end up with wildly different performance.

Anyway, I want to make the point that you can also speed up your code by using advanced features in NumPy and SciPy. In particular, what you want is to compute the pairwise squared distances among a set of 1d points. This is what `scipy.spatial.distance.pdist` can do for you (with the `'sqeuclidean'` squared Euclidean norm). The upside is that it only computes each pairwise distance once (which is great for CPU and memory performance), but the downside is that picking out the contributions that you want to sum up is a bit more cumbersome.

Anyway, here is the code corresponding to your Python implementation (with the fix that the inner loop uses `np.arange(1, N-i)` rather than `np.arange(1, N-i-1)`):
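A minimal sketch of this approach (the function name `structfun_pdist` and the exact index bookkeeping here are illustrative, not the answer's verbatim listing):

```python
import numpy as np
from scipy.spatial.distance import pdist

def structfun_pdist(f, nech):
    N = f.size
    # every pairwise squared distance (f[k] - f[j])**2 with j < k,
    # each computed exactly once; pdist expects a 2d array of points
    dists = pdist(f[:, None], 'sqeuclidean')
    # pdist's condensed output is ordered like np.triu_indices(N, k=1),
    # which lets us recover the pair (j, k) behind each distance
    j, k = np.triu_indices(N, k=1)
    # the loop sums (f[i+j] - f[j])**2 into sf2[i] for j >= 1, so the
    # output index of the pair (j, k) is the lag i = k - j
    mask = (j >= 1) & (k - j < nech)
    inds = (k - j)[mask]
    sf2 = np.zeros(nech)
    np.add.at(sf2, inds, dists[mask])            # accumulate contributions
    counts = np.bincount(inds, minlength=nech)   # contributions per lag
    return sf2 / np.maximum(counts, 1)           # lag 0 receives none here
```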
What happens here is that we:

- compute each pairwise squared distance into `dists`, along with the per-index `counts`;
- for each value in `dists`, work out the 1d output index that it corresponds to (this is the hard part) and store it in `inds`;
- use `np.add.at`
to accumulate each contribution to the appropriate output index.

Here are some timings for `N = 1000`, where `func2()` is the corresponding function from norok2's answer:

The above solution is considerably faster, but of course it's still slower than a fully compiled solution. If you have an issue with additional dependencies or getting llvm installed on your system, this might be a reasonable compromise. The bottom line is that code should be adapted to the language you're trying to optimise it in.
For completeness' sake, here are the implementations I used for the comparison (I slightly changed the signatures, because `N` can be computed from the input array, and I fixed some fencepost errors). With these definitions, all three functions give the same result within machine precision:
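A sketch of what the fixed loopy reference can look like (the `(f, nech)` signature and index convention follow the pdist sketch above; details are illustrative):

```python
def structfun_loops(f, nech):
    # loopy reference with the same index convention as structfun_pdist()
    N = f.size
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(1, N - i):            # fixed inner bound
            sf2[i] += (f[i + j] - f[j]) ** 2
        sf2[i] /= N - i - 1                  # the count is known in advance
    return sf2

# illustrative consistency check
f = np.random.rand(2000)
print(np.allclose(structfun_loops(f, 500), structfun_pdist(f, 500)))  # True
```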
The way I would rewrite the MATLAB code as equivalent Python code could be (beware of indices starting at 1 in MATLAB and at 0 in Python... since I cannot know how this should be adapted without context I went with the simplest approach):
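A minimal sketch of such a translation (the name `func` and the `(f, nech)` signature are assumptions; the loop structure mirrors the question's code):

```python
import numpy as np

def func(f, nech):
    # direct, loopy translation of the MATLAB code with 0-based indices
    N = f.size
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        for j in range(1, N - i):
            sf2[i] += (f[i + j] - f[j]) ** 2
            count[i] += 1
        sf2[i] /= count[i]
    return sf2
```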
Note that `count[i] += 1` is useless, as the final value of `count[i]` is known in advance; actually, the whole `count` array is useless, e.g.:
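A count-free variant along those lines (again only a sketch; the divisor is simply the inner loop's trip count):

```python
def func2(f, nech):
    # same computation as func(), with count replaced by the known
    # number of inner iterations, N - i - 1
    N = f.size
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(1, N - i):
            sf2[i] += (f[i + j] - f[j]) ** 2
        sf2[i] /= N - i - 1
    return sf2
```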
Speed-up

This is a manual case of Numba speed-up, and it is just as easy as adding/using a Numba `@njit` decorator:
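For instance (a sketch assuming `numba` is installed and `func`/`func2` are defined as above):

```python
import numba as nb

# JIT-compile the two pure-Python functions above; their code is unchanged
func_nb = nb.njit(func)
func2_nb = nb.njit(func2)
```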
Now, `func`, `func_nb`, `func2` and `func2_nb` all perform the same computation:
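For example, one can check (illustrative sizes):

```python
f = np.random.rand(2000)
nech = 500
res = func(f, nech)
for other in (func_nb, func2, func2_nb):
    assert np.allclose(res, other(f, nech))
```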
If you really need to stick to Cython, here is an implementation based on `func2`:
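A sketch of that structure (the split into a Python-visible wrapper and a C-level `_func2_cy` core follows the description below; the details are assumptions):

```cython
# cython: language_level=3
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void _func2_cy(double[::1] f, double[::1] sf2, Py_ssize_t nech) nogil:
    # C-level core: typed memoryviews only, no Python interaction in the loops
    cdef Py_ssize_t i, j, N = f.shape[0]
    cdef double d
    for i in range(nech):
        for j in range(1, N - i):
            d = f[i + j] - f[j]
            sf2[i] += d * d
        sf2[i] /= N - i - 1

def func2_cy(double[::1] f, Py_ssize_t nech):
    # thin wrapper that allocates the output and calls the C-level core
    sf2 = np.zeros(nech)
    _func2_cy(f, sf2, nech)
    return sf2
```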
This is significantly more complicated to write compared to Numba, but it achieves similar performance. The trick is to have a function `_func2_cy` which operates with little to no interaction with Python (read: it goes at C speed).

Again, the result is the same as `func2`:

Timings

With some little toy benchmarking we get a feeling for the speed-ups, including a comparable function written the way you did, and the vectorized solution proposed in Andras Deak's very fine answer (but fixing the indices to match the above):
...and with the input sizes you provided:
You already got some nice answers that presented fast numpy and numba implementations. I'd like to present a decent Cython implementation for a fair comparison.
First of all, let's time norok2's fastest numba implementation on my machine:
In order to obtain fast memory access in Cython, it's crucial to give it all the necessary information about the input arrays. In your case, `f_arr` is a C-contiguous `np.ndarray`, so we use a C-contiguous memoryview `double[::1]` instead of a normal memoryview `double[:]`. The difference is that indexing a C-contiguous memoryview reduces to the plain C code `f_arr[i]`, while the latter reduces to `f_arr[i * f_arr.strides[0]]`.

Next, it's worth mentioning that the Python power operator `a**2` is going to be substituted by the C code `pow(a, 2)`, i.e. we call the `pow` function. Calling a function in tight loops has unnecessary overhead, even in C. In C, we would just write `a*a`, so let's do the same in Cython.

In code:
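For instance, a minimal sketch along those lines (the function name is an assumption):

```cython
# cython: language_level=3
import numpy as np
cimport cython

@cython.boundscheck(False)   # no bounds checking in the hot loops
@cython.wraparound(False)    # no negative-index handling either
def structfun_cy(double[::1] f_arr, Py_ssize_t nech):
    cdef Py_ssize_t N = f_arr.shape[0]
    cdef double[::1] sf2 = np.zeros(nech)
    cdef Py_ssize_t i, j
    cdef double d
    for i in range(nech):
        for j in range(1, N - i):
            d = f_arr[i + j] - f_arr[j]
            sf2[i] += d * d          # a*a instead of pow(a, 2)
        sf2[i] /= N - i - 1
    return np.asarray(sf2)
```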
Compiled on my machine with Apple Clang 13.1.6 on macOS, this is slower than the aforementioned numba implementation:
However, one should be aware that Cython is basically just a Python-to-C compiler. This means that after we have removed all the Python interactions, we can try the same performance tricks we'd apply when using C code: passing the optimization flag `-O3` and enabling all CPU instructions available on the current platform with `-march=native`. Note that this also implies that the performance of good Cython code (i.e. code with no Python interactions in tight loops) heavily depends on your C compiler. In my experience, numba is often faster than Cython on Windows due to MSVC's bad auto-vectorization; on macOS/Linux with gcc or clang, it's often the other way around.

One of these well-known performance tricks is loop unrolling, in order to give the compiler hints to SIMD-vectorize the specific loop. Unrolling the innermost loop, the function looks like this:
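Roughly like the following sketch (a two-way unroll with two independent accumulators; the unroll factor and names are choices made for illustration):

```cython
@cython.boundscheck(False)
@cython.wraparound(False)
def structfun_cy_unrolled(double[::1] f_arr, Py_ssize_t nech):
    cdef Py_ssize_t N = f_arr.shape[0]
    cdef double[::1] sf2 = np.zeros(nech)
    cdef Py_ssize_t i, j
    cdef double d0, d1, acc0, acc1
    for i in range(nech):
        acc0 = acc1 = 0.0
        # two iterations per pass; independent accumulators help the
        # compiler to SIMD-vectorize the loop
        for j in range(1, N - i - 1, 2):
            d0 = f_arr[i + j] - f_arr[j]
            d1 = f_arr[i + j + 1] - f_arr[j + 1]
            acc0 += d0 * d0
            acc1 += d1 * d1
        # leftover iteration when the trip count N - i - 1 is odd
        if (N - i - 1) % 2 == 1:
            j = N - i - 1
            d0 = f_arr[i + j] - f_arr[j]
            acc0 += d0 * d0
        sf2[i] = (acc0 + acc1) / (N - i - 1)
    return np.asarray(sf2)
```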
On my machine, this is nearly two times faster than numba:
To take it one step further, we can parallelize the outer loop with the help of `prange`, which is just a thin wrapper around OpenMP:
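For example (a sketch; it must be compiled and linked with OpenMP, e.g. `-fopenmp`):

```cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def structfun_cy_par(double[::1] f_arr, Py_ssize_t nech):
    cdef Py_ssize_t N = f_arr.shape[0]
    cdef double[::1] sf2 = np.zeros(nech)
    cdef Py_ssize_t i, j
    cdef double d
    # iterations of the outer loop are independent, so they can be
    # distributed over OpenMP threads
    for i in prange(nech, nogil=True):
        for j in range(1, N - i):
            d = f_arr[i + j] - f_arr[j]
            sf2[i] += d * d
        sf2[i] /= N - i - 1
    return np.asarray(sf2)
```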
On my machine with 8 CPU cores, this is by far the fastest implementation:
However, it's worth mentioning that Numba also supports thread parallelism, so I'd expect a similar performance with Numba.
With some basic python/numpy rewrites I can speed up your code quite a bit
For a modest sample size:
they match:
and timings:
I first replaced `for j in indN:` with `for j in range(1,N-1-i):`. Iteration on an array is slower than iteration on a list or `range`.

As others noted, we don't need to iterate `count`.

But the big change is to replace the `j` iteration with an array slice: take the power on the whole slice and sum over the array.

I haven't looked at the `i` iteration enough to eliminate that. The `f[i+1:N-1]` slice varies in length, from `nech` down to 0.

MATLAB does a lot of JIT compilation, so you can get by with iterations. Back in the 1990s when I worked in MATLAB, that would have been bad form: those versions required whole-matrix calculations to get any reasonable speed. Python-level iteration is slow (interpreted), and slower on arrays than on lists. Try to use whole-array methods where possible, or use `numba` to compile the calculations.