在Python中重写MATLAB代码，并以快速执行

发布于 2025-01-18 15:46:35 字数 1436 浏览 0 评论 0原文

这是一个嵌套的循环，其中一个内部指数取决于外部索引，其中包括以下参数：

f = rand(1,70299)
nech=24*30*24
N=length(f);
xh=(1:nech)/24;

在Matlab：

sf2(1:nech)=0.;
sf2vel(1:nech)=0.;
count(1:nech)=0.;

for i=1:nech
    for j=1:N-i-1
        sf2(i)=sf2(i)+(f(j+i)-f(j))^2;
        count(i)=count(i)+1; 
    end
    sf2(i)=sf2(i)/count(i);
end

in Python：

def structFunPython(f,N,nech):
    sf2 = np.zeros(N)
    count = np.zeros(N)
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[j]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2

与Cython：

import cython
cimport numpy as np
import numpy as np
def structFun(np.ndarray f,N,nech):
    cdef np.ndarray sf2 = np.zeros(N), count = np.zeros(N),
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[j]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2

执行时间：

Matlab: 7.8377 sec
Python: 3651.35 sec
Cython: 3336.21 sec

我很难相信Python和Cython（尤其是Cython）（尤其是Cython）对于相同的计算，所以我认为我一定已经在Python/Cython循环中犯了一个错误，但是我看不到哪里。

原文

Here is a nested loop, where the inner one indices depend on the outer one, with the following parameters:

f = rand(1,70299)
nech=24*30*24
N=length(f);
xh=(1:nech)/24;

In MATLAB:

sf2(1:nech)=0.;
sf2vel(1:nech)=0.;
count(1:nech)=0.;

for i=1:nech
    for j=1:N-i-1
        sf2(i)=sf2(i)+(f(j+i)-f(j))^2;
        count(i)=count(i)+1; 
    end
    sf2(i)=sf2(i)/count(i);
end

In Python:

def structFunPython(f,N,nech):
    sf2 = np.zeros(N)
    count = np.zeros(N)
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[j]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2

With cython:

import cython
cimport numpy as np
import numpy as np
def structFun(np.ndarray f,N,nech):
    cdef np.ndarray sf2 = np.zeros(N), count = np.zeros(N),
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[j]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2

Executions times:

Matlab: 7.8377 sec
Python: 3651.35 sec
Cython: 3336.21 sec

I have a hard time believing Python and Cython (especially Cython) are that slow for the same computation, so I think I must have made an error in my Python/Cython loops, but I can't see where.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

渡你暖光 2025-01-25 15:46:35

免责声明：正如 @norok2 在评论中指出的那样，由于使用了 pdist，我的方法至少分配了 N*(N-1)/2 双倍。对于您的 N = 70299 这意味着数组中有大约 18.5 GB 的双精度数。其他索引数组将具有类似的大小。因此，除非您的某些用例的 N 较小，否则本答案中的矢量化方法仅在您拥有大量内存时才可行。

正如其他人所指出的，仅仅将代码从一种语言翻译成另一种语言并不会产生两种语言的最佳代码。单独使用 Cython 并不能保证加速，就像单独使用 NumPy 不能保证加速一样。

norok2的答案很好地向您展示了如何使用numba或类似的东西来编译您的数字代码。这可以为您提供与 MATLAB 中的性能非常相似的功能，因为 MATLAB 有自己的即时 (JIT) 编译器。还有优化代码编译的回旋余地，因为多种实现最终可能会带来截然不同的性能。

无论如何，我想指出的是，您还可以通过使用 NumPy 和 SciPy 中的高级功能来加速代码。特别是，您想要计算一组 1d 点之间的成对平方距离。这就是 scipy.spatial。 distance.pdist 可以为你做（使用 'sqeuclidean' 平方欧几里得范数）。优点是它只计算每个成对距离一次（这对于 CPU 和内存性能来说非常好），但缺点是挑选出你想要总结的贡献有点麻烦。

无论如何，这里是与您的 Python 实现相对应的代码（修复了内部循环使用 np.arange(1, Ni) 而不是 np.arange(1, Ni-1) ）：

from scipy.spatial.distance import pdist

def pdisty(f, nech):
    offset = f.size - nech
    res = np.zeros_like(f, shape=nech)
    dists = pdist(f[:, None], metric='sqeuclidean')
    counts = np.arange(offset, f.size)[::-1]
    inds = np.repeat(np.arange(res.size), counts)
    np.add.at(res, inds, dists[:inds.size])
    res /= counts
    return res

这里发生的是，

我们计算每个唯一的数组值对的成对距离并将其存储到 dists
我们计算每个点涉及的对的数量（这就是我们必须在最后标准化），存储它counts
找出它对应的 dists 中每个值的 1d 索引（这是困难的部分），将其存储在 inds 中
使用np.add.at将每个贡献累加到适当的产出指数
用计数标准化。

以下是 N = 1000 的一些计时，其中 func2() 是 norok2 的回答：

>>> %timeit structFunPython(f, f.size - 1)
... %timeit func2(f, f.size - 1)
... %timeit pdisty(f, f.size - 1)
1.48 s ± 89.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
274 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
36.7 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

上面的解决方案要快得多，但当然它仍然比完全编译的解决方案慢。如果您遇到其他依赖项或在系统上安装 llvm 的问题，这可能是一个合理的折衷方案。最重要的是，代码应该适应您尝试优化它的语言。

为了完整起见，这里是我用于比较的实现（我稍微更改了签名，因为 N 可以从输入数组计算，我修复了一些栅栏错误）：

def structFunPython(f, nech):
    """Very slightly modified from the question"""
    N = f.size
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        indN = np.arange(1,N-i)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[i]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2


def func2(f_arr, nech):
    """Very slightly modified from norok2's answer

    See https://stackoverflow.com/a/71704834/5067311

    """
    n = f_arr.size
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(1, n - i):
            sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] /= (n - i - 1)
    return sf2

通过这些定义，所有三个函数在机器精度内给出相同的结果：

rng = np.random.default_rng()
N = 1000
nech = N - 2
f = rng.random(1000)
assert np.allclose(structFunPython(f, nech), func2(f, nech))
assert np.allclose(structFunPython(f, nech), pdisty(f, nech))

Disclaimer: as @norok2 pointed out in a comment, my approach allocates at least N*(N-1)/2 doubles due to the use of pdist. For your N = 70299 this means about 18.5 GB of doubles in an array. Other index arrays will have similar size. So unless some of your use cases have smaller N, the vectorised approach in this answer is only feasible if you have a lot of memory.

As others have noted, just translating code from one language to another will not lead to optimal code in both languages. And using Cython alone is no guarantee for speed-up, in the same way that using NumPy alone is no guarantee for speed-up.

norok2's answer nicely shows you how you can use numba or something similar to compile your numerical code. This can give you something quite similar to what you have in MATLAB in terms of performance, since MATLAB has a just-in-time (JIT) compiler of its own. There's also wiggle room to optimize your code for compilation, because multiple implementations can end up with wildly different performance.

Anyway, I want to make the point that you can also speed up your code by using advanced features in NumPy and SciPy. In particular, what you want is to compute pairwise squared distances among a set of 1d points. This is what scipy.spatial.distance.pdist can do for you (with the 'sqeuclidean' squared Euclidean norm). The upside is that it only compute each pairwise distance once (which is great for CPU and memory performance), but the downside is that picking out the contributions that you want to sum up a bit more cumbersome.

Anyway, here is the code corresponding to your Python implementation (with the fix that the inner loop uses np.arange(1, N-i) rather than np.arange(1, N-i-1)):

from scipy.spatial.distance import pdist

def pdisty(f, nech):
    offset = f.size - nech
    res = np.zeros_like(f, shape=nech)
    dists = pdist(f[:, None], metric='sqeuclidean')
    counts = np.arange(offset, f.size)[::-1]
    inds = np.repeat(np.arange(res.size), counts)
    np.add.at(res, inds, dists[:inds.size])
    res /= counts
    return res

What happens here is that

we compute the pairwise distance of each unique pair of array values and store it to dists
we compute the number of pairs each point is involved in (this is what we'll have to normalise at the end), store it to counts
figure out the 1d index of each value in dists that it corresponds to (this is the hard part), store it in inds
use np.add.at to accumulate each contribution to the appropriate output index
normalise with the counts.

Here are some timings for N = 1000, where func2() is the corresponding function from norok2's answer:

>>> %timeit structFunPython(f, f.size - 1)
... %timeit func2(f, f.size - 1)
... %timeit pdisty(f, f.size - 1)
1.48 s ± 89.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
274 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
36.7 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The above solution is considerably faster, but of course it's still slower than a fully compiled solution. If you have an issue with additional dependencies or getting llvm installed on your system, this might be a reasonable compromise. The bottom line is that code should be adapted to the language you're trying to optimise it in.

For completeness' sake here are the implementations I used for the comparison (I slightly changed the signatures because N can be computed from the input array, and I fixed some fencepost errors):

def structFunPython(f, nech):
    """Very slightly modified from the question"""
    N = f.size
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        indN = np.arange(1,N-i)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[i]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2


def func2(f_arr, nech):
    """Very slightly modified from norok2's answer

    See https://stackoverflow.com/a/71704834/5067311

    """
    n = f_arr.size
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(1, n - i):
            sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] /= (n - i - 1)
    return sf2

With these definitions all three functions give the same result within machine precision:

rng = np.random.default_rng()
N = 1000
nech = N - 2
f = rng.random(1000)
assert np.allclose(structFunPython(f, nech), func2(f, nech))
assert np.allclose(structFunPython(f, nech), pdisty(f, nech))

回复收藏 0 原文

初相遇 2025-01-25 15:46:35

我将 MATLAB 代码重写为等效的 Python 代码的方式可能是（注意 MATLAB 中从 1 开始的索引和 Python 中从 0 开始的索引...因为我不知道在没有上下文的情况下应该如何调整它，所以我采用了最简单的方法）：

import numpy as np


def func(f_arr, nech, n):
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        for j in range(n - i):
            sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
            count[i] += 1
        sf2[i] /= count[i]
    return sf2

请注意，count[i] += 1 是无用的，因为 count[i] 的最终值是预先知道的，并且实际上是整个 count > 没用，例如：

import numpy as np


def func2(f_arr, nech, n):
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(n - i):
            sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] /= (n - i)
    return sf2

Speed-up

这是一个手动案例努巴加速。这就像添加/使用 Numba @njit 装饰器一样简单：

import numba as nb


func_nb = nb.njit(func)
func2_nb = nb.njit(func2)

现在，func、func_nb、func2和 func2_nb 都执行相同的计算：

nech = n = 6
f_arr = np.arange(n)
print(func(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func_nb(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func2(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func2_nb(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

如果您确实需要坚持使用 Cython，这里有一个基于 func2 的实现：

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


import numpy as np
import cython as cy


cpdef func2_cy(f_arr, nech, n):
    sf2 = np.zeros(nech)
    _func2_cy(f_arr.astype(np.float_), sf2, nech, n)
    return sf2


cdef _func2_cy(double[:] f_arr, double[:] sf2, cy.int nech, cy.int n):
    for i in range(nech):
        for j in range(1, n - i):
            sf2[i] = sf2[i] + (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] = sf2[i] / (n - i)

与 Numba 相比，编写起来要复杂得多，但实现了类似的性能。
诀窍是有一个函数 _func2_cy ，它的运行几乎不需要与 Python 交互（读：它以 C 速度运行）。

结果再次与 func2 相同：

nech = n = 6
f_arr = np.arange(n)
print(func2_cy(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

计时

通过一些小玩具基准测试，我们可以感受到加速，包括像您一样编写的类似函数，以及 Andras Deak 的非常好的答案（但修复索引以匹配上述内容）：

def func_OP(f, nech, n):
    sf2 = np.zeros(n)
    count = np.zeros(n)
    for i in range(nech):
        indN = np.arange(n - i)  # <-- indexing fixed
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[i]),2)
            count[i] += 1
        sf2[i] = sf2[i] / count[i]
    return sf2


func_OP_nb = nb.njit(func_OP)

def func_pdisty(f, nech, n):
    res = np.zeros(nech)
    dists = scipy.spatial.distance.pdist(f[:, None], metric='sqeuclidean')
    counts = np.arange(n - 1, n - nech - 1, -1)
    inds = np.repeat(np.arange(res.size), counts)
    np.add.at(res, inds, dists[:inds.size])
    res /= (counts + 1)
    return res

nech = n = 6
f_arr = np.arange(n)
print(func_OP(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func_pdisty(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

nech = n = 1000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 1.5 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 567 ms per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 352 ms per loop
%timeit func_OP_nb(f_arr, nech, n)
# 1000 loops, best of 5: 1.87 ms per loop
%timeit func_nb(f_arr, nech, n)
# 1000 loops, best of 5: 1.7 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 1000 loops, best of 5: 768 µs per loop
%timeit func_pdisty(f_arr, nech, n)
# 10 loops, best of 5: 44.5 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 1000 loops, best of 5: 1 ms per loop

nech = n = 2000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 6.01 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 2.3 s per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 1.42 s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 100 loops, best of 5: 7.31 ms per loop
%timeit func_nb(f_arr, nech, n)
# 100 loops, best of 5: 6.82 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 100 loops, best of 5: 3.05 ms per loop
%timeit func_pdisty(f_arr, nech, n)
# 1 loop, best of 5: 344 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 100 loops, best of 5: 3.95 ms per loop

nech = n = 4000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 24.3 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 9.27 s per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 5.71 s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 10 loops, best of 5: 29 ms per loop
%timeit func_nb(f_arr, nech, n)
# 10 loops, best of 5: 27.3 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 100 loops, best of 5: 12.2 ms per loop
%timeit func_pdisty(f_arr, nech, n)
# 1 loop, best of 5: 706 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 100 loops, best of 5: 15.9 ms per loop

...以及您提供的输入大小：

nech = 24 * 30 * 24
n = 70299
f_arr = np.random.random(n)
%timeit -n1 -r1 func_OP(f_arr, nech, n)
# 1 loop, best of 1: 1h 4min 50s per loop
%timeit -n1 -r1 func(f_arr, nech, n)
# 1 loop, best of 1: 21min 14s per loop
%timeit -n1 -r1 func2(f_arr, nech, n)  # only one run / loop
# 1 loop, best of 1: 13min 9s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 1 loop, best of 5: 4.74 s per loop
%timeit func_nb(f_arr, nech, n)
# 1 loop, best of 5: 4 s per loop
%timeit func2_nb(f_arr, nech, n)
# 1 loop, best of 5: 1.62 s per loop
# %timeit func_pdisty(f_arr, nech, n)
# -- MEMORY ERROR --
%timeit func2_cy(f_arr, nech, n)
# 1 loop, best of 5: 2.2 s per loop

The way I would rewrite the MATLAB code as equivalent Python code could be (beware of indices starting at 1 in MATLAB and at 0 in Python... since I cannot know how this should be adapted without context I went with the simplest approach):

import numpy as np


def func(f_arr, nech, n):
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        for j in range(n - i):
            sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
            count[i] += 1
        sf2[i] /= count[i]
    return sf2

Note that count[i] += 1 is useless as the final value of count[i] is known in advance, and actually the whole count is useless, e.g.:

import numpy as np


def func2(f_arr, nech, n):
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(n - i):
            sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] /= (n - i)
    return sf2

Speed-up

This a manual case of Numba speed up. This is just as easy as adding/using a Numba @njit decorator:

import numba as nb


func_nb = nb.njit(func)
func2_nb = nb.njit(func2)

Now, func, func_nb, func2 and func2_nb all perform the same computation:

nech = n = 6
f_arr = np.arange(n)
print(func(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func_nb(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func2(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func2_nb(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

If you really need to stick to Cython, here is an implementation based on func2:

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


import numpy as np
import cython as cy


cpdef func2_cy(f_arr, nech, n):
    sf2 = np.zeros(nech)
    _func2_cy(f_arr.astype(np.float_), sf2, nech, n)
    return sf2


cdef _func2_cy(double[:] f_arr, double[:] sf2, cy.int nech, cy.int n):
    for i in range(nech):
        for j in range(1, n - i):
            sf2[i] = sf2[i] + (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] = sf2[i] / (n - i)

This is significantly more complicated to write compared to Numba, but achieves similar performances.
The trick is to have a function _func2_cy which operates with little to no interaction with Python (read: it goes at C speed).

Again the result is the same as func2:

nech = n = 6
f_arr = np.arange(n)
print(func2_cy(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

Timings

With some little toy benchmarking we get a feeling of the speed-ups, including a comparable function written as you did, and the vectorized solution proposed in Andras Deak's very fine answer (but fixing the indices to match the above):

def func_OP(f, nech, n):
    sf2 = np.zeros(n)
    count = np.zeros(n)
    for i in range(nech):
        indN = np.arange(n - i)  # <-- indexing fixed
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[i]),2)
            count[i] += 1
        sf2[i] = sf2[i] / count[i]
    return sf2


func_OP_nb = nb.njit(func_OP)

def func_pdisty(f, nech, n):
    res = np.zeros(nech)
    dists = scipy.spatial.distance.pdist(f[:, None], metric='sqeuclidean')
    counts = np.arange(n - 1, n - nech - 1, -1)
    inds = np.repeat(np.arange(res.size), counts)
    np.add.at(res, inds, dists[:inds.size])
    res /= (counts + 1)
    return res

nech = n = 6
f_arr = np.arange(n)
print(func_OP(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func_pdisty(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

nech = n = 1000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 1.5 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 567 ms per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 352 ms per loop
%timeit func_OP_nb(f_arr, nech, n)
# 1000 loops, best of 5: 1.87 ms per loop
%timeit func_nb(f_arr, nech, n)
# 1000 loops, best of 5: 1.7 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 1000 loops, best of 5: 768 µs per loop
%timeit func_pdisty(f_arr, nech, n)
# 10 loops, best of 5: 44.5 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 1000 loops, best of 5: 1 ms per loop

nech = n = 2000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 6.01 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 2.3 s per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 1.42 s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 100 loops, best of 5: 7.31 ms per loop
%timeit func_nb(f_arr, nech, n)
# 100 loops, best of 5: 6.82 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 100 loops, best of 5: 3.05 ms per loop
%timeit func_pdisty(f_arr, nech, n)
# 1 loop, best of 5: 344 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 100 loops, best of 5: 3.95 ms per loop

nech = n = 4000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 24.3 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 9.27 s per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 5.71 s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 10 loops, best of 5: 29 ms per loop
%timeit func_nb(f_arr, nech, n)
# 10 loops, best of 5: 27.3 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 100 loops, best of 5: 12.2 ms per loop
%timeit func_pdisty(f_arr, nech, n)
# 1 loop, best of 5: 706 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 100 loops, best of 5: 15.9 ms per loop

...and with the input sizes you provided:

nech = 24 * 30 * 24
n = 70299
f_arr = np.random.random(n)
%timeit -n1 -r1 func_OP(f_arr, nech, n)
# 1 loop, best of 1: 1h 4min 50s per loop
%timeit -n1 -r1 func(f_arr, nech, n)
# 1 loop, best of 1: 21min 14s per loop
%timeit -n1 -r1 func2(f_arr, nech, n)  # only one run / loop
# 1 loop, best of 1: 13min 9s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 1 loop, best of 5: 4.74 s per loop
%timeit func_nb(f_arr, nech, n)
# 1 loop, best of 5: 4 s per loop
%timeit func2_nb(f_arr, nech, n)
# 1 loop, best of 5: 1.62 s per loop
# %timeit func_pdisty(f_arr, nech, n)
# -- MEMORY ERROR --
%timeit func2_cy(f_arr, nech, n)
# 1 loop, best of 5: 2.2 s per loop

回复收藏 0 原文

浮生未歇 2025-01-25 15:46:35

您已经得到了一些不错的答案，这些答案呈现出快速的Numpy和Numba实现。我想提出一个体面的Cython实施，以进行公平的比较。

首先，让我们的时间NOROK2在我的计算机上实现了最快的NUMBA实现：

In [3]: %timeit func2_nb(f_arr, n)
3.4 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

为了获得Cython的快速内存访问，为其提供有关输入阵列的所有必要信息至关重要。在您的情况下，f_arr是一个c- contiguul np.ndarray，因此我们使用c- contiguul memoryViews double> double [:: 1]而不是正常内存> double double double double [：]。不同之处在于，索引c con-contiguul memenationview将简单的C代码f_arr [i]还原为f_arr [i + f_arr.stridess [0]]。
接下来，值得一提的是，Python Power Operator A ** 2将用C代码POW> POW（A，2）替代POW - 功能。即使在C中，我们也只会编写a*a。因此，让我们在Cython中做同样的事情。

在代码中：

%%cython -c=-O3 -c=-march=native
# for MSVC use: -c=/Ox -c=/arch:AVX2
#           or: -c=/Ox -c=/arch:AVX512 if your CPU supports AVX512 

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def func2_cy(double[::1] f_arr, int nech):
    cdef int n = f_arr.size
    cdef double[::1] sf2 = np.zeros(nech)
    cdef int i, j
    for i in range(nech):
        for j in range(n-i-2):
            sf2[i] += (f_arr[i+j+1]-f_arr[i])*(f_arr[i+j+1]-f_arr[i])
        sf2[i] /= (n - i - 2)
    return np.asarray(sf2)

在MacOS上使用Apple Clang 13.1.6在我的计算机上编译，这比上述NUMBA实现慢：

In [5]: %timeit func2_cy(f_arr, n)
4.2 s ± 32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

但是，人们应该意识到Cython基本上只是python to c 编译器。这意味着在删除了所有Python交互后，我们可以尝试使用C代码时要应用的相同性能技巧：传递优化标志-O3，并启用当前平台上可用的所有CPU指令-march =本机。请注意，这也意味着良好的Cython代码的性能（即在紧密循环中没有Python交互的代码）在很大程度上取决于您的C编译器。根据我的经验，由于MSVC的自动矢量不良，Numba通常比Windows上的Cython快。在MacOS/Linux和GCC或Clang上，通常是相反的。

这些众所周知的性能技巧之一是循环汇总，以便给出编译器提示以simd矢量化特定循环。在最终的循环中展开，功能看起来像这样：

%%cython -c=-O3 -c=-march=native
# for MSVC use: -c=/Ox -c=/arch:AVX2
#           or: -c=/Ox -c=/arch:AVX512 if your CPU supports AVX512 

#cython: cdivision=True

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def func3_cy(double[::1] f_arr, int nech):
    cdef int n = f_arr.size
    cdef double[::1] sf2 = np.zeros(nech)
    cdef double fi, fij, fij1, fij2, fij3
    cdef int i, j, ntmp
    for i in range(nech):
        # unroll the inner loop for better SIMD vectorization
        j = 1
        ntmp = (n-i-2) - ((n-i-2) % 4)
        # we use a while-loop since Cython doesn't support range(j, ntmp, 4)
        while j < ntmp:
            # unpack the values and hope the CPU keeps them in its register
            fi   = f_arr[i]
            fij  = f_arr[i+j+1]
            fij1 = f_arr[i+j+2]
            fij2 = f_arr[i+j+3]
            fij3 = f_arr[i+j+4]
            sf2[i] += ((fij-fi)*(fij-fi) + (fij1-fi)*(fij1-fi) + (fij2-fi)*(fij2-fi) + (fij3-fi)*(fij3-fi))
            j += 4
        for j in range(ntmp, n - i - 2):
            sf2[i] += (f_arr[i+j+1] - f_arr[i]) * (f_arr[i+j+1] - f_arr[i])
        sf2[i] /= (n-i-2)
    return np.asarray(sf2)

在我的计算机上，这比Numba快几乎两倍：

In [7]: %timeit func3_cy(f_arr, n)
1.7 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

为了迈出一步，我们可以在 prange，这只是围绕 OpenMP ：

%%cython -c=-O3 -c=-march=native -c=-fopenmp --link-args=-fopenmp
# for MSVC use: -c=/Ox -c=/openmp -c=/arch:AVX2
#           or: -c=/Ox -c=/openmp -c=/arch:AVX512 if your CPU supports AVX512 

#cython: cdivision=True

import numpy as np
cimport numpy as np
cimport cython
from cython.parallel cimport prange

@cython.boundscheck(False)
@cython.wraparound(False)
def func4_cy(double[::1] f_arr, int nech):
    cdef int n = f_arr.size
    cdef double[::1] sf2 = np.zeros(nech)
    cdef double fi, fij, fij1, fij2, fij3
    cdef int i, j, ntmp
    for i in prange(nech, nogil=True, schedule="static", chunksize=1):
        # unroll the inner loop for better SIMD vectorization
        j = 1
        ntmp = (n-i-2) - ((n-i-2) % 4)
        # we use a while-loop since Cython doesn't support range(j, ntmp, 4)
        while j < ntmp:
            # unpack the values and hope the CPU keeps them in its register
            fi   = f_arr[i]
            fij  = f_arr[i+j+1]
            fij1 = f_arr[i+j+2]
            fij2 = f_arr[i+j+3]
            fij3 = f_arr[i+j+4]
            sf2[i] += ((fij-fi)*(fij-fi) + (fij1-fi)*(fij1-fi) + (fij2-fi)*(fij2-fi) + (fij3-fi)*(fij3-fi))
            j += 4
        for j in range(ntmp, n - i - 2):
            sf2[i] += (f_arr[i+j+1] - f_arr[i]) * (f_arr[i+j+1] - f_arr[i])
        sf2[i] /= (n-i-2)
    return np.asarray(sf2)

在带有8个CPU内核的我的机器上，这是迄今为止最快的实现：

In [9]: %timeit func4_cy(f_arr, n)
329 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

但是，值得一提的是Numba也值得一提支持线程并行性，因此我希望与Numba相似。

You already got some nice answers that presented fast numpy and numba implementations. I'd like to present a decent Cython implementation for a fair comparison.

First of all, let's time norok2's fastest numba implementation on my machine:

In [3]: %timeit func2_nb(f_arr, n)
3.4 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In order to obtain fast memory access in Cython, it's crucial to give it all the necessary information about the input arrays. In your case, f_arr is a C-contiguous np.ndarray, so we use a C-contiguous memoryviews double[::1] instead of normal memoryviews double[:]. The difference is that indexing a C-contiguous memoryview reduces to plain C code f_arr[i] while the latter reduces to f_arr[i + f_arr.strides[0]].
Next, it's worth mentioning that the Python power operator a**2 is gonna be substituted by the C code pow(a,2), i.e. we call the pow-function. Calling a function in tight loops has unnecessary overhead, even in C. In C, we would just write a*a. So let's do the same in Cython.

In Code:

%%cython -c=-O3 -c=-march=native
# for MSVC use: -c=/Ox -c=/arch:AVX2
#           or: -c=/Ox -c=/arch:AVX512 if your CPU supports AVX512 

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def func2_cy(double[::1] f_arr, int nech):
    cdef int n = f_arr.size
    cdef double[::1] sf2 = np.zeros(nech)
    cdef int i, j
    for i in range(nech):
        for j in range(n-i-2):
            sf2[i] += (f_arr[i+j+1]-f_arr[i])*(f_arr[i+j+1]-f_arr[i])
        sf2[i] /= (n - i - 2)
    return np.asarray(sf2)

Compiled on my machine with Apple Clang 13.1.6 on macOS, this is slower than the aforementioned numba implementation:

In [5]: %timeit func2_cy(f_arr, n)
4.2 s ± 32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, one should be aware that Cython is basically just a Python to C compiler. This means that after we removed all the python interactions, we can try the same performance tricks we'd apply when using C code: passing the optimization flag -O3 and enabling all CPU instructions available on the current platform -march=native. Note that this also implies that the performance of good Cython code (i.e. code with no python interactions in tight loops) heavily depends on your C Compiler. In my experience, numba is often faster than Cython on Windows due to MSVC's bad auto-vectorization. On macOS/Linux and gcc or clang, it's often the other way around.

One of these well-known performance tricks is loop-unrolling in order to give the compiler hints to SIMD-vectorize the specific loop. Unrolling the innermost loop, the function looks like this:

%%cython -c=-O3 -c=-march=native
# for MSVC use: -c=/Ox -c=/arch:AVX2
#           or: -c=/Ox -c=/arch:AVX512 if your CPU supports AVX512 

#cython: cdivision=True

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def func3_cy(double[::1] f_arr, int nech):
    cdef int n = f_arr.size
    cdef double[::1] sf2 = np.zeros(nech)
    cdef double fi, fij, fij1, fij2, fij3
    cdef int i, j, ntmp
    for i in range(nech):
        # unroll the inner loop for better SIMD vectorization
        j = 1
        ntmp = (n-i-2) - ((n-i-2) % 4)
        # we use a while-loop since Cython doesn't support range(j, ntmp, 4)
        while j < ntmp:
            # unpack the values and hope the CPU keeps them in its register
            fi   = f_arr[i]
            fij  = f_arr[i+j+1]
            fij1 = f_arr[i+j+2]
            fij2 = f_arr[i+j+3]
            fij3 = f_arr[i+j+4]
            sf2[i] += ((fij-fi)*(fij-fi) + (fij1-fi)*(fij1-fi) + (fij2-fi)*(fij2-fi) + (fij3-fi)*(fij3-fi))
            j += 4
        for j in range(ntmp, n - i - 2):
            sf2[i] += (f_arr[i+j+1] - f_arr[i]) * (f_arr[i+j+1] - f_arr[i])
        sf2[i] /= (n-i-2)
    return np.asarray(sf2)

On my machine, this is nearly two times faster than numba:

In [7]: %timeit func3_cy(f_arr, n)
1.7 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

To take it one step further, we can parallelize the outer loop with the help of prange which is just a thin wrapper around OpenMP:

%%cython -c=-O3 -c=-march=native -c=-fopenmp --link-args=-fopenmp
# for MSVC use: -c=/Ox -c=/openmp -c=/arch:AVX2
#           or: -c=/Ox -c=/openmp -c=/arch:AVX512 if your CPU supports AVX512 

#cython: cdivision=True

import numpy as np
cimport numpy as np
cimport cython
from cython.parallel cimport prange

@cython.boundscheck(False)
@cython.wraparound(False)
def func4_cy(double[::1] f_arr, int nech):
    cdef int n = f_arr.size
    cdef double[::1] sf2 = np.zeros(nech)
    cdef double fi, fij, fij1, fij2, fij3
    cdef int i, j, ntmp
    for i in prange(nech, nogil=True, schedule="static", chunksize=1):
        # unroll the inner loop for better SIMD vectorization
        j = 1
        ntmp = (n-i-2) - ((n-i-2) % 4)
        # we use a while-loop since Cython doesn't support range(j, ntmp, 4)
        while j < ntmp:
            # unpack the values and hope the CPU keeps them in its register
            fi   = f_arr[i]
            fij  = f_arr[i+j+1]
            fij1 = f_arr[i+j+2]
            fij2 = f_arr[i+j+3]
            fij3 = f_arr[i+j+4]
            sf2[i] += ((fij-fi)*(fij-fi) + (fij1-fi)*(fij1-fi) + (fij2-fi)*(fij2-fi) + (fij3-fi)*(fij3-fi))
            j += 4
        for j in range(ntmp, n - i - 2):
            sf2[i] += (f_arr[i+j+1] - f_arr[i]) * (f_arr[i+j+1] - f_arr[i])
        sf2[i] /= (n-i-2)
    return np.asarray(sf2)

On my machine with 8 CPU Cores, this is by far the fastest implementation:

In [9]: %timeit func4_cy(f_arr, n)
329 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, it's worth mentioning that Numba also supports thread parallelism, so I'd expect a similar performance with Numba.

回复收藏 0 原文

鹤仙姿 2025-01-25 15:46:35

通过一些基本的 python/numpy 重写，我可以大大加快你的代码

def structFunPython4(f,N,nech):
    sf2 = np.zeros(N)
    #count = np.zeros(N)
    for i in range(nech):
        # indN = np.arange(1,N-i-1)
        sf2[i] = np.sum(np.power(f[i+1:N-1]-f[i],2))
        #for j in range(1,N-i-1):
        #    sf2[i] += np.power((f[i+j]-f[i]),2)
        #    #count[i] += 1
        sf2[i] = sf2[i]/(N-i-2)
    return sf2

对于适度的样本大小：

In [53]: f = np.arange(100); N=f.shape[0]; nech=98

它们匹配：

In [54]: np.allclose(structFunPython(f,N,nech),structFunPython4(f,N,nech))
Out[54]: True

和计时：

In [55]: timeit structFunPython(f,N,nech)
34.4 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [56]: timeit structFunPython4(f,N,nech)
2.22 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

我首先将 for i in indN: 替换为 for i in范围（1，N-1-i）：。数组上的迭代比列表或范围上的迭代慢。

正如其他人指出的，我们不需要迭代count。

但最大的变化是用数组切片替换 j 迭代，对整个切片和数组求和上电。

我还没有足够多地研究 i 迭代来消除这个问题。 f[i+1:N-1] 切片的长度各不相同，从 nech 到 0。MATLAB

执行大量 jit 编译，因此您可以通过迭代来完成。早在 20 世纪 90 年代，当我使用 MATLAB 时，这可能是一种糟糕的形式 - 那些版本需要整个矩阵计算才能获得合理的速度。 Python 级别的迭代很慢（解释），并且在数组上比在列表上慢。尽可能尝试使用全数组方法。或者使用numba来编译计算。

====

With some basic python/numpy rewrites I can speed up your code quite a bit

def structFunPython4(f,N,nech):
    sf2 = np.zeros(N)
    #count = np.zeros(N)
    for i in range(nech):
        # indN = np.arange(1,N-i-1)
        sf2[i] = np.sum(np.power(f[i+1:N-1]-f[i],2))
        #for j in range(1,N-i-1):
        #    sf2[i] += np.power((f[i+j]-f[i]),2)
        #    #count[i] += 1
        sf2[i] = sf2[i]/(N-i-2)
    return sf2

For a modest sample size:

In [53]: f = np.arange(100); N=f.shape[0]; nech=98

they match:

In [54]: np.allclose(structFunPython(f,N,nech),structFunPython4(f,N,nech))
Out[54]: True

and timings:

In [55]: timeit structFunPython(f,N,nech)
34.4 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [56]: timeit structFunPython4(f,N,nech)
2.22 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I first replaced for i in indN: with for i in range(1,N-1-i):. Iteration on an array is slower than iteration on a list or range.

As others noted we don't need to iterate the count.

But the big change is to replace the j iteration with an array slice, power on the whole slice and array sum.

I haven't looked at the i iteration enough to eliminate that. f[i+1:N-1] slice varies in length, from nech down to 0.

MATLAB does a lot of jit compilation, so you can get by with iterations. Back in the 1990s when I worked MATLAB that would have been bad form - those versions required whole-matrix calculations to get any reasonable speed. Python level iteration is slow (interpreted), and slower on arrays than on lists. Try to use whole-array methods where possible. Or use numba to compile the calculations.

====

回复收藏 0 原文

~没有更多了~