在 NumPy 中快速检查 NaN

发布于 2024-11-25 00:56:38 字数 277 浏览 4 评论 0原文

我正在寻找最快的方法来检查 NumPy 数组 X 中是否出现 NaN (np.nan)。 np.isnan(X) 是不可能的,因为它构建了一个形状为 X.shape 的布尔数组,该数组可能非常巨大。

我在 X 中尝试了 np.nan,但这似乎不起作用,因为 np.nan != np.nan。有没有一种快速且节省内存的方法来做到这一点?

(对于那些会问“有多大”的人:我不知道。这是库代码的输入验证。)

I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.

I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?

(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

晨敛清荷 2024-12-02 00:56:38

雷的解决方案很好。然而,在我的机器上,使用 numpy.sum 代替 numpy.min

In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop

In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop

min 不同,sum 不需要分支,在现代硬件上这往往相当昂贵。这可能就是 sum 更快的原因。

编辑 上述测试是在数组中间有一个 NaN 的情况下执行的。

有趣的是,存在 NaN 时 min 比不存在 NaN 时慢。随着 NaN 越来越接近数组的开头,它似乎也会变得更慢。另一方面,无论是否存在 NaN 以及它们位于何处,sum 的吞吐量似乎都是恒定的:

In [40]: x = np.random.rand(100000)

In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop

In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop

In [43]: x[50000] = np.nan

In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop

In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop

In [46]: x[0] = np.nan

In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop

In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop

Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:

In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop

In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop

Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.

edit The above test was performed with a single NaN right in the middle of the array.

It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:

In [40]: x = np.random.rand(100000)

In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop

In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop

In [43]: x[50000] = np.nan

In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop

In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop

In [46]: x[0] = np.nan

In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop

In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
憧憬巴黎街头的黎明 2024-12-02 00:56:38

我认为 np.isnan(np.min(X)) 应该做你想做的。

I think np.isnan(np.min(X)) should do what you want.

凌乱心跳 2024-12-02 00:56:38

这里有两种通用方法:

  • 检查每个数组项的 nan 并获取 any
  • 应用一些保留 nan 的累积运算(如 sum)并检查其结果。

虽然第一种方法肯定是最干净的,但对一些累积操作(特别是在 BLAS 中执行的操作,如 dot)的大量优化可以使这些操作变得非常快。请注意,dot 与其他一些 BLAS 操作一样,在某些条件下是多线程的。这解释了不同机器之间速度的差异。

输入图像描述这里

import numpy as np
import perfplot


def min(a):
    return np.isnan(np.min(a))


def sum(a):
    return np.isnan(np.sum(a))


def dot(a):
    return np.isnan(np.dot(a, a))


def any(a):
    return np.any(np.isnan(a))


def einsum(a):
    return np.isnan(np.einsum("i->", a))


b = perfplot.bench(
    setup=np.random.rand,
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()

There are two general approaches here:

  • Check each array item for nan and take any.
  • Apply some cumulative operation that preserves nans (like sum) and check its result.

While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, are multithreaded under certain conditions. This explains the difference in speed between different machines.

enter image description here

import numpy as np
import perfplot


def min(a):
    return np.isnan(np.min(a))


def sum(a):
    return np.isnan(np.sum(a))


def dot(a):
    return np.isnan(np.dot(a, a))


def any(a):
    return np.any(np.isnan(a))


def einsum(a):
    return np.isnan(np.einsum("i->", a))


b = perfplot.bench(
    setup=np.random.rand,
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()
北方的巷 2024-12-02 00:56:38

即使存在一个可接受的答案,我也想演示以下内容(在 Vista 上使用 Python 2.7.2 和 Numpy 1.6.0):

In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop

In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop

因此,真正有效的方法可能在很大程度上依赖于操作系统。无论如何,基于 dot(.) 似乎是最稳定的。

Even there exist an accepted answer, I'll like to demonstrate the following (with Python 2.7.2 and Numpy 1.6.0 on Vista):

In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop

In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop

Thus, the really efficient way might be heavily dependent on the operating system. Anyway dot(.) based seems to be the most stable one.

梦在深巷 2024-12-02 00:56:38

如果您对 感到满意,它允许创建快速短路(一旦发现 NaN 就停止)函数:

import numba as nb
import math

@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False

如果没有 NaN,该函数实际上可能比 np.min 慢,我认为这是因为np.min对大型数组使用多重处理:

import numpy as np
array = np.random.random(2000000)

%timeit anynan(array)          # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.64 ms per loop

但如果数组中存在 NaN,特别是如果它的位置处于低索引,则速度要快得多:

array = np.random.random(2000000)
array[100] = np.nan

%timeit anynan(array)          # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.65 ms per loop

使用 Cython 或 C 扩展可以实现类似的结果,但这些结果有点复杂(或者很容易以 bottleneck.anynan 形式获得),但最终会这样做与我的 Anynan 函数相同。

If you're comfortable with it allows to create a fast short-circuit (stops as soon as a NaN is found) function:

import numba as nb
import math

@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False

If there is no NaN the function might actually be slower than np.min, I think that's because np.min uses multiprocessing for large arrays:

import numpy as np
array = np.random.random(2000000)

%timeit anynan(array)          # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.64 ms per loop

But in case there is a NaN in the array, especially if it's position is at low indices, then it's much faster:

array = np.random.random(2000000)
array[100] = np.nan

%timeit anynan(array)          # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.65 ms per loop

Similar results may be achieved with Cython or a C extension, these are a bit more complicated (or easily avaiable as bottleneck.anynan) but ultimatly do the same as my anynan function.

鲜血染红嫁衣 2024-12-02 00:56:38
  1. 使用.any()

    if numpy.isnan(myarray).any()

  2. numpy.isfinite 可能比 isnan 更好用于检查

    如果不是 np.isfinite(prop).all()

  1. use .any()

    if numpy.isnan(myarray).any()

  2. numpy.isfinite maybe better than isnan for checking

    if not np.isfinite(prop).all()

空城缀染半城烟沙 2024-12-02 00:56:38

与此相关的是如何找到 NaN 第一次出现的问题。这是我所知道的最快的处理方法:

index = next((i for (i,n) in enumerate(iterable) if n!=n), None)

Related to this is the question of how to find the first occurrence of NaN. This is the fastest way to handle that that I know of:

index = next((i for (i,n) in enumerate(iterable) if n!=n), None)
望她远 2024-12-02 00:56:38

添加到 @nico-schlömer 和 @mseifert 的答案中,我计算了提前停止的 numba 测试 has_nan 的性能,与一些解析完整数组的函数进行比较。

在我的机器上,对于没有 nan 的数组,收支平衡发生在 ~10^4 个元素上。

has_nan_vs_full_parse_methods


import perfplot
import numpy as np
import numba
import math

def min(a):
    return np.isnan(np.min(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))

@numba.njit
def has_nan(a):
    for i in range(a.size - 1):
        if math.isnan(a[i]):
            return True
    return False


def array_with_missing_values(n, p):
    """ Return array of size n,  p : nans ( % of array length )
    Ex : n=1e6, p=1 : 1e4 nan assigned at random positions """
    a = np.random.rand(n)
    p = np.random.randint(0, len(a), int(p*len(a)/100))
    a[p] = np.nan
    return a


#%%
perfplot.show(
    setup=lambda n: array_with_missing_values(n, 0),
    kernels=[min, dot, has_nan],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)

如果数组有 nans 会发生什么?我研究了阵列纳米覆盖的影响。

对于长度为 1,000,000 的数组,如果数组中有 ~10^-3 % nan(因此 ~10 nan),则 has_nan 成为更好的选择。

影响nan-coverage of array


#%%
N = 1000000  # 100000
perfplot.show(
    setup=lambda p: array_with_missing_values(N, p),
    kernels=[min, dot, has_nan],
    n_range=np.array([2 ** k for k in range(20)]) / 2**20 * 0.01, 
    logy=True,
    xlabel=f"% of nan in array (N = {N})",
)

如果在您的应用程序中大多数数组都有 nan 而您正在寻找没有的数组,那么 has_nan 是最好的方法。
别的; dot 似乎是最好的选择。

Adding to @nico-schlömer and @mseifert 's answers, I computed the performance of a numba-test has_nan with early stops, compared to some of the functions that will parse the full array.

On my machine, for an array without nans, the break-even happens for ~10^4 elements.

has_nan_vs_full_parse_methods


import perfplot
import numpy as np
import numba
import math

def min(a):
    return np.isnan(np.min(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))

@numba.njit
def has_nan(a):
    for i in range(a.size - 1):
        if math.isnan(a[i]):
            return True
    return False


def array_with_missing_values(n, p):
    """ Return array of size n,  p : nans ( % of array length )
    Ex : n=1e6, p=1 : 1e4 nan assigned at random positions """
    a = np.random.rand(n)
    p = np.random.randint(0, len(a), int(p*len(a)/100))
    a[p] = np.nan
    return a


#%%
perfplot.show(
    setup=lambda n: array_with_missing_values(n, 0),
    kernels=[min, dot, has_nan],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)

What happens if the array has nans ? I investigated the impact of the nan-coverage of the array.

For arrays of length 1,000,000, has_nan becomes a better option is there are ~10^-3 % nans (so ~10 nans) in the array.

impact of nan-coverage of array


#%%
N = 1000000  # 100000
perfplot.show(
    setup=lambda p: array_with_missing_values(N, p),
    kernels=[min, dot, has_nan],
    n_range=np.array([2 ** k for k in range(20)]) / 2**20 * 0.01, 
    logy=True,
    xlabel=f"% of nan in array (N = {N})",
)

If in your application most arrays have nan and you're looking for ones without, then has_nan is the best approach.
Else; dot seems to be the best option.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文