Fast check for NaN in NumPy
I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.
I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?
(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)
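For illustration, here is a minimal sketch of the behavior described in the question. The array contents and variable names are made up for the example.

```python
import numpy as np

X = np.array([1.0, np.nan, 3.0])

# NaN never compares equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# `in` on an array is based on elementwise ==, so it cannot find a NaN.
found_by_in = bool(np.nan in X)

# The straightforward check works, but np.isnan(X) first allocates
# a boolean array with the same shape as X.
found_by_isnan = bool(np.isnan(X).any())

print(nan_equals_itself, found_by_in, found_by_isnan)
```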
Answers (8)
Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min.
Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.
Edit: The above test was performed with a single NaN right in the middle of the array.
It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located.
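The answer's original timing code did not survive on this page. As a rough sketch of the sum-based check it describes (the function name has_nan_sum and the test arrays are assumptions made for this example), one could write:

```python
import numpy as np

def has_nan_sum(x):
    """Return True if x contains a NaN, using one reduction to a scalar."""
    # NaN propagates through addition, so the total is NaN if any element is.
    # Caveat: +inf and -inf in the same array also add up to NaN,
    # so this check can report a NaN that is not actually in the data.
    with np.errstate(invalid="ignore"):
        return bool(np.isnan(np.sum(x)))

clean = np.arange(1_000_000, dtype=np.float64)
dirty = clean.copy()
dirty[dirty.size // 2] = np.nan  # a single NaN in the middle of the array

print(has_nan_sum(clean), has_nan_sum(dirty))
```

No boolean array of shape X.shape is built here; only a scalar is tested. Whether it is faster than the min-based variant depends on the machine, so it is worth timing both on the target hardware.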
I think np.isnan(np.min(X)) should do what you want.
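A small sketch of this suggestion wrapped in a helper (the name has_nan_min is an assumption for the example). Note that np.min raises a ValueError on an empty array, so library code would probably need to handle that case separately.

```python
import numpy as np

def has_nan_min(x):
    """Return True if x contains a NaN, based on NaN propagation in np.min."""
    # If any element is NaN, np.min returns NaN, so only a scalar is tested.
    return bool(np.isnan(np.min(x)))

print(has_nan_min(np.array([1.0, np.nan, 3.0])))
print(has_nan_min(np.array([1.0, np.inf, -np.inf])))
```

Unlike a sum, the minimum of an array that mixes +inf and -inf is not NaN, so infinities alone do not trigger this check.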
There are two general approaches here:
1. Check each array item for nan and take any.
2. Apply some cumulative operation that preserves nans (like sum) and check its result.
While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions. This explains the difference in speed between different machines.
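As a sketch of the second approach using dot (the helper name has_nan_dot is an assumption, and this is not the answer's original code), the array can be flattened and multiplied with itself:

```python
import numpy as np

def has_nan_dot(x):
    """Return True if x contains a NaN, via the dot product of x with itself."""
    flat = np.ravel(x)  # a view for contiguous input, a copy otherwise
    # dot(flat, flat) is the sum of squares. A NaN element makes it NaN,
    # while infinities or overflow only push the result to +inf.
    with np.errstate(invalid="ignore", over="ignore"):
        return bool(np.isnan(np.dot(flat, flat)))

print(has_nan_dot(np.array([[1.0, 2.0], [np.nan, 4.0]])))
print(has_nan_dot(np.array([np.inf, -np.inf, 1.0])))
```

For floating-point input NumPy usually hands this product to BLAS, which is where the speed (and the machine-to-machine variation mentioned above) comes from.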
Even though there is an accepted answer, I'd like to demonstrate the following (with Python 2.7.2 and NumPy 1.6.0 on Vista):
Thus, the really efficient way might be heavily dependent on the operating system. Anyway, the dot(.)-based approach seems to be the most stable one.
If you're comfortable with numba, it allows you to create a fast short-circuiting function (one that stops as soon as a NaN is found).
If there is no NaN, the function might actually be slower than np.min. I think that's because np.min uses multiprocessing for large arrays.
But if there is a NaN in the array, especially at a low index, it's much faster.
Similar results may be achieved with Cython or a C extension. These are a bit more complicated (or easily available as bottleneck.anynan) but ultimately do the same as my anynan function.
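The answer's code is missing from this page. A minimal sketch of such a short-circuiting function might look like the following. The fallback decorator is an assumption added so the example also runs (slowly, as plain Python) when numba is not installed.

```python
import math
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumed fallback for this sketch: run the loop as plain Python.
    def njit(func):
        return func

@njit
def anynan(array):
    """Return True as soon as the first NaN is found, False otherwise."""
    flat = array.ravel()
    for i in range(flat.size):
        if math.isnan(flat[i]):
            return True  # stop early, the rest of the array is not read
    return False

early = np.ones(1000)
early[0] = np.nan
print(anynan(early), anynan(np.ones(1000)))
```

The early exit is what makes this fast when a NaN sits near the start. When there is no NaN, the whole array still has to be scanned.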
Use .any():
if numpy.isnan(myarray).any()
numpy.isfinite may be better than isnan for checking, since it also catches infinities:
if not np.isfinite(prop).all()
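A short sketch showing how the two checks differ on an array that holds an infinity but no NaN (the array contents are made up for the example):

```python
import numpy as np

myarray = np.array([0.0, 1.0, np.inf])

# isnan only flags NaN, so an infinity passes this check.
contains_nan = bool(np.isnan(myarray).any())

# isfinite is False for NaN, +inf and -inf, so this check is stricter.
all_finite = bool(np.isfinite(myarray).all())

print(contains_nan, all_finite)
```

Both variants build a temporary boolean array the size of the input, which is the cost the question wanted to avoid.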
Related to this is the question of how to find the first occurrence of NaN. This is the fastest way to handle that that I know of:
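The code for this answer was not preserved on this page. The following is only a simple sketch of one way to locate the first NaN, not necessarily the fastest and not the author's method. The function name first_nan_index and the -1 convention are assumptions.

```python
import numpy as np

def first_nan_index(x):
    """Return the flat index of the first NaN in x, or -1 if there is none."""
    mask = np.isnan(np.ravel(x))
    if mask.size == 0:
        return -1
    # argmax returns the position of the first True in a boolean array,
    # or 0 if there is no True at all, hence the extra check.
    idx = int(np.argmax(mask))
    return idx if mask[idx] else -1

print(first_nan_index(np.array([1.0, np.nan, np.nan])))
```

This builds a full boolean mask, so for very large arrays a short-circuiting loop (for example compiled with numba) may be preferable.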
Adding to @nico-schlömer's and @mseifert's answers, I measured the performance of a numba test has_nan with early stopping, compared to some of the functions that parse the full array.
On my machine, for an array without nans, the break-even happens at ~10^4 elements.
What happens if the array has nans? I investigated the impact of the nan coverage of the array.
For arrays of length 1,000,000, has_nan becomes the better option if there are ~10^-3 % nans (so ~10 nans) in the array.
If in your application most arrays have nan and you're looking for ones without, then has_nan is the best approach. Otherwise, dot seems to be the best option.