Fastest way to detect if a vector has at least 1 NA?
What is the fastest way to detect if a vector has at least 1 NA in R? I've been using:
sum( is.na( data ) ) > 0
But that requires examining each element, coercion, and the sum function.
As of R 3.1.0, anyNA() is the way to do this. On atomic vectors it stops after the first NA instead of going through the entire vector, as any(is.na()) does. It also avoids creating an intermediate logical vector with is.na() that is immediately discarded. Borrowing Joran's example:
Notice how it is substantially faster even when we modify the last value of the vector; this is in part because it avoids the intermediate logical vector.
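As a minimal sketch of the comparison described in this answer (not Joran's original benchmark), the following builds a hypothetical test vector `x` and checks that both approaches agree. The vector size, the position of the NA, and the use of system.time() are assumptions chosen for illustration, and timings will vary by machine.

```r
# Hypothetical test data: one million values with a single NA near the front.
set.seed(1)
x <- runif(1e6)
x[10] <- NA

# Both expressions answer the same question.
res_any   <- any(is.na(x))  # builds a full logical vector first
res_anyNA <- anyNA(x)       # can stop at the first NA (R >= 3.1.0)
stopifnot(identical(res_any, res_anyNA))

# Rough timing comparison; absolute numbers depend on the machine.
t_any   <- system.time(for (i in 1:50) any(is.na(x)))
t_anyNA <- system.time(for (i in 1:50) anyNA(x))
print(c(any_is_na = t_any[["elapsed"]], anyNA = t_anyNA[["elapsed"]]))
```

Moving the NA to the end of `x` (for example `x[length(x)] <- NA`) is one way to see how much of the difference comes from skipping the intermediate logical vector rather than from stopping early.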
I'm thinking:
should be slightly faster.
We mention this in some of our Rcpp presentations, and we have some benchmarks which show a pretty large gain from embedded C++ with Rcpp over the R solution, for two reasons:
a vectorised R solution still computes every single element of the vector expression;
if your goal is just to satisfy any(), you can abort after the first match, which is what our Rcpp sugar solution does (in essence, Rcpp sugar is some C++ template magic that makes C++ expressions look more like R expressions; see the Rcpp sugar vignette for more).
So by getting a compiled, specialised solution to work, we do indeed get a fast solution. I should add that while I have not compared this to the solutions offered in this SO question, I am reasonably confident about the performance.
Edit: The Rcpp package contains examples in the directory sugarPerformance. For any(), it shows a gain in the several thousands for 'sugar-can-abort-soon' over 'R-computes-full-vector-expression', but I should add that that case does not involve is.na() but a simple boolean expression.
One could write a for loop stopping at NA, but the system.time then depends on where the NA is... (if there is none, it takes looooong)
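A minimal sketch of such a loop, assuming an atomic vector as input; the helper name `has_na_loop` is hypothetical and not from the answer. It illustrates why the running time depends on where the first NA sits.

```r
# Hypothetical helper: scan x and return TRUE at the first NA.
has_na_loop <- function(x) {
  for (i in seq_along(x)) {
    if (is.na(x[i])) return(TRUE)  # early exit on the first NA
  }
  FALSE  # reached only when the loop found no NA
}

has_na_loop(c(NA, 1, 2))  # NA first: exits after one iteration
has_na_loop(c(1, 2, 3))   # no NA: must scan every element
```

Because the loop runs in interpreted R, the no-NA case pays the per-iteration overhead for every element, which is the slow case the answer warns about.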
Here are some actual times from my (slow) machine for some of the various methods discussed so far:
It's not surprising that match and %in% are similar, since %in% is implemented using match.
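The exact expressions that were timed are not shown here. As an assumption about what the match and %in% variants might look like, the following sketch checks a hypothetical vector `x` both ways and confirms they agree.

```r
x <- c(1.5, NA, 3)  # hypothetical example vector

# %in% is built on match(), so these two checks are equivalent.
via_in    <- NA %in% x
via_match <- !is.na(match(NA, x))
stopifnot(identical(via_in, via_match))

# For comparison, the is.na()-based check from the question.
via_sum <- sum(is.na(x)) > 0
```

One difference worth keeping in mind: is.na() is also TRUE for NaN, whereas match() treats NA and NaN as distinct values, so the two families of checks can disagree on vectors that contain NaN but no NA.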
You can try: