Fastest way to detect if a vector has at least 1 NA?

Posted on 2024-11-18 14:41:26

What is the fastest way to detect if a vector has at least 1 NA in R? I've been using:

sum( is.na( data ) ) > 0

But that requires examining each element, coercion, and the sum function.

你在我安 2024-11-25 14:41:26

As of R 3.1.0 anyNA() is the way to do this. On atomic vectors this will stop after the first NA instead of going through the entire vector as would be the case with any(is.na()). Additionally, this avoids creating an intermediate logical vector with is.na that is immediately discarded. Borrowing Joran's example:

x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
#           expr        min         lq        mean      median         uq
#  any(is.na(x))  13444.674  13509.454  21191.9025  13639.3065  13917.592
#       anyNA(x)      6.840     13.187     13.5283     14.1705     14.774
#  any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
#       anyNA(y)   7193.784   7285.107   7694.1785   7497.9265   7865.064

Notice how it is substantially faster even when we modify the last value of the vector; this is in part because of the avoidance of the intermediate logical vector.
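
As a minimal sketch of the behaviour described above (the small vectors below are made up for illustration and are not part of the benchmark), the three idioms agree on the result and differ only in how much work they do. The list case at the end reflects my reading of the recursive argument documented in ?anyNA.

```r
# Illustrative data only; not taken from the benchmark above.
with_na    <- c(1.5, NA, 3)
without_na <- c(1.5, 2, 3)

r_sum   <- sum(is.na(with_na)) > 0  # builds a logical vector, then sums it
r_any   <- any(is.na(with_na))      # builds a logical vector, then scans it
r_anyNA <- anyNA(with_na)           # no intermediate vector (R >= 3.1.0)
r_none  <- anyNA(without_na)

# For lists, an NA nested inside a longer element is only seen
# when recursive = TRUE.
nested <- list(1, c(2, NA))
r_flat <- anyNA(nested)
r_deep <- anyNA(nested, recursive = TRUE)

print(c(r_sum, r_any, r_anyNA, r_none, r_flat, r_deep))
```

For plain atomic vectors, anyNA(x) can be swapped in wherever any(is.na(x)) was used.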

羁〃客ぐ 2024-11-25 14:41:26

I'm thinking:

any(is.na(data))

should be slightly faster.

遮云壑 2024-11-25 14:41:26

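We mention this in some of our Rcpp presentations, and we actually have some benchmarks that show a pretty large gain from embedded C++ with Rcpp over the R solution, because (1) a vectorised R solution still computes every single element of the vector expression, and (2) if your goal is just to satisfy any(), then you can abort after the first match, which is what our Rcpp sugar solution does. (Rcpp sugar is, in essence, some C++ template magic that makes C++ expressions look more like R expressions; see the vignette for more.)

So by getting a compiled, specialised solution to work, we do indeed get a fast solution. I should add that while I have not compared this to the solutions offered in this question, I am reasonably confident about the performance.

Edit: The Rcpp package contains examples in the directory sugarPerformance. For any(), it shows an increase of several thousand for "sugar-can-abort-soon" over "R-computes-full-vector-expression", but I should add that this case does not involve is.na() but a simple boolean expression.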
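
A minimal sketch of the early-exit idea, assuming the Rcpp package and a working C++ toolchain are installed. The function name any_na_cpp is hypothetical and is not taken from the Rcpp examples; it uses the sugar functions any() and is_na(), with is_true() turning the lazy sugar result into a bool.

```r
# Sketch only: needs Rcpp and a C++ compiler; any_na_cpp is a made-up name.
if (requireNamespace("Rcpp", quietly = TRUE)) {
  Rcpp::cppFunction('
    bool any_na_cpp(NumericVector x) {
      // Sugar expression: is_true() evaluates any(), which can stop
      // scanning as soon as one element of is_na(x) is true.
      return is_true(any(is_na(x)));
    }
  ')

  x <- runif(1e6)
  x[10] <- NA                     # an early NA, so the scan can stop quickly
  print(any_na_cpp(x))
  print(any_na_cpp(runif(10)))    # no NA present
}
```

On R 3.1.0 or later, the built-in anyNA() already gives the same early exit without a compile step, so this is mainly of interest as an illustration of the sugar approach.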

软糖 2024-11-25 14:41:26

One could write a for loop that stops at the first NA, but the system.time then depends on where the NA is (if there is none, it takes looooong):

set.seed(1234)
x <- sample(c(1:5, NA), 100000000, replace = TRUE)

# Scan element by element and stop at the first NA
nacount <- function(x) {
  for (i in seq_along(x)) {   # seq_along() is safe for zero-length input
    if (is.na(x[i])) {
      print(TRUE)
      break
    }
  }
}

system.time(
  nacount(x)
)
[1] TRUE
       user      system     elapsed 
       0.14        0.04        0.18 

system.time(
  any(is.na(x))
)
       user      system     elapsed 
       0.28        0.08        0.37 

system.time(
  sum(is.na(x)) > 0
)
       user      system     elapsed 
       0.45        0.07        0.53
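
The loop above prints its result rather than returning it, so it cannot be used directly in an if() test. A hedged variant that returns a logical value instead might look like this; the name has_na_loop is made up for illustration.

```r
# Hypothetical helper: returns TRUE at the first NA, FALSE if none is found.
has_na_loop <- function(x) {
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      return(TRUE)   # early exit: the rest of x is never inspected
    }
  }
  FALSE
}

print(has_na_loop(c(1, NA, 3)))
print(has_na_loop(c(1, 2, 3)))
print(has_na_loop(numeric(0)))   # empty input: nothing to scan
```

An interpreted loop like this still pays a per-element cost in R, which is why it becomes slow when the NA is late in the vector or absent.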

魔 2024-11-25 14:41:26

Here are some actual times from my (slow) machine for some of the various methods discussed so far:

x <- runif(1e7)
x[1e4] <- NA

system.time(sum(is.na(x)) > 0)
   user  system elapsed 
  0.065   0.001   0.065 

system.time(any(is.na(x)))
   user  system elapsed 
  0.035   0.000   0.034 

system.time(match(NA, x))
   user  system elapsed 
  1.824   0.112   1.918 

system.time(NA %in% x)
   user  system elapsed 
  1.828   0.115   1.925 

system.time(which(is.na(x) == TRUE))
   user  system elapsed 
  0.099   0.029   0.127 

It's not surprising that match and %in% are similar, since %in% is implemented using match.
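
To make the relationship between match and %in% concrete, here is a small sketch on a made-up vector (not the benchmark data). It also shows one difference from the is.na-based checks that, as far as I know from ?match, is worth keeping in mind: match treats NA and NaN as distinct values, while is.na() is TRUE for both.

```r
x <- c(0.2, NA, 0.7)   # illustrative vector

pos  <- match(NA, x)                     # position of the first NA
has1 <- NA %in% x                        # logical answer
has2 <- match(NA, x, nomatch = 0L) > 0L  # what %in% computes internally

print(pos)
print(c(has1, has2))

# NaN counts as missing for is.na()/anyNA(), but match() does not
# treat it as equal to NA.
y <- c(1, NaN)
print(NA %in% y)
print(anyNA(y))
```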

那一片橙海， 2024-11-25 14:41:26

You can try:

d <- c(1,2,3,NA,5,3)

which(is.na(d) == TRUE, arr.ind=TRUE)
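
Note that which() returns the positions of the NAs rather than a yes/no answer, so one more step is needed to answer the original question. A minimal sketch follows; the variable names na_pos and has_na are made up. The comparison with TRUE is redundant because is.na() already returns a logical vector, and arr.ind only changes the result when the input is an array.

```r
d <- c(1, 2, 3, NA, 5, 3)

na_pos <- which(is.na(d))      # positions of all NAs
has_na <- length(na_pos) > 0   # TRUE if at least one NA was found

print(na_pos)
print(has_na)
```

This always scans the whole vector, so it is best suited to cases where the positions themselves are needed.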

About the author: 梦中的蝴蝶
