Fastest way to detect whether a vector has at least 1 NA?

Posted 2024-11-18 14:41:26

What is the fastest way to detect if a vector has at least 1 NA in R? I've been using:

sum( is.na( data ) ) > 0

But that requires examining each element, coercion, and the sum function.

你在我安 2024-11-25 14:41:26

As of R 3.1.0 anyNA() is the way to do this. On atomic vectors this will stop after the first NA instead of going through the entire vector as would be the case with any(is.na()). Additionally, this avoids creating an intermediate logical vector with is.na that is immediately discarded. Borrowing Joran's example:

x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
#           expr        min         lq        mean      median         uq
#  any(is.na(x))  13444.674  13509.454  21191.9025  13639.3065  13917.592
#       anyNA(x)      6.840     13.187     13.5283     14.1705     14.774
#  any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
#       anyNA(y)   7193.784   7285.107   7694.1785   7497.9265   7865.064

Notice how it is substantially faster even when we modify the last value of the vector; this is in part because of the avoidance of the intermediate logical vector.
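
For illustration, here is a minimal sketch of calling anyNA() on a few small vectors. The vectors are made up for this example and are not part of the benchmark; the point is only that anyNA(x) gives the same answer as any(is.na(x)) on atomic vectors.

```r
# Small made-up vectors, just to show the calls
x <- c(1, 2, NA, 4)
y <- c(1, 2, 3, 4)

has_na_x <- anyNA(x)   # x contains an NA
has_na_y <- anyNA(y)   # y does not

# On atomic vectors anyNA() agrees with any(is.na())
same_x <- identical(anyNA(x), any(is.na(x)))
same_y <- identical(anyNA(y), any(is.na(y)))

# It is not limited to numeric vectors
has_na_chr <- anyNA(c("a", NA, "c"))

print(c(has_na_x, has_na_y, same_x, same_y, has_na_chr))
```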

羁〃客ぐ 2024-11-25 14:41:26

I'm thinking:

any(is.na(data))

should be slightly faster.

遮云壑 2024-11-25 14:41:26

We mention this in some of our Rcpp presentations and actually have some benchmarks which show a pretty large gain from embedded C++ with Rcpp over the R solution because

  • a vectorised R solution still computes every single element of the vector expression

  • if your goal is to just satisfy any(), then you can abort after the first match -- which is what our Rcpp sugar (in essence: some C++ template magic to make C++ expressions look more like R expressions, see this vignette for more) solution does.

So by getting a compiled specialised solution to work, we do indeed get a fast solution. I should add that while I have not compared this to the solutions offered in this SO question here, I am reasonably confident about the performance.

Edit: The Rcpp package contains examples in the directory sugarPerformance. For any(), it shows a gain in the several thousands for 'sugar-can-abort-soon' over 'R-computes-full-vector-expression', but I should add that that case does not involve is.na(), only a simple boolean expression.
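
As a rough sketch of what such a sugar-based check could look like for the is.na() case (this is not the benchmark from sugarPerformance, and the helper name any_na_cpp is hypothetical), one can compile a one-line function with Rcpp::cppFunction(). It assumes the Rcpp package and a working C++ toolchain are installed; the tryCatch() only keeps the snippet from failing when they are not.

```r
# Hypothetical helper name; needs the Rcpp package and a C++ compiler.
any_na_cpp <- tryCatch(
  Rcpp::cppFunction("
    bool any_na_cpp(NumericVector x) {
      // is_na(x) is a sugar expression; any() reduces it to one logical
      return is_true(any(is_na(x)));
    }
  "),
  error = function(e) NULL  # Rcpp or a compiler is not available
)

if (!is.null(any_na_cpp)) {
  print(any_na_cpp(c(1, 2, NA, 4)))  # vector with an NA
  print(any_na_cpp(c(1, 2, 3)))      # vector without NA
}
```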

软糖 2024-11-25 14:41:26

One could write a for loop stopping at NA, but the system.time then depends on where the NA is... (if there is none, it takes looooong)

set.seed(1234)
x <- sample(c(1:5, NA), 100000000, replace = TRUE)

nacount <- function(x){
  # Scan element by element and stop at the first NA
  for(i in seq_along(x)){
    if(is.na(x[i])) {
      print(TRUE)
      break
    }
  }
}

system.time(
  nacount(x)
)
[1] TRUE
       User      System verstrichen 
       0.14        0.04        0.18 

system.time(
  any(is.na(x))
) 
       User      System verstrichen 
       0.28        0.08        0.37 

system.time(
  sum(is.na(x)) > 0
)
       User      System verstrichen 
       0.45        0.07        0.53 
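
A possible middle ground, sketched here as an assumption and not timed in this thread: scan the vector in blocks, so each block is checked with vectorised code but the scan still stops at the first block that contains an NA. The names any_na_chunked and chunk_size are hypothetical. On R 3.1.0 or later, anyNA() already stops early on atomic vectors, so this is mainly to illustrate the idea.

```r
# Hypothetical helper: block-wise scan with early exit
any_na_chunked <- function(x, chunk_size = 1e5) {
  n <- length(x)
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # Vectorised check on one block only
    if (any(is.na(x[start:end]))) return(TRUE)
    start <- end + 1
  }
  FALSE  # no block contained an NA (also covers length-0 input)
}

set.seed(1234)
x <- sample(c(1:5, NA), 1e6, replace = TRUE)
print(any_na_chunked(x))
print(any_na_chunked(1:10))
```
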
2024-11-25 14:41:26

Here are some actual times from my (slow) machine for some of the various methods discussed so far:

x <- runif(1e7)
x[1e4] <- NA

system.time(sum(is.na(x)) > 0)
   user  system elapsed 
  0.065   0.001   0.065 

system.time(any(is.na(x)))
   user  system elapsed 
  0.035   0.000   0.034 

system.time(match(NA, x))
   user  system elapsed 
  1.824   0.112   1.918 

system.time(NA %in% x)
   user  system elapsed 
  1.828   0.115   1.925 

system.time(which(is.na(x) == TRUE))
   user  system elapsed 
  0.099   0.029   0.127 

It's not surprising that match and %in% are similar, since %in% is implemented using match.
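
To make the relationship concrete, here is a small sketch on a made-up vector: NA %in% x gives the same result as match(NA, x, nomatch = 0L) > 0L, while match(NA, x) itself returns the position of the first NA.

```r
x <- c(3, NA, 7)       # made-up vector with one NA
z <- c(3, 5, 7)        # made-up vector without NA

pos_x <- match(NA, x)  # position of the first NA in x
pos_z <- match(NA, z)  # NA_integer_ when there is no match

# %in% expressed through match()
in_x <- NA %in% x
in_x_via_match <- match(NA, x, nomatch = 0L) > 0L
in_z <- NA %in% z
in_z_via_match <- match(NA, z, nomatch = 0L) > 0L

print(c(in_x, in_x_via_match, in_z, in_z_via_match))
```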

那一片橙海, 2024-11-25 14:41:26

You can try:

d <- c(1,2,3,NA,5,3)

which(is.na(d) == TRUE, arr.ind=TRUE)
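
Note that which() returns the positions of the NAs, not a yes/no answer. A minimal sketch of turning that into the check the question asks for is below; the variable names are only for illustration. The == TRUE comparison is redundant on a logical vector, and arr.ind only matters for arrays, so both are left out here.

```r
d <- c(1, 2, 3, NA, 5, 3)

na_pos <- which(is.na(d))      # positions of NA values
has_na <- length(na_pos) > 0   # TRUE if at least one NA was found

print(na_pos)
print(has_na)
```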