Counting the number of pairwise valid observations (no NA) in a data frame

Posted 2025-01-08 15:13:46

Say I have a data frame like this:

Df <- data.frame(
    V1 = c(1,2,3,NA,5),
    V2 = c(1,2,NA,4,5),
    V3 = c(NA,2,NA,4,NA)
)

Now I want to count the number of valid observations for every combination of two variables. For that, I wrote a function sharedcount:

sharedcount <- function(x,...){
    nx <- names(x)
    alln <- combn(nx,2)
    out <- apply(alln,2,
      function(y)sum(complete.cases(x[y]))
    )
    data.frame(t(alln),out)
}

This gives the output:

> sharedcount(Df)
  X1 X2 out
1 V1 V2   3
2 V1 V3   1
3 V2 V3   2

All fine, but the function itself takes quite long on big data frames (600 variables and about 10000 observations). I have the feeling I'm overlooking an easier approach, especially since cor(...,use='pairwise') still runs a whole lot faster even though it has to do something similar:

> require(rbenchmark)    
> benchmark(sharedcount(TestDf),cor(TestDf,use='pairwise'),
+     columns=c('test','elapsed','relative'),
+     replications=1
+ )
                           test elapsed relative
2 cor(TestDf, use = "pairwise")    0.25     1.0
1           sharedcount(TestDf)    1.90     7.6
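
TestDf itself is not shown in the post. A minimal sketch of a comparable data set, assuming 600 numeric variables, 10000 observations, and about 10% of the values missing at random (the NA rate, the seed, and the normal distribution are assumptions for illustration, not the original data):

```r
# Hypothetical stand-in for TestDf: 10000 observations x 600 variables,
# with one tenth of the values set to NA at random (assumed rate).
set.seed(1)
n_obs <- 10000
n_var <- 600
m <- matrix(rnorm(n_obs * n_var), nrow = n_obs, ncol = n_var)
m[sample(length(m), size = length(m) %/% 10)] <- NA
TestDf <- as.data.frame(m)   # columns are named V1 ... V600
dim(TestDf)
```

Timings on such a data set will differ from the ones quoted here, since they depend on the machine and on the share of missing values.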

Any tips are appreciated.

Note: Using Vincent's trick, I wrote a function that returns the same data frame. The code is in my answer below.

烟雨凡馨 2025-01-15 15:13:46

The following is slightly faster (in the timing below it still takes about 1.7 times as long as cor(Df)):

x <- !is.na(Df)
t(x) %*% x

#       test elapsed relative
#    cor(Df)  12.345 1.000000
# t(x) %*% x  20.736 1.679708
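
To see why the matrix product yields the pairwise counts, here is a small sketch on the Df from the question. Entry [i, j] of the product sums, over all rows, the product of two TRUE/FALSE indicators, so it counts the rows where both column i and column j are present. The variable name counts is chosen here for illustration only.

```r
Df <- data.frame(
    V1 = c(1, 2, 3, NA, 5),
    V2 = c(1, 2, NA, 4, 5),
    V3 = c(NA, 2, NA, 4, NA)
)
x <- !is.na(Df)        # logical matrix: TRUE where a value is present
counts <- t(x) %*% x   # [i, j] = rows where columns i and j are both present
counts
#    V1 V2 V3
# V1  4  3  1
# V2  3  4  2
# V3  1  2  2
```

The diagonal holds the number of non-NA values per column, and the off-diagonal entries are the pairwise counts (each pair appears twice because the matrix is symmetric).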

单身狗的梦 2025-01-15 15:13:46

I thought Vincent's approach looked really elegant, not to mention faster than my sophomoric for-loop, except that it seems to need an extraction step, which I added below. This is just an example of the heavy overhead of the apply method when used with data frames.

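shrcnt <- function(Df) {
    # Each row of Comb holds the column indices of one pair of variables
    Comb <- t(combn(ncol(Df), 2))
    shrd <- integer(nrow(Comb))
    # seq_along() visits every pair; seq_len(shrd) would only use shrd[1]
    for (i in seq_along(shrd)) {
        shrd[i] <- sum(complete.cases(Df[, Comb[i, 1]], Df[, Comb[i, 2]]))
    }
    shrd
}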

x <- !is.na(Df)  # logical matrix from Vincent's answer
benchmark(
    shrcnt(Df), sharedcount(Df), {prs <- t(x) %*% x; prs[lower.tri(prs)]},
    cor(Df,use='pairwise'),
    columns=c('test','elapsed','relative'),
    replications=100
)
 #--------------
                       test elapsed relative
3                         {   0.008      1.0
4 cor(Df, use = "pairwise")   0.020      2.5
2           sharedcount(Df)   0.092     11.5
1                shrcnt(Df)   0.036      4.5
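
The extraction step relies on the strict lower triangle, read column by column, listing the pairs in the same order as combn(). A small sketch that checks this on the Df from the question (the names from_matrix and from_pairs are illustrative):

```r
Df <- data.frame(
    V1 = c(1, 2, 3, NA, 5),
    V2 = c(1, 2, NA, 4, 5),
    V3 = c(NA, 2, NA, 4, NA)
)
x <- !is.na(Df)
prs <- t(x) %*% x
# Strict lower triangle in column-major order: (V2,V1), (V3,V1), (V3,V2)
from_matrix <- prs[lower.tri(prs)]
# The same pairs enumerated with combn(), counted via complete.cases()
pairs <- combn(names(Df), 2)
from_pairs <- apply(pairs, 2, function(y) sum(complete.cases(Df[y])))
data.frame(X1 = pairs[1, ], X2 = pairs[2, ], from_matrix, from_pairs)
```

Both columns of counts agree row by row, so the triangle can be paired with the combn() labels without reordering.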

佼人 2025-01-15 15:13:46

Based on Vincent's lovely trick and DWin's additional lower.tri() suggestion, I came up with the following function, which gives me the same output (i.e. a data frame) as my original one and runs a whole lot faster:

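sharedcount2 <- function(x,stringsAsFactors=FALSE,...){
    # counts[i, j] = number of rows where columns i and j are both non-NA
    counts <- crossprod(!is.na(x))
    # Keep each pair once: the strict lower triangle
    id <- lower.tri(counts)
    count <- counts[id]
    X1 <- colnames(counts)[col(counts)[id]]
    X2 <- rownames(counts)[row(counts)[id]]
    # Pass the argument on; it was previously accepted but never used
    data.frame(X1,X2,count,stringsAsFactors=stringsAsFactors)
}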

Note the use of crossprod(): it is slightly faster than t(x) %*% x, but it computes exactly the same result.
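
A quick way to confirm that equivalence on the small Df from the question (a sketch only; the names a and b are illustrative, and it says nothing about speed):

```r
Df <- data.frame(
    V1 = c(1, 2, 3, NA, 5),
    V2 = c(1, 2, NA, 4, 5),
    V3 = c(NA, 2, NA, 4, NA)
)
x <- !is.na(Df)
a <- crossprod(x)   # computes t(x) %*% x without forming t(x) explicitly
b <- t(x) %*% x
all.equal(a, b)     # TRUE when values and dimnames match
```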

The timings:

> benchmark(sharedcount(TestDf),sharedcount2(TestDf),
+           replications=5,
+           columns=c('test','replications','elapsed','relative'))

                  test replications elapsed relative
1  sharedcount(TestDf)            5   10.00 90.90909
2 sharedcount2(TestDf)            5    0.11  1.00000

Note: I supplied TestDf in the question, as I noticed that the timings differ depending on the size of the data frame. As shown here, the difference in run time is far more dramatic than in a comparison on a small data frame.
