Counting the number of pairwise valid observations (no NA) in a data frame
Say I have a data frame like this:
Df <- data.frame(
  V1 = c(1, 2, 3, NA, 5),
  V2 = c(1, 2, NA, 4, 5),
  V3 = c(NA, 2, NA, 4, NA)
)
Now I want to count the number of valid (non-NA) observations for every combination of two variables. For that, I wrote a function sharedcount:
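sharedcount <- function(x, ...) {
  nx <- names(x)
  alln <- combn(nx, 2)  # every pair of column names, one pair per column
  # for each pair, count the rows that have no NA in either column
  out <- apply(alln, 2,
    function(y) sum(complete.cases(x[y]))
  )
  data.frame(t(alln), out)
}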
This gives the output:
> sharedcount(Df)
  X1 X2 out
1 V1 V2   3
2 V1 V3   1
3 V2 V3   2
All fine, but the function itself takes pretty long on big data frames (600 variables and about 10000 observations). I have the feeling I'm overlooking an easier approach, especially since cor(..., use='pairwise') still runs a whole lot faster even though it has to do something similar:
> require(rbenchmark)
> benchmark(sharedcount(TestDf), cor(TestDf, use='pairwise'),
+     columns=c('test','elapsed','relative'),
+     replications=1
+ )
                           test elapsed relative
2 cor(TestDf, use = "pairwise")    0.25      1.0
1           sharedcount(TestDf)    1.90      7.6
Any tips are appreciated.
Note: Using Vincent's trick, I wrote a function that returns the same data frame. Code in my answer below.
Comments (3)
The following is slightly faster:
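The code that accompanied this answer is missing from this copy of the thread. Judging from the later references to "Vincent's trick" and to %*%, it was probably a matrix product of the non-NA indicator matrix with itself. A minimal sketch under that assumption (the variable names valid and counts are illustrative, not taken from the original answer):

```r
# Example data from the question
Df <- data.frame(
  V1 = c(1, 2, 3, NA, 5),
  V2 = c(1, 2, NA, 4, 5),
  V3 = c(NA, 2, NA, 4, NA)
)

# TRUE where a value is present, FALSE where it is NA
valid <- !is.na(Df)

# Entry [i, j] counts the rows in which both column i and column j are present;
# the diagonal holds the per-column counts of non-NA values
counts <- t(valid) %*% valid
counts
```

This replaces one complete.cases() call per pair of columns with a single matrix multiplication, which is presumably where the speed-up comes from.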
I thought Vincent's looked really elegant, not to mention being faster than my sophomoric for-loop, except that it seems to need an extraction step, which I added below. This is just one example of the heavy overhead of apply when it is used with data frames.
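The extraction step mentioned here is also missing from this copy. Since a later answer credits DWin with the lower.tri() suggestion, it presumably pulled the pairwise counts out of the symmetric count matrix. A sketch of that idea, assuming the matrix-product approach and using illustrative variable names:

```r
# Example data from the question
Df <- data.frame(
  V1 = c(1, 2, 3, NA, 5),
  V2 = c(1, 2, NA, 4, 5),
  V3 = c(NA, 2, NA, 4, NA)
)

valid <- !is.na(Df)
counts <- t(valid) %*% valid

# The matrix is symmetric and its diagonal holds single-column counts,
# so the strictly lower triangle contains each pair exactly once
pair_counts <- counts[lower.tri(counts)]
pair_counts
```

Because R stores matrices column by column, the extracted values come out in the order (V1, V2), (V1, V3), (V2, V3), which matches the order produced by combn() in the question.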
Based on the lovely trick of Vincent and the additional lower.tri() suggestion of DWin, I came up with the following function, which gives me the same output (i.e. a data frame) as my original one and runs a whole lot faster:
Note the use of crossprod(), which gives a small improvement compared to %*% but does exactly the same thing (crossprod(x) computes t(x) %*% x).
The timings:
Note: I supplied TestDf in the question because I noticed that the timings differ depending on the size of the data frame. As shown here, the difference in timing is a lot more dramatic than in a comparison on a small data frame.
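The function and its timings did not survive in this copy of the answer. The sketch below only illustrates what the description suggests: crossprod() on the non-NA indicator matrix, lower.tri() to pick each pair once, and a data frame with the same columns as the original sharedcount() output. The name sharedcount_fast is hypothetical and the body is a reconstruction, not the author's code.

```r
# Example data from the question
Df <- data.frame(
  V1 = c(1, 2, 3, NA, 5),
  V2 = c(1, 2, NA, 4, 5),
  V3 = c(NA, 2, NA, 4, NA)
)

# Hypothetical reconstruction, not the function from the original answer
sharedcount_fast <- function(x) {
  # crossprod(m) computes t(m) %*% m: pairwise counts of jointly present rows
  counts <- crossprod(!is.na(x))
  # strictly lower triangle: each pair of columns exactly once
  keep <- lower.tri(counts)
  data.frame(
    X1 = colnames(counts)[col(counts)[keep]],
    X2 = rownames(counts)[row(counts)[keep]],
    out = counts[keep]
  )
}

res <- sharedcount_fast(Df)
res
```

On the example Df this yields the same three pairs and counts as sharedcount(Df). One difference to be aware of: crossprod() returns doubles, so the out column is numeric rather than integer.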