Using as.data.frame on table() to summarize frequencies
In R, I'm looking for a memory-efficient way to create a summary of tabular data as follows.
Take for example the data.frame foo, which I've summarized with table() followed by as.data.frame() to obtain the frequency counts.
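# Example data: x has 2 distinct values, y has 6
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'), y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
# Cross-tabulate, then flatten to one row per (x, y) combination
bar <- as.data.frame(table(foo), stringsAsFactors = FALSE)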
This results in the following frequency counts in bar:
   x  y Freq
1  a ab    1
2  b ab    0
3  a ac    1
4  b ac    0
5  a ad    1
6  b ad    0
7  a ae    0
8  b ae    1
9  a fx    0
10 b fx    1
11 a fy    0
12 b fy    1
The problem I'm running into is that when x and y have many levels, this starts using significant amounts of memory (>64 GB). I was wondering if there is an alternative way of doing this kind of frequency count. As a first step, I set stringsAsFactors=FALSE, but this doesn't completely solve the problem.
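To make the growth concrete, here is a small sketch (the names dense_cells and observed are made up for illustration). table() allocates one cell for every combination of levels, whether or not that combination occurs, so the flattened result has one row per possible pair, while only the observed pairs carry non-zero counts.

```r
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'),
                  stringsAsFactors = FALSE)

# Number of cells table() has to allocate: every possible (x, y) pair
dense_cells <- length(unique(foo$x)) * length(unique(foo$y))

# Number of (x, y) pairs that actually occur in the data
observed <- nrow(unique(foo))

# The flattened table has one row per possible pair, including the zeros
bar <- as.data.frame(table(foo), stringsAsFactors = FALSE)

cat("rows in bar:", nrow(bar), "\n")
cat("observed pairs:", observed, "\n")
```

With many levels in both columns the first number grows as a product, while the second can never exceed the number of rows in the data.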
Comments (3)
I have this method for fast (sparse) cross-tabulation. I think there are possibilities for further optimisation, but it's been good enough for me for large data sets. The key is the use of ninteraction from the plyr package to quickly generate a numeric id for each row.
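The code for this method is not shown here. As a rough sketch of the idea only, and using base R factor codes as a stand-in for plyr's ninteraction, each row can be given a numeric id and only the ids that actually occur are counted. The names xf, yf, id, uid and sparse_counts are made up for this illustration and are not taken from the answer.

```r
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'),
                  stringsAsFactors = FALSE)

xf <- factor(foo$x)
yf <- factor(foo$y)
nx <- nlevels(xf)

# One numeric id per row, built from the factor codes (assumed stand-in for
# plyr's ninteraction); doubles are used to avoid integer overflow
id <- (as.numeric(yf) - 1) * nx + as.numeric(xf)

# Count only the ids that occur; no cell is created for unseen combinations
uid <- unique(id)
Freq <- tabulate(match(id, uid), nbins = length(uid))

# Decode each id back into its x and y levels
sparse_counts <- data.frame(
  x = levels(xf)[(uid - 1) %% nx + 1],
  y = levels(yf)[(uid - 1) %/% nx + 1],
  Freq = Freq,
  stringsAsFactors = FALSE
)

print(sparse_counts)
```

The result has one row per observed (x, y) pair, so its size is bounded by the number of rows in the input rather than by the product of the level counts.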
Look at the xtabs method in the Matrix package, which does sparse cross-tabulation.
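A minimal sketch of that suggestion, assuming a current R installation where, as far as I know, sparse cross-tabulation is reached through xtabs() with sparse = TRUE and returns a sparse matrix from the Matrix package. The names m, nz and bar_sparse are made up for this illustration.

```r
library(Matrix)  # sparse matrix classes and their summary() method

foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'),
                  stringsAsFactors = FALSE)

# Sparse cross-tabulation: only non-zero cells are stored
m <- xtabs(~ x + y, data = foo, sparse = TRUE)

# summary() on a sparse matrix gives (i, j, x) triplets for the stored cells
nz <- summary(m)

# Map row and column indices back to level names
bar_sparse <- data.frame(x = rownames(m)[nz$i],
                         y = colnames(m)[nz$j],
                         Freq = nz$x,
                         stringsAsFactors = FALSE)

print(bar_sparse)
```

Unlike as.data.frame(table(foo)), this keeps only the combinations that occur, so the zero-count rows never get materialized.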