as.data.frame of table() to summarize frequencies

Posted on 2024-08-30 20:28:26

In R, I'm looking for a memory-efficient way to create a summary of tabular data as follows.

Take for example the data.frame foo which I've used table() to summarize, followed by as.data.frame() to obtain the frequency counts.

foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'), y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- as.data.frame(table(foo), stringsAsFactors = FALSE)

This results in the following frequency counts in bar:

   x  y Freq
1  a ab    1
2  b ab    0
3  a ac    1
4  b ac    0
5  a ad    1
6  b ad    0
7  a ae    0
8  b ae    1
9  a fx    0
10 b fx    1
11 a fy    0
12 b fy    1

The problem I'm running into is that when x and y have many levels, this starts using a significant amount of memory (more than 64 GB). Is there an alternative way of doing this kind of frequency count? As a first step I set stringsAsFactors = FALSE, but this doesn't completely solve the problem.
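
To make the memory growth concrete, here is a rough back-of-the-envelope sketch. The level counts are hypothetical (they are not taken from the question); the only facts used are that table() allocates one cell per combination of levels and that R stores each integer count in 4 bytes.

```r
# Hypothetical numbers of distinct levels in x and y (assumed for illustration)
n_x <- 1e5
n_y <- 1e5

# table() allocates one integer cell per (x, y) combination,
# whether or not that combination occurs in the data
cells <- n_x * n_y

# R integers take 4 bytes each
bytes <- cells * 4
gb <- bytes / 1e9

cat("cells:", cells, " approx. size of the dense table in GB:", gb, "\n")
```

as.data.frame() then expands that table to one row per cell, so the zero-count rows dominate the result when most combinations never occur.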

丑疤怪 2024-09-06 20:28:26

I have this method for fast (sparse) cross tabulation. I think there are possibilities for further optimisation, but it's been good enough for me for large data sets. The key is the use of ninteraction from the plyr package to quickly generate a numeric id for each row.

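tab <- function(df, drop = TRUE) {
  # Note: the drop argument is not used in the body below
  # One numeric id per row; identical rows get the same id
  id <- plyr::ninteraction(df)
  ord <- order(id)

  # Sort rows and ids together so identical rows are adjacent
  df <- df[ord, , drop = FALSE]
  id <- id[ord]

  # Run lengths of the sorted ids are the frequencies
  freq <- rle(id)$lengths
  # Take the last row of each run as its label (unrowname comes from plyr)
  labels <- plyr::unrowname(df[cumsum(freq), , drop = FALSE])

  data.frame(labels, freq)
}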
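
For readers without plyr, the same sort-then-run-length idea can be sketched in base R. This is not the answer's function: sparse_count is a hypothetical name, and it swaps ninteraction for a pasted string key, which assumes the separator "\r" never occurs in the data.

```r
# Base-R sketch: count only the combinations that actually occur.
sparse_count <- function(df) {
  cols <- unname(as.list(df))

  # Sort rows by all columns so identical rows become adjacent
  ord <- do.call(order, cols)
  df <- df[ord, , drop = FALSE]

  # One string key per row; "\r" is an assumed separator absent from the data
  key <- do.call(paste, c(unname(as.list(df)), sep = "\r"))

  # Run lengths of the sorted keys are the frequencies
  freq <- rle(key)$lengths

  # Use the last row of each run as its label
  labels <- df[cumsum(freq), , drop = FALSE]
  rownames(labels) <- NULL

  data.frame(labels, Freq = freq)
}

foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- sparse_count(foo)
print(bar)
```

Unlike as.data.frame(table(foo)), the result has one row per combination present in foo and no zero-count rows.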

心碎无痕… 2024-09-06 20:28:26

Take a look at the xtabs method in the Matrix package, which performs sparse cross-tabulation.
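
A minimal sketch of what this could look like, assuming a current R installation where stats::xtabs accepts sparse = TRUE (it needs the Matrix package and works for two factors only). The conversion back to a data frame through summary() is an illustration added here, not part of the answer.

```r
library(Matrix)  # provides the sparse matrix classes used by xtabs(sparse = TRUE)

foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))

# Sparse cross-tabulation: only non-zero cells are stored
m <- xtabs(~ x + y, data = foo, sparse = TRUE)

# summary() of a sparse matrix lists the non-zero cells as (i, j, x) triplets
nz <- summary(m)

# Map row/column indices back to the level labels
bar <- data.frame(x = rownames(m)[nz$i],
                  y = colnames(m)[nz$j],
                  Freq = nz$x)
print(bar)
```

Only the combinations that occur end up in bar, so the zero-count rows from the dense table are never materialized.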

旧时浪漫 2024-09-06 20:28:26

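library(plyr)
# Count rows per x/y combination; per the plyr docs, .drop = FALSE keeps combinations that do not occur
ddply(foo, ~ x + y, nrow, .drop = FALSE)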

About the author: 橘虞初梦
