Speeding up the reshaping of person-format to period-format data frames in R

Posted on 2024-12-10 17:00:22

I have a dataset with longitudinal data in a person-oriented format, as such:

pid varA_1 varB_1 varA_2 varB_2 varA_3 varB_3 ...
1   1      1      0      3      2      1
2   0      1      0      2      2      1
...
50k 1      0      1      3      1      0

This results in a large data frame, with at least 50k observations and 90 variables measured for up to 29 periods.

I would like to get a more period-oriented format, as such:

pid index start stop varA varB varC ...
1   1     ...
1   2     
...
1   29
2   1

I have tried different approaches for reshaping the data frame (*apply, plyr, reshape2, loops, appending vs. prefilling all-numeric matrices, etc.), but I do not seem to get a decent processing time (40+ minutes for subsets). I have picked up various hints along the way on what to avoid, but I'm still not sure whether I am missing some bottleneck or possible speedup.

Is there an optimal way to approach this kind of data processing, so that I can evaluate the best-case processing time I can achieve in pure R code? There have been similar questions on Stack Overflow, but they did not result in convincing answers...

瘫痪情歌 2024-12-17 17:00:22

First, let's build the data example (I am using 5e3 instead of 50e3 to avoid memory problems with my configuration):

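# Dimensions of the simulated data set
nObs <- 5e3
nVar <- 90
nPeriods <- 29

# Wide data: one row per person, one column per variable/period combination
dat <- matrix(rnorm(nObs*nVar*nPeriods), nrow=nObs, ncol=nVar*nPeriods)

df <- data.frame(id=seq_len(nObs), dat)

# Column names of the form Var1_T1, Var2_T1, ..., Var90_T29
nmsV <- paste('Var', seq_len(nVar), sep='')
nmsPeriods <- paste('T', seq_len(nPeriods), sep='')

nms <- c(outer(nmsV, nmsPeriods, paste, sep='_'))
names(df)[-1] <- nms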

Now change the format with stats::reshape, which splits each column name at "_" into a variable name and a period label:

df2 <- reshape(df, direction = "long", varying = 2:((nVar*nPeriods)+1), sep = "_")

I am not sure if this is the fast solution you are looking for.
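
If reshape turns out to be too slow, one possible pure-R alternative is to treat the wide data as an all-numeric matrix and rearrange it with array() and aperm(). The following is only a minimal sketch on toy data. It assumes every variable is numeric, every variable is present for every period, and the columns are ordered as in the example above (all variables for period 1, then all variables for period 2, and so on). The names arr, long and long_df are made up for illustration.

```r
# Toy data in the same layout as above: Var1_T1, Var2_T1, Var3_T1, Var1_T2, ...
nObs <- 4
nVar <- 3
nPeriods <- 2
set.seed(1)
dat <- matrix(rnorm(nObs * nVar * nPeriods), nrow = nObs)
colnames(dat) <- c(outer(paste0("Var", seq_len(nVar)),
                         paste0("T", seq_len(nPeriods)), paste, sep = "_"))

# View the wide matrix as a 3-d array indexed [person, variable, period]
arr <- array(dat, dim = c(nObs, nVar, nPeriods))

# Reorder to [period, person, variable] so that periods vary fastest
arr <- aperm(arr, c(3, 1, 2))

# Collapse the first two dimensions into rows: one row per person and period
long <- matrix(arr, nrow = nObs * nPeriods, ncol = nVar,
               dimnames = list(NULL, paste0("Var", seq_len(nVar))))

long_df <- data.frame(id    = rep(seq_len(nObs), each = nPeriods),
                      index = rep(seq_len(nPeriods), times = nObs),
                      long)
```

Because this only reshuffles one numeric vector, it avoids per-column or per-row work in R; wrapping both approaches in system.time() on a subset would show whether it helps on real data.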


寒尘 2024-12-17 17:00:22

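If the data fit in memory, the venerable stack() function can be very fast.

For large data sets, it is best to use a (transparent) SQLite database as an intermediate step. Try Gabor's sqldf package; there are plenty of examples on Google Code.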

http://code.google.com/p/sqldf/
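
As a rough illustration of the stack() idea, here is a minimal sketch for a single variable measured in three periods. The toy values and the names wide, st and long are made up for illustration, and the period number is assumed to be the suffix after "varA_" as in the question.

```r
# Toy wide data: one row per person, one column per period of varA
wide <- data.frame(pid    = 1:3,
                   varA_1 = c(1, 0, 1),
                   varA_2 = c(0, 0, 1),
                   varA_3 = c(2, 2, 1))

# stack() concatenates the columns into 'values' and records the
# source column name in the factor 'ind'
st <- stack(wide[, -1])

long <- data.frame(pid   = rep(wide$pid, times = ncol(wide) - 1),
                   index = as.integer(sub("^varA_", "", as.character(st$ind))),
                   varA  = st$values)

# Sort to the person-period order requested in the question
long <- long[order(long$pid, long$index), ]
```

With several variables, the same step would have to be repeated per variable and the results combined column-wise, so it is worth timing this against reshape on a subset first.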

About the author

小霸王臭丫头
