R: Splitting an unbalanced list in a data.frame column
Suppose you have a data frame with the following structure:
df <- data.frame(id=c(1,2,3,4), b=c("job1;job2", "job1a", "job4;job5;job6", "job9;job10;job11"))
where the column b is a semicolon-delimited list (unbalanced by row). The ideal data.frame would be:
id,job,jobNum
1,job1,1
1,job2,2
...
3,job6,3
4,job9,1
4,job10,2
4,job11,3
I have a partial solution that takes almost 2 hours on 170K rows:
library(plyr)
# Split the column on the semicolon. The result is a list column.
df$allJobs <- strsplit(as.character(df$b), ";", fixed=TRUE)
# Reshape one row, whose job column is a list, into a data.frame
simpleStack <- function(data){
  start <- as.data.frame.list(data)
  names(start) <- c("id", "job")
  return(start)
}
# plyr!
system.time(df2 <- ddply(df, .(id), simpleStack))
It appears to be a size issue, because if I run
system.time(df2 <- ddply(df[1:4000,c("id", "allJobs")], .(id), simpleStack))
it only takes 9 seconds. Converting first to a list of data.frames with sapply (with a different function) is fast, but the rbind that is then required takes even longer.
Comments (2)
cSplit from my "splitstackshape" package is designed to handle this sort of data manipulation. Here it is in action on this question:
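A minimal sketch of such a call, assuming the splitstackshape and data.table packages are installed. It uses the question's sample data with the first column named id to match the desired output. The rename to job and the jobNum counter are plain data.table steps added here for illustration and are not part of cSplit itself:

```r
library(splitstackshape)  # provides cSplit()
library(data.table)       # cSplit() returns a data.table; needed for setnames() and :=

# Sample data from the question (first column named id, as in the desired output)
df <- data.frame(
  id = c(1, 2, 3, 4),
  b  = c("job1;job2", "job1a", "job4;job5;job6", "job9;job10;job11"),
  stringsAsFactors = FALSE
)

# direction = "long" stacks the split values into rows instead of new columns
long <- cSplit(df, splitCols = "b", sep = ";", direction = "long")
setnames(long, "b", "job")

# Number the jobs within each id (assumption: row order within id is the split order)
long[, jobNum := seq_len(.N), by = id]
print(long)
```

Because the split is done on the whole column at once rather than one row at a time, this avoids building and binding one small data.frame per id.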
You can also use strsplit within "dplyr", and then follow up with unnest from "tidyr", like this:
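A minimal sketch of that pipeline, assuming dplyr and tidyr are installed. It uses the question's sample data with the first column named id to match the desired output; the jobNum step with row_number() is an addition for illustration, so that the result has all three requested columns:

```r
library(dplyr)
library(tidyr)

# Sample data from the question (first column named id, as in the desired output)
df <- data.frame(
  id = c(1, 2, 3, 4),
  b  = c("job1;job2", "job1a", "job4;job5;job6", "job9;job10;job11"),
  stringsAsFactors = FALSE
)

long <- df %>%
  # strsplit() returns a list, so job becomes a list column
  mutate(job = strsplit(as.character(b), ";", fixed = TRUE)) %>%
  select(id, job) %>%
  # unnest() expands the list column to one row per element
  unnest(job) %>%
  # Number the jobs within each id
  group_by(id) %>%
  mutate(jobNum = row_number()) %>%
  ungroup()

print(long)
```

The as.character() call is a precaution: on R versions before 4.0, data.frame() turns strings into factors by default, and strsplit() does not accept factors.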