Reshaping data based on a column in a data frame
I need to take a data.frame in the format of:
      id1 id2 mean start end
    1   A   D    4    12  15
    2   B   E    5    14  15
    3   C   F    6     8  10
and generate duplicate rows based on the difference between start and end. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in sequential order in the final data.frame. The end result for this data.frame should be:
       id1 id2 mean start end
    1    A   D    4    12  13
    2    A   D    4    13  14
    3    A   D    4    14  15
    21   B   E    5    14  15
    31   C   F    6     8   9
    32   C   F    6     9  10
I have written this function, which works, but it isn't written in very R-esque code:

    dupData <- function(df){
      diff <- abs(df$start - df$end)
      ret <- {}
      # Expand our dataframe into the appropriate number of rows.
      for (i in 1:nrow(df)){
        for (j in 1:diff[i]){
          ret <- rbind(ret, df[i,])
        }
      }
      # If matching ID1 and ID2, generate a sequential ordering of start & end dates
      for (k in 2:nrow(ret) - 1) {
        if ( ret[k, 1] == ret[k + 1, 1] & ret[k, 2] == ret[k + 1, 2] ){
          ret[k, 5] <- ret[k, 4] + 1
          ret[k + 1, 4] <- ret[k, 5]
        }
      }
      return(ret)
    }
Does anyone have suggestions on how to optimize this code? Is there a function in plyr which may be applicable?
    # Sample data
    df <- data.frame(id1 = c("A", "B", "C")
                     , id2 = c("D", "E", "F")
                     , mean = c(4, 5, 6)
                     , start = c(12, 14, 8)
                     , end = c(15, 15, 10)
    )
Comments (4)
There's probably a more general way to do this, but one approach uses rbind.fill (from plyr).
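As a rough illustration of what an rbind.fill approach might look like (a sketch, not the answerer's original code), the snippet below builds one small data frame per input row and stacks the pieces. It assumes the plyr package is installed, that df is the sample data from the question, and that end is strictly greater than start in every row; the names pieces and result are made up for this example.

```r
library(plyr)

# Sample data from the question
df <- data.frame(id1 = c("A", "B", "C"),
                 id2 = c("D", "E", "F"),
                 mean = c(4, 5, 6),
                 start = c(12, 14, 8),
                 end = c(15, 15, 10))

# For each input row, create one output row per unit interval
# between start and end (assumes end > start).
pieces <- lapply(seq_len(nrow(df)), function(i) {
  row <- df[i, ]
  starts <- seq(row$start, row$end - 1)
  data.frame(id1 = row$id1, id2 = row$id2, mean = row$mean,
             start = starts, end = starts + 1)
})

# rbind.fill() accepts a list of data frames and stacks them.
result <- rbind.fill(pieces)
print(result)
```

Because every piece has the same columns here, base R's do.call(rbind, pieces) would likely serve equally well; rbind.fill mainly helps when the pieces differ in their columns.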

The survSplit function of the survival package does something along these lines, though it has a few more options (e.g. specifying the cut times). You might be able to use it, or look at its code to see if you can implement your simplified version better.

No doubt this isn't one of those times where late is better than never, but I had a similar issue and came up with this...

Two alternatives, many years later, using today's popular data.table and tidyverse packages:

Option 1:
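A minimal data.table sketch of this kind of approach (not necessarily the answerer's original code). It assumes the data.table package is installed, that end is greater than start in every row, and that each (id1, id2) pair identifies exactly one input row; dt and result are illustrative names.

```r
library(data.table)

# Sample data from the question
dt <- data.table(id1 = c("A", "B", "C"),
                 id2 = c("D", "E", "F"),
                 mean = c(4, 5, 6),
                 start = c(12, 14, 8),
                 end = c(15, 15, 10))

# Repeat each row (end - start) times.
result <- dt[rep(seq_len(.N), end - start)]

# Within each original row, step start forward by one per copy,
# then set end to one past the new start.
result[, start := start + seq_len(.N) - 1, by = .(id1, id2)]
result[, end := start + 1]
print(result)
```

If the (id1, id2) pair is not unique in real data, grouping on an explicit row identifier added before the expansion would be the safer choice.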
Option 2:
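A comparable tidyverse sketch (again an illustration under assumptions, not the answerer's original code). It assumes dplyr and tidyr are installed and that end is greater than start in every row; the helper columns n and step are made up for this example and dropped at the end.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(id1 = c("A", "B", "C"),
                 id2 = c("D", "E", "F"),
                 mean = c(4, 5, 6),
                 start = c(12, 14, 8),
                 end = c(15, 15, 10))

result <- df %>%
  mutate(n = end - start) %>%            # number of copies per row
  uncount(n, .id = "step") %>%           # duplicate rows; step counts 1, 2, ... per original row
  mutate(start = start + step - 1,       # shift start by one per copy
         end = start + 1) %>%            # end uses the already-updated start
  select(-step)
print(result)
```

uncount() removes the weights column n by default, and its .id argument supplies the within-row counter, so no grouping is needed.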