根据数据框中的列重塑数据

发布于 2024-09-14 03:07:58 字数 1404 浏览 8 评论 0原文

我需要采用以下格式的 data.frame：

  id1 id2 mean start end
1   A   D    4    12  15
2   B   E    5    14  15
3   C   F    6     8  10

并根据 start - end 的差异生成重复行。例如，第一行需要 3 行，第二行需要 1 行，第三行需要 2 行。开始和结束字段应在最终 data.frame 中按顺序排列。这个 data.frame 的最终结果应该是：

   id1 id2 mean start end
1    A   D    4    12  13
2    A   D    4    13  14
3    A   D    4    14  15
21   B   E    5    14  15
31   C   F    6     8   9
32   C   F    6     9  10

我已经编写了这个可以工作的函数，但不是用非常 R'esque 的代码编写的：

dupData <- function(df){
    diff <- abs(df$start - df$end)
    ret <- {}

    #Expand our dataframe into the appropriate number of rows.
    for (i in 1:nrow(df)){
        for (j in 1:diff[i]){
            ret <- rbind(ret, df[i,])
        } 
    }

    #If matching ID1 and ID2, generate a sequential ordering of start & end dates
    for (k in 2:nrow(ret) - 1) {
        if ( ret[k,1] == ret[k + 1, 1] & ret[k, 2] == ret[k, 2]  ){ 
            ret[k, 5] <- ret[k, 4] + 1
            ret[k + 1, 4] <- ret[k, 5]  
        }
    }
    return(ret)
}

有人对如何优化此代码有建议吗？ plyr 中有可能适用的函数吗？

#sample daters
df <- data.frame(id1 = c("A", "B", "C")
        , id2 = c("D", "E", "F")
        , mean = c(4,5,6)  
        , start = c(12,14,8)
        , end = c(15, 15, 10)
)

原文

I need to take a data.frame in the format of:

  id1 id2 mean start end
1   A   D    4    12  15
2   B   E    5    14  15
3   C   F    6     8  10

and generate duplicate rows based on the difference in start - end. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in sequential order in the final data.frame. The end result for this data.frame should be:

   id1 id2 mean start end
1    A   D    4    12  13
2    A   D    4    13  14
3    A   D    4    14  15
21   B   E    5    14  15
31   C   F    6     8   9
32   C   F    6     9  10

I have written this function which works, but isn't written in very R'esque code:

dupData <- function(df){
    diff <- abs(df$start - df$end)
    ret <- {}

    #Expand our dataframe into the appropriate number of rows.
    for (i in 1:nrow(df)){
        for (j in 1:diff[i]){
            ret <- rbind(ret, df[i,])
        } 
    }

    #If matching ID1 and ID2, generate a sequential ordering of start & end dates
    for (k in 2:nrow(ret) - 1) {
        if ( ret[k,1] == ret[k + 1, 1] & ret[k, 2] == ret[k, 2]  ){ 
            ret[k, 5] <- ret[k, 4] + 1
            ret[k + 1, 4] <- ret[k, 5]  
        }
    }
    return(ret)
}

Does anyone have suggestions on how to optimize this code? Is there a function in plyr which may be applicable?

#sample daters
df <- data.frame(id1 = c("A", "B", "C")
        , id2 = c("D", "E", "F")
        , mean = c(4,5,6)  
        , start = c(12,14,8)
        , end = c(15, 15, 10)
)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

雪化雨蝶 2024-09-21 03:07:58

可能有一种更通用的方法可以做到这一点，但下面使用 rbind.fill 。

cbind(df[rep(1:nrow(df), times = apply(df[,4:5], 1, diff)), 1:3],
      rbind.fill(apply(df[,4:5], 1, function(x)
                       data.frame(start = x[1]:(x[2]-1), end = (x[1]+1):x[2]))))


##     id1 id2 mean start end
## 1     A   D    4    12  13
## 1.1   A   D    4    13  14
## 1.2   A   D    4    14  15
## 2     B   E    5    14  15
## 3     C   F    6     8   9
## 3.1   C   F    6     9  10

There's probably a more general way to do this, but below uses rbind.fill.

cbind(df[rep(1:nrow(df), times = apply(df[,4:5], 1, diff)), 1:3],
      rbind.fill(apply(df[,4:5], 1, function(x)
                       data.frame(start = x[1]:(x[2]-1), end = (x[1]+1):x[2]))))


##     id1 id2 mean start end
## 1     A   D    4    12  13
## 1.1   A   D    4    13  14
## 1.2   A   D    4    14  15
## 2     B   E    5    14  15
## 3     C   F    6     8   9
## 3.1   C   F    6     9  10

回复收藏 0 原文

内心荒芜 2024-09-21 03:07:58

survival 包的 survSplit 函数按照这些思路做了一些事情，尽管它有更多的选项（例如指定剪切时间）。您也许可以使用它，或者查看它的代码，看看是否可以更好地实现您的简化版本。

回复收藏 0 原文

怪我鬧 2024-09-21 03:07:58

毫无疑问，这不是迟到总比不到好的时候，但我遇到了类似的问题并想出了这个......

library(plyr)
ddply(df, c("id1", "id2", "mean", "start", "end"), summarise,
                    sq=seq(1:(end-start)))

No doubt this isn't one of those times where late is better than never, but i had a similar issue and came up with this...

library(plyr)
ddply(df, c("id1", "id2", "mean", "start", "end"), summarise,
                    sq=seq(1:(end-start)))

回复收藏 0 原文

陪我终i 2024-09-21 03:07:58

两种替代方案，许多年后，使用当今流行的 data.table 和 tidyverse 包提供替代方案：

选项 1：

library(data.table)
setDT(mydf)[, list(mean, start = start:(end-1)), .(id1, id2)][, end := start + 1][]
   id1 id2 mean start end
1:   A   D    4    12  13
2:   A   D    4    13  14
3:   A   D    4    14  15
4:   B   E    5    14  15
5:   C   F    6     8   9
6:   C   F    6     9  10

选项 2：

library(tidyverse)
mydf %>% 
  group_by(id1, id2, mean) %>% 
  summarise(start = list(start:(end-1))) %>% 
  unnest(start) %>% 
  mutate(end = start+1)

Two alternatives, many years later, offering alternatives using today's popular data.table and tidyverse packages:

Option 1:

library(data.table)
setDT(mydf)[, list(mean, start = start:(end-1)), .(id1, id2)][, end := start + 1][]
   id1 id2 mean start end
1:   A   D    4    12  13
2:   A   D    4    13  14
3:   A   D    4    14  15
4:   B   E    5    14  15
5:   C   F    6     8   9
6:   C   F    6     9  10

Option 2:

library(tidyverse)
mydf %>% 
  group_by(id1, id2, mean) %>% 
  summarise(start = list(start:(end-1))) %>% 
  unnest(start) %>% 
  mutate(end = start+1)

回复收藏 0 原文

~没有更多了~