Summarising grouped records in an R data frame (...again)

Posted 2024-08-29 04:26:06

(I tried to ask this question earlier today, but later realised I over-simplified the question; the answers I received were correct, but I couldn't use them because of my over-simplification of the problem in the original question. Here's my 2nd attempt...)

I have a data frame in R that looks like:

"Timestamp", "Source", "Target", "Length", "Content"
0.1        , P1      , P2      , 5       , "ABCDE"
0.2        , P1      , P2      , 3       , "HIJ"
0.4        , P1      , P2      , 4       , "PQRS"
0.5        , P2      , P1      , 2       , "ZY"
0.9        , P2      , P1      , 4       , "SRQP"
1.1        , P1      , P2      , 1       , "B"
1.6        , P1      , P2      , 3       , "DEF"
2.0        , P2      , P1      , 3       , "IJK"
...

and I want to convert this to:

"StartTime", "EndTime", "Duration", "Source", "Target", "Length", "Content"
0.1        , 0.4      , 0.3       , P1      , P2      , 12      , "ABCDEHIJPQRS"
0.5        , 0.9      , 0.4       , P2      , P1      , 6       , "ZYSRQP"
1.1        , 1.6      , 0.5       , P1      , P2      , 4       , "BDEF"
...

Trying to put this into English, I want to group consecutive records with the same 'Source' and 'Target' together, then print out a single record per group showing the StartTime, EndTime & Duration (=EndTime-StartTime) for that group, along with the sum of the Lengths for that group, and a concatenation of the Content (which will all be strings) in that group.

The Timestamp values will always increase throughout the data frame.

I had a look at melt/recast and have a feeling that it could be used to solve the problem, but couldn't get my head around the documentation. I suspect it's possible to do this within R, but I really don't know where to start. In a pinch I could export the data frame out and do it in e.g. Python, but I'd prefer to stay within R if possible.

Thanks in advance for any assistance you can provide

背叛残局 2024-09-05 04:26:06

Here's another solution using plyr:

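# Build one key per row from the Source/Target pair
id <- with(df1, paste(Source, Target))
# Run id: increases by one whenever the key differs from the previous row's key
df1$group <- cumsum(c(TRUE, id[-1] != id[-length(id)]))

library(plyr)
# One output row per run of consecutive records
ddply(df1, c("group"), summarise, 
  start = min(Timestamp),
  end = max(Timestamp),
  # collapse = ", " separates the pieces; use "" for the plain concatenation shown in the question
  content = paste(Content, collapse = ", ")
)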
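
For what it's worth, the same run-id idea can be stretched to cover every column the question asks for. The following is only a sketch in base R, not the answerer's code: the sample data is retyped from the question, and `summarise_run` is a made-up helper name.

```r
# Sample data retyped from the question
df1 <- data.frame(
  Timestamp = c(0.1, 0.2, 0.4, 0.5, 0.9, 1.1, 1.6, 2.0),
  Source    = c("P1", "P1", "P1", "P2", "P2", "P1", "P1", "P2"),
  Target    = c("P2", "P2", "P2", "P1", "P1", "P2", "P2", "P1"),
  Length    = c(5, 3, 4, 2, 4, 1, 3, 3),
  Content   = c("ABCDE", "HIJ", "PQRS", "ZY", "SRQP", "B", "DEF", "IJK"),
  stringsAsFactors = FALSE
)

# Run id: increases by one whenever the Source/Target pair changes
id <- paste(df1$Source, df1$Target)
group <- cumsum(c(TRUE, id[-1] != id[-length(id)]))

# Hypothetical helper: collapse one run of consecutive rows into a single row
summarise_run <- function(x) {
  data.frame(
    StartTime = min(x$Timestamp),
    EndTime   = max(x$Timestamp),
    Duration  = max(x$Timestamp) - min(x$Timestamp),
    Source    = x$Source[1],
    Target    = x$Target[1],
    Length    = sum(x$Length),
    Content   = paste(x$Content, collapse = ""),
    stringsAsFactors = FALSE
  )
}

# Split by run id, summarise each piece, and stack the rows
res <- do.call(rbind, lapply(split(df1, group), summarise_run))
rownames(res) <- NULL
res
```

Because `split()` keeps the pieces in run order and Timestamp only ever increases, `min`/`max` give the first and last time of each run. The trailing single-row run (2.0, P2 to P1) comes out with a Duration of 0.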

痴情换悲伤 2024-09-05 04:26:06

Try this:

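# Encode each Source/Target pair as a number, e.g. P1 -> P2 becomes 12
# (this relies on the names having the form "P" followed by digits)
id <- as.numeric(gsub("P","",paste(df$Source,df$Target,sep="")))
# Run id: increases by one whenever the pair changes between consecutive rows
df$id <- cumsum(c(TRUE,diff(id)!=0))
res <- by(df, df$id,
          function(x) {
            n <- nrow(x)
            start <- x[1,1]
            end <- x[n,1]
            dur <- end - start
            src <- as.character(x[1,2])
            trg <- as.character(x[1,3])
            len <- sum(x[,4])
            cont <- paste(x[,5],collapse="")
            # c() coerces every value to character
            return(c(start,end,dur,src,trg,len,cont))
          }
          )
do.call(rbind,res)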

P.S.: You would need to convert the result to the "correct" format, as the end result is a matrix of strings.
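
One possible way to do that conversion, as a rough sketch: the matrix `m` below is just a stand-in typed from the question's expected output (not produced by running the code above), and the column names are the ones the question asks for.

```r
# Stand-in for the character matrix that do.call(rbind, res) returns
# (values copied from the question's expected output)
m <- rbind(
  c("0.1", "0.4", "0.3", "P1", "P2", "12", "ABCDEHIJPQRS"),
  c("0.5", "0.9", "0.4", "P2", "P1", "6",  "ZYSRQP"),
  c("1.1", "1.6", "0.5", "P1", "P2", "4",  "BDEF")
)

# Matrix -> data frame, keeping strings as plain character columns
out <- as.data.frame(m, stringsAsFactors = FALSE)
names(out) <- c("StartTime", "EndTime", "Duration",
                "Source", "Target", "Length", "Content")

# Convert the numeric columns back from character
num_cols <- c("StartTime", "EndTime", "Duration", "Length")
out[num_cols] <- lapply(out[num_cols], as.numeric)
str(out)
```

Building a one-row data frame inside the function instead of a `c(...)` vector would avoid the round trip through character altogether.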

仅一夜美梦 2024-09-05 04:26:06

Sticking with my (admittedly inelegant) approach:

library(plyr)  # provides adply() and ldply()

df1 <- read.table(textConnection("
Timestamp Source Target Length Content
0.1         P1       P2       5        ABCDE
0.2         P1       P2       3        HIJ
0.4         P1       P2       4        PQRS
0.5         P2       P1       2        ZY
0.9         P2       P1       4        SRQP
1.1         P1       P2       1        B
1.6         P1       P2       3        DEF
2.0         P2       P1       3        IJK
"),header=TRUE)

# Encode each Source/Target pair as a number (P1 -> P2 becomes 12)
df <- adply(df1, 1 ,transform, newSource = 
as.numeric(paste(substr(Source, 2, 2),substr(Target, 2, 2),sep=""))  ) 

# Column 1: length of each run; column 2: index of the last row of each run
ind <- cbind(rle(df$newSource)[[1]],cumsum(rle(df$newSource)[[1]]))
# Row 1: index of the first row of each run; row 2: index of the last row
ind2 <- apply(ind,1,function(x) c(x[2]-(x[1]-1),x[2]))
# Build one summary row per run and stack them into a data frame
res <- ldply(apply(ind2,2,function(x) data.frame(StartTime = df[x[1],1] , 
EndTime = df[x[2],1] ,
Duration = df[x[2],1] - df[x[1],1] ,
Source = df[x[1],2] ,
Target = df[x[1],3] ,
Length=sum(df[x[1]:x[2],4]) ,
Content=paste(df[x[1]:x[2],5],collapse="")
) ))
res

  StartTime EndTime Duration Source Target Length      Content
1       0.1     0.4      0.3     P1     P2     12 ABCDEHIJPQRS
2       0.5     0.9      0.4     P2     P1      6       ZYSRQP
3       1.1     1.6      0.5     P1     P2      4         BDEF
4       2.0     2.0      0.0     P2     P1      3          IJK
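
The index arithmetic behind `ind` and `ind2` may be easier to follow in isolation. Here is a small sketch of how `rle()` yields the first and last row of each run; the `key` vector is an assumption, retyped from the Source/Target pairs in the sample data rather than taken from the code above.

```r
# One key per row, retyped from the Source/Target pairs in the sample data
key <- c("P1 P2", "P1 P2", "P1 P2", "P2 P1", "P2 P1", "P1 P2", "P1 P2", "P2 P1")

runs  <- rle(key)                  # lengths and values of consecutive runs
last  <- cumsum(runs$lengths)      # index of the last row of each run
first <- last - runs$lengths + 1   # index of the first row of each run
cbind(first, last)
```

For the sample data the runs cover rows 1 to 3, 4 to 5, 6 to 7, and 8 to 8, which matches the four rows in the result above.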
