如何识别和汇总数据框中匹配组的数据集？

发布于 2024-12-02 03:18:12 字数 862 浏览 0 评论 0原文

这是一个示例数据框：

set.seed(0)
x1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
x2 <- c(1, 1, 0, 0, 0, 1, 1, 1, 1)
x3 <- c(1, 1, 2, 2, 4, 1, 1, 2, 1)
n  <- c(1, 1, 1, 5, 5, 1, 1, 1, 1)
y <- rnorm(9)

mydf <- data.frame(x1, x2, x3, n, y)

我想做的是

识别 n=1 的行并且共享相同的 (x1, x2, x3) 值，
为每个子集返回一行，其中 y = Mean(y) 且 n = length (y)
保持其他行相同。

例如，新的数据框是

x1 <- c(1,            1,    1,    1,    2,                 2)
x2 <- c(1,            0,    0,    0,    1,                 1)
x3 <- c(1,            2,    2,    4,    1,                 2)
n  <- c(2,            1,    5,    5,    3,                 1)
y  <- c(mean(y[1:2]), y[3], y[4], y[5], mean(y[c(6:7,9)]), y[8])

newdf <- data.frame(x1, x2, x3, n, y)

我可以用条件和循环来解决这个问题，但我更愿意学习更优雅的方法来做到这一点。

原文

Here is an example dataframe:

set.seed(0)
x1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
x2 <- c(1, 1, 0, 0, 0, 1, 1, 1, 1)
x3 <- c(1, 1, 2, 2, 4, 1, 1, 2, 1)
n  <- c(1, 1, 1, 5, 5, 1, 1, 1, 1)
y <- rnorm(9)

mydf <- data.frame(x1, x2, x3, n, y)

What I would like to do is

identify rows with n=1 and which share identical values of (x1, x2, x3)
return a single row for each subset with y = mean(y) and n = length(y)
keep other rows the same.

for example, the new dataframe would be

x1 <- c(1,            1,    1,    1,    2,                 2)
x2 <- c(1,            0,    0,    0,    1,                 1)
x3 <- c(1,            2,    2,    4,    1,                 2)
n  <- c(2,            1,    5,    5,    3,                 1)
y  <- c(mean(y[1:2]), y[3], y[4], y[5], mean(y[c(6:7,9)]), y[8])

newdf <- data.frame(x1, x2, x3, n, y)

I can figure this out with conditionals and loops, but I would prefer to learn more elegant way to do this.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

飘过的浮云 2024-12-09 03:18:12

通过“其他列中的相同值”，我认为您的意思是每个子集由子集每行中相同的 x1 值定义，而不是 x1 > 等于x2。谢谢你的例子来看看你的意思。

library("plyr")

要获得第一部分和第二部分

ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, n = length(y), y = mean(y))

这可以用 rbind-ed 与 mydf 其中 n!=1 的部分来获得你所说的

rbind(
  ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, n = length(y), y = mean(y)),
  mydf[mydf$n!=1,]
)

这不与您列出的顺序不同。如果这确实很重要，您可以添加一些辅助排序变量。

mydf$order = seq(length=nrow(mydf))
newdf <- rbind(
  ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, 
    n = length(y), y = mean(y), order=min(order)),
  mydf[mydf$n!=1,]
)
newdf <- newdf[order(newdf$order),]
newdf$order <- NULL

By "identical values in other columns", I take it you mean that each subset is defined by the same value of x1 in each of the rows of the subset, not that x1 is equal to x2. Thanks for the example to see what you meant.

library("plyr")

To get parts one and two

ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, n = length(y), y = mean(y))

This can be rbind-ed with the part of mydf where n!=1 to get what you said

rbind(
  ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, n = length(y), y = mean(y)),
  mydf[mydf$n!=1,]
)

This doesn't have the same order as you listed. If that is really important, you can add some auxiliary sorting variables.

mydf$order = seq(length=nrow(mydf))
newdf <- rbind(
  ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, 
    n = length(y), y = mean(y), order=min(order)),
  mydf[mydf$n!=1,]
)
newdf <- newdf[order(newdf$order),]
newdf$order <- NULL

回复收藏 0 原文

~没有更多了~