How do I identify and summarize sets of rows with matching groups in a data frame?
Here is an example dataframe:
set.seed(0)
x1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
x2 <- c(1, 1, 0, 0, 0, 1, 1, 1, 1)
x3 <- c(1, 1, 2, 2, 4, 1, 1, 2, 1)
n <- c(1, 1, 1, 5, 5, 1, 1, 1, 1)
y <- rnorm(9)
mydf <- data.frame(x1, x2, x3, n, y)
What I would like to do is:
- identify rows with n = 1 that share identical values of (x1, x2, x3),
- return a single row for each such subset, with y = mean(y) and n = length(y),
- keep the other rows the same.
For example, the new dataframe would be:
x1 <- c(1, 1, 1, 1, 2, 2)
x2 <- c(1, 0, 0, 0, 1, 1)
x3 <- c(1, 2, 2, 4, 1, 2)
n <- c(2, 1, 5, 5, 3, 1)
y <- c(mean(y[1:2]), y[3], y[4], y[5], mean(y[c(6:7,9)]), y[8])
newdf <- data.frame(x1, x2, x3, n, y)
I can figure this out with conditionals and loops, but I would prefer to learn a more elegant way to do this.
Comments (1)
By "identical values in other columns", I take it you mean that each subset is defined by the same value of x1 in each of the rows of the subset, not that x1 is equal to x2. Thanks for the example; it shows what you meant.
To get parts one and two, aggregate the rows where n == 1 by (x1, x2, x3), taking mean(y) and the number of rows in each group.
This can be rbind-ed with the part of mydf where n != 1 to get what you described.
The result doesn't have the same row order as the one you listed. If that is really important, you can add some auxiliary sorting variables.
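The steps above can be sketched in base R as follows. This is one possible implementation under stated assumptions, not necessarily the code the answer originally used: it relies on aggregate(), merge() and rbind(), and the names ones, rest, agg_n, agg_y, agg and result are introduced here purely for illustration.

```r
# Rebuild the example data from the question.
set.seed(0)
mydf <- data.frame(
  x1 = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  x2 = c(1, 1, 0, 0, 0, 1, 1, 1, 1),
  x3 = c(1, 1, 2, 2, 4, 1, 1, 2, 1),
  n  = c(1, 1, 1, 5, 5, 1, 1, 1, 1),
  y  = rnorm(9)
)

# Split into the rows to collapse (n == 1) and the rows to keep unchanged.
ones <- mydf[mydf$n == 1, ]
rest <- mydf[mydf$n != 1, ]

# Parts one and two: one row per (x1, x2, x3) group.
# Every n in `ones` is 1, so summing n gives the group size, i.e. length(y).
agg_n <- aggregate(n ~ x1 + x2 + x3, data = ones, FUN = sum)
agg_y <- aggregate(y ~ x1 + x2 + x3, data = ones, FUN = mean)
agg   <- merge(agg_n, agg_y, by = c("x1", "x2", "x3"))

# Part three: put the untouched rows back and reset the row names.
result <- rbind(agg, rest)
rownames(result) <- NULL
result
```

The rows of result come out in a different order than in the question's newdf. If the order matters, one option is to sort afterwards, for example with order() on whichever columns define the desired sequence.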