Aggregate a dataframe on a given column and display another column

Posted 2024-11-15 03:30:31

I have a dataframe in R of the following form:

> head(data)
  Group Score Info
1     1     1    a
2     1     2    b
3     1     3    c
4     2     4    d
5     2     3    e
6     2     1    f

I would like to aggregate it by Group, applying the max function to the Score column:

> aggregate(data$Score, list(data$Group), max)

  Group.1         x
1       1         3
2       2         4

But I would also like to display the Info value associated with the maximum Score in each group. I have no idea how to do this. My desired output would be:

  Group.1         x        y
1       1         3        c
2       2         4        d

Any hint?

累赘 2024-11-22 03:30:31

A base R solution is to combine the output of aggregate() with a merge() step. I find the formula interface to aggregate() a little more useful than the standard interface, partly because the names on the output are nicer, so I'll use that:

The aggregate() step is

maxs <- aggregate(Score ~ Group, data = dat, FUN = max)

and the merge() step is simply

merge(maxs, dat)

This gives us the desired output:

R> maxs <- aggregate(Score ~ Group, data = dat, FUN = max)
R> merge(maxs, dat)
  Group Score Info
1     1     3    c
2     2     4    d

You could, of course, stick this into a one-liner (the intermediary step was more for exposition):

merge(aggregate(Score ~ Group, data = dat, FUN = max), dat)

The main reason I used the formula interface is that it returns a data frame with the correct names for the merge step; these are the names of the columns from the original data set dat. We need to have the output of aggregate() have the correct names so that merge() knows which columns in the original and aggregated data frames match.

The standard interface gives odd names, whichever way you call it:

R> aggregate(dat$Score, list(dat$Group), max)
  Group.1 x
1       1 3
2       2 4
R> with(dat, aggregate(Score, list(Group), max))
  Group.1 x
1       1 3
2       2 4

We can use merge() on those outputs, but we need to do more work telling R which columns match up.
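As a minimal sketch of that extra work, assuming `dat` is the six-row data frame from the question (rebuilt here so the snippet runs on its own; `maxs_std` and `res` are names chosen for illustration), the `by.x` and `by.y` arguments of merge() can name the matching columns explicitly:

```r
# Rebuild the example data from the question (assumed to match `dat`)
dat <- data.frame(Group = c(1, 1, 1, 2, 2, 2),
                  Score = c(1, 2, 3, 4, 3, 1),
                  Info  = c("a", "b", "c", "d", "e", "f"))

# Standard interface: the output columns are named Group.1 and x
maxs_std <- aggregate(dat$Score, list(dat$Group), max)

# Tell merge() which columns correspond in each data frame
res <- merge(maxs_std, dat,
             by.x = c("Group.1", "x"),
             by.y = c("Group", "Score"))
print(res)
```

The key columns in the result keep the names from the first data frame (Group.1 and x), which is one more reason the formula interface is tidier here.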

游魂 2024-11-22 03:30:31

First, you split the data using split:

split(z,z$Group)

Then, for each chunk, select the row with the maximum Score:

lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),])

Finally, reduce back to a data.frame by passing rbind to do.call:

do.call(rbind,lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),]))

Result:

  Group Score Info
1     1     3    c
2     2     4    d

One line, no magic spells, fast, result has good names =)

不奢求什么 2024-11-22 03:30:31

Here is a solution using the plyr package.

The following line of code essentially tells ddply to first group your data by Group, and then, within each group, to return the subset of rows where Score equals the maximum score in that group.

library(plyr)
ddply(data, .(Group), function(x)x[x$Score==max(x$Score), ])

  Group Score Info
1     1     3    c
2     2     4    d

And, as @SachaEpskamp points out, this can be further simplified to:

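ddply(data, .(Group), function(x)x[which.max(x$Score), ])

(Note that which.max returns only the index of the first maximum, so this version returns a single row per group even when several rows tie for the maximum; the == version above returns all tied rows.)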
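The tie behaviour of the two variants can be checked directly. The sketch below uses base R (split, lapply and do.call) rather than plyr so that it runs without extra packages; the data frame `tied` is hypothetical and built only to contain a tie in group 1.

```r
# Hypothetical data with a tie for the top score in group 1
tied <- data.frame(Group = c(1, 1, 1, 2, 2),
                   Score = c(3, 3, 1, 4, 2),
                   Info  = c("a", "b", "c", "d", "e"))

# which.max() gives the index of the first maximum only
first_only <- do.call(rbind, lapply(split(tied, tied$Group),
                                    function(x) x[which.max(x$Score), ]))

# The equality test keeps every row that reaches the group maximum
all_ties <- do.call(rbind, lapply(split(tied, tied$Group),
                                  function(x) x[x$Score == max(x$Score), ]))

print(first_only$Info)  # one row per group
print(all_ties$Info)    # both tied rows of group 1 are kept
```

Which variant is appropriate depends on whether one representative row per group or every tied row is wanted.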

云巢 2024-11-22 03:30:31

To add to Gavin's answer: prior to the merge, it is possible to get aggregate to use proper names when not using the formula interface:

aggregate(data[, "Score", drop = FALSE], list(Group = data$Group), max)
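A short sketch of how this feeds into the merge step, assuming the question's column names (Group, Score, Info) and max as the aggregation function; the data frame is rebuilt here so the snippet is self-contained:

```r
# Example data from the question
data <- data.frame(Group = c(1, 1, 1, 2, 2, 2),
                   Score = c(1, 2, 3, 4, 3, 1),
                   Info  = c("a", "b", "c", "d", "e", "f"))

# drop = FALSE keeps a one-column data frame, so the name "Score" survives;
# naming the list element gives the grouping column the name "Group"
maxs <- aggregate(data[, "Score", drop = FALSE],
                  list(Group = data$Group), max)

# The names now match the original data, so merge() needs no by.x / by.y
res <- merge(maxs, data)
print(res)
```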

神魇的王 2024-11-22 03:30:31

The plyr package can be used for this. With the ddply() function you can split a data frame on one or more columns, apply a function to each piece, and get a data frame back. With the summarize() function you can then use the columns of each piece as variables to build the new data frame:

dat <- read.table(textConnection('Group Score Info
1     1     1    a
2     1     2    b
3     1     3    c
4     2     4    d
5     2     3    e
6     2     1    f'))

library("plyr")

ddply(dat,.(Group),summarize,
    Max = max(Score),
    Info = Info[which.max(Score)])
  Group Max Info
1     1   3    c
2     2   4    d

万劫不复 2024-11-22 03:30:31

A late answer, but here is an approach using data.table:

library(data.table)
DT <- data.table(dat)

DT[, .SD[which.max(Score),], by = Group]

Or, if more than one row in a group can share the highest score:

DT[, .SD[which(Score == max(Score)),], by = Group]

Note that (from ?data.table):

.SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s)

苏璃陌 2024-11-22 03:30:31

This is basically how I think of the problem.

my.df <- data.frame(group = rep(c(1,2), each = 3), 
        score = runif(6), info = letters[1:6])
my.agg <- with(my.df, aggregate(score, list(group), max))
my.df.split <- with(my.df, split(x = my.df, f = group))
my.agg$info <- unlist(lapply(my.df.split, FUN = function(x) {
            x[which(x$score == max(x$score)), "info"]
        }))

> my.agg
  Group.1         x info
1       1 0.9344336    a
2       2 0.7699763    e

亚希 2024-11-22 03:30:31

I don't have a high enough reputation to comment on Gavin Simpson's answer, but I wanted to warn that the standard syntax and the formula syntax for aggregate() differ in their default treatment of missing values. The formula method defaults to na.action = na.omit, so rows containing NA are removed before grouping.

#Create some data with missing values 
a<-data.frame(day=rep(1,5),hour=c(1,2,3,3,4),val=c(1,NA,3,NA,5))
  day hour val
1   1    1   1
2   1    2  NA
3   1    3   3
4   1    3  NA
5   1    4   5

#Standard syntax
aggregate(a$val,by=list(day=a$day,hour=a$hour),mean,na.rm=T)
  day hour   x
1   1    1   1
2   1    2 NaN
3   1    3   3
4   1    4   5

#Formula syntax.  Note the index for hour 2 has been silently dropped.
aggregate(val ~ hour + day,data=a,mean,na.rm=T)
  hour day val
1    1   1   1
2    3   1   3
3    4   1   5
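If the formula interface is preferred but hour 2 should stay in the output, one option is to override the default `na.action`. The sketch below assumes the same data frame `a` as above and passes `na.action = na.pass`, so the NA rows reach `mean()` and are then handled by `na.rm = TRUE`:

```r
# Same example data as above
a <- data.frame(day = rep(1, 5), hour = c(1, 2, 3, 3, 4),
                val = c(1, NA, 3, NA, 5))

# na.pass keeps rows with NA instead of dropping them before grouping;
# na.rm = TRUE is forwarded to mean() for each group
res <- aggregate(val ~ hour + day, data = a, FUN = mean,
                 na.rm = TRUE, na.action = na.pass)
print(res)
```

With this call hour 2 is retained with a value of NaN, matching what the standard syntax reports.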
