Aggregate a data frame on a given column and display another column
I have a data frame in R of the following form:
> head(data)
  Group Score Info
1     1     1    a
2     1     2    b
3     1     3    c
4     2     4    d
5     2     3    e
6     2     1    f
I would like to aggregate it on the Score column using the max function:
> aggregate(data$Score, list(data$Group), max)
  Group.1 x
1       1 3
2       2 4
But I would also like to display the Info value associated with the maximum value of the Score column for each group. I have no idea how to do this. My desired output would be:
  Group.1 x y
1       1 3 c
2       2 4 d
Any hints?
Answers (8)
A base R solution is to combine the output of aggregate() with a merge() step. I find the formula interface to aggregate() a little more useful than the standard interface, partly because the names on the output are nicer, so I'll use that.
The aggregate() step computes the maximum Score for each Group through the formula interface, and the merge() step simply joins that result back onto the original data set dat. This gives us the desired output: one row per group, holding Group, the maximum Score, and the matching Info.
You could, of course, stick this into a one-liner (the intermediary step was more for exposition).
The main reason I used the formula interface is that it returns a data frame with the correct names for the merge step; these are the names of the columns from the original data set dat. We need the output of aggregate() to have the correct names so that merge() knows which columns in the original and aggregated data frames match.
The standard interface gives odd names, whichever way you call it (Group.1 and x, as in the question). We can use merge() on those outputs, but we need to do more work telling R which columns match up.
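The code from this answer is not preserved on this page, so the following is only a minimal sketch of the aggregate()-then-merge() approach it describes. The sample data is rebuilt from the question, the name dat follows the answer's text, and maxs, result, one_liner, maxs_std and result_std are hypothetical names chosen for illustration.

```r
# Sample data reconstructed from the question.
dat <- data.frame(
  Group = c(1, 1, 1, 2, 2, 2),
  Score = c(1, 2, 3, 4, 3, 1),
  Info  = c("a", "b", "c", "d", "e", "f")
)

# Step 1: per-group maximum via the formula interface.
# The output columns keep the original names Group and Score.
maxs <- aggregate(Score ~ Group, data = dat, FUN = max)

# Step 2: merge() joins on the columns both data frames share (Group and Score),
# which pulls in the Info value of the row(s) holding the maximum.
result <- merge(maxs, dat)
print(result)

# The same thing as a one-liner.
one_liner <- merge(aggregate(Score ~ Group, data = dat, FUN = max), dat)

# With the standard interface the output is named Group.1 and x,
# so the matching columns have to be spelled out by hand.
maxs_std <- aggregate(dat$Score, list(dat$Group), max)
result_std <- merge(maxs_std, dat,
                    by.x = c("Group.1", "x"),
                    by.y = c("Group", "Score"))
```

If several rows in a group share the maximum Score, merge() keeps all of them.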
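First, you split the data by Group using split.
Then, for each chunk, select the row with the maximum Score.
Finally, reduce the chunks back to a single data.frame by do.call-ing rbind. The result has one row per group and keeps the original column names.
One line, no magic spells, fast, and the result has good names =)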
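The three steps above can be sketched as follows. The answer's own code is not preserved here, so this is a reconstruction under assumptions: the data comes from the question, and dat, chunks, best and result are illustrative names.

```r
# Sample data reconstructed from the question.
dat <- data.frame(
  Group = c(1, 1, 1, 2, 2, 2),
  Score = c(1, 2, 3, 4, 3, 1),
  Info  = c("a", "b", "c", "d", "e", "f")
)

# 1. Split the data frame into a list of per-group data frames.
chunks <- split(dat, dat$Group)

# 2. In each chunk, keep the row with the highest Score.
#    which.max() returns the first maximum only, so ties are dropped.
best <- lapply(chunks, function(chunk) chunk[which.max(chunk$Score), ])

# 3. Stack the one-row data frames back into a single data frame.
result <- do.call(rbind, best)
print(result)
```

Nesting the three calls gives the one-liner the answer mentions: do.call(rbind, lapply(split(dat, dat$Group), function(chunk) chunk[which.max(chunk$Score), ])).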
Here is a solution using the plyr package.
The idea is to tell ddply to first group your data by Group, and then, within each group, return the subset where the Score equals the maximum score in that group.
And, as @SachaEpskamp points out, this can be further simplified (in a form that also has the advantage over which.max of returning multiple max lines, if there are any; which.max itself returns only the first maximum).
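A minimal sketch of what such ddply calls might look like, assuming the plyr package is installed. The exact lines from the answer are not preserved, so both forms below are guesses at the described technique; dat, res_fun and res_sub are illustrative names.

```r
library(plyr)

# Sample data reconstructed from the question.
dat <- data.frame(
  Group = c(1, 1, 1, 2, 2, 2),
  Score = c(1, 2, 3, 4, 3, 1),
  Info  = c("a", "b", "c", "d", "e", "f")
)

# Group by Group, then keep the rows whose Score equals the group maximum.
res_fun <- ddply(dat, .(Group), function(x) x[x$Score == max(x$Score), ])

# Shorter form: let ddply call subset() on each piece.
res_sub <- ddply(dat, .(Group), subset, Score == max(Score))

print(res_sub)
```

Because both forms compare Score with max(Score), every row tied for the maximum is returned.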
To add to Gavin's answer: prior to the merge, it is possible to get aggregate to use proper names when not using the formula interface:
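One way this can be done, sketched under assumptions (the answer's own code is not preserved; dat, maxs and result are illustrative names): pass the value column as a one-column data frame and give the grouping list a name, so that aggregate() reuses both names in its output.

```r
# Sample data reconstructed from the question.
dat <- data.frame(
  Group = c(1, 1, 1, 2, 2, 2),
  Score = c(1, 2, 3, 4, 3, 1),
  Info  = c("a", "b", "c", "d", "e", "f")
)

# drop = FALSE keeps Score as a one-column data frame, so its name survives;
# naming the list element sets the name of the grouping column.
maxs <- aggregate(dat[, "Score", drop = FALSE], list(Group = dat$Group), max)
print(names(maxs))

# The names now match dat, so merge() needs no by.x / by.y arguments.
result <- merge(maxs, dat)
print(result)
```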
The plyr package can be used for this. With the ddply() function you can split a data frame on one or more columns, apply a function to each piece, and return a data frame; with the summarize() function you can then use the columns of the split data frame as variables to build the new data frame.
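A sketch of the ddply() plus summarize() combination described above, assuming plyr is installed. The output column names x and y are taken from the question's desired output; dat and result are illustrative names, and the call is a reconstruction, not the answer's original code.

```r
library(plyr)

# Sample data reconstructed from the question.
dat <- data.frame(
  Group = c(1, 1, 1, 2, 2, 2),
  Score = c(1, 2, 3, 4, 3, 1),
  Info  = c("a", "b", "c", "d", "e", "f")
)

# Within each Group, Score and Info are available as plain variables:
# x is the maximum Score, y is the Info value on the row where it occurs.
result <- ddply(dat, .(Group), summarize,
                x = max(Score),
                y = Info[which.max(Score)])
print(result)
```

Since which.max() picks the first maximum, this form returns exactly one row per group even when scores are tied.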
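A late answer, but here is an approach using data.table: group by Group and keep the row with the highest Score in each group.
Or, if it is possible to have more than one equally highest score, keep every row whose Score equals the group maximum.
See ?data.table for the details.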
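A minimal sketch of both variants, assuming the data.table package is installed. The answer's code and its quotation from ?data.table are not preserved; the sketch assumes the note concerned .SD, which holds the subset of the data for the current group without the grouping column. DT, first_max and all_max are illustrative names.

```r
library(data.table)

# Sample data reconstructed from the question.
dat <- data.frame(
  Group = c(1, 1, 1, 2, 2, 2),
  Score = c(1, 2, 3, 4, 3, 1),
  Info  = c("a", "b", "c", "d", "e", "f")
)
DT <- as.data.table(dat)

# One row per group: the first row holding the maximum Score.
first_max <- DT[, .SD[which.max(Score)], by = Group]

# All rows tied for the maximum Score in each group.
all_max <- DT[, .SD[Score == max(Score)], by = Group]

print(first_max)
```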
This is how I base-ically think of the problem.
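The base R code that accompanied this answer is not preserved, so the following is not the author's method. It is just one possible base-only reading of the problem, using ave() to spread each group's maximum back over its rows; dat, is_max and result are illustrative names.

```r
# Sample data reconstructed from the question.
dat <- data.frame(
  Group = c(1, 1, 1, 2, 2, 2),
  Score = c(1, 2, 3, 4, 3, 1),
  Info  = c("a", "b", "c", "d", "e", "f")
)

# ave() returns a vector as long as Score, holding each row's group maximum.
is_max <- dat$Score == ave(dat$Score, dat$Group, FUN = max)

# Keep the rows where the Score is the group maximum (ties included).
result <- dat[is_max, ]
print(result)
```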
I don't have a high enough reputation to comment on Gavin Simpson's answer, but I wanted to warn that there seems to be a difference in the default treatment of missing values between the standard syntax and the formula syntax for aggregate.
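A small sketch of the difference, using made-up data with one missing Score (dat_na, std and frm are hypothetical names). The formula method defaults to na.action = na.omit, which drops incomplete rows before aggregating, while the standard method hands the NA to the function.

```r
# Illustrative data: group 1 has one missing Score.
dat_na <- data.frame(
  Group = c(1, 1, 2, 2),
  Score = c(1, NA, 4, 3)
)

# Standard interface: max() sees the NA, so group 1 yields NA.
std <- aggregate(dat_na$Score, list(Group = dat_na$Group), max)

# Formula interface: the NA row is removed first, so group 1 yields 1.
frm <- aggregate(Score ~ Group, data = dat_na, FUN = max)

print(std)
print(frm)

# Passing na.rm = TRUE through to max() makes the standard interface agree.
std_rm <- aggregate(dat_na$Score, list(Group = dat_na$Group), max, na.rm = TRUE)
```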