Count number of rows per group and add result to original data frame
Say I have a data.frame object:
df <- data.frame(name=c('black','black','black','red','red'),
                 type=c('chair','chair','sofa','sofa','plate'),
                 num=c(4,5,12,4,3))
Now I want to count the number of rows (observations) for each combination of name and type. This can be done like so:
table(df[ , c("name","type")])
or possibly also with plyr (though I am not sure how).
However, how do I get the results incorporated into the original data frame, so that the result looks like this:
df
#    name  type num count
# 1 black chair   4     2
# 2 black chair   5     2
# 3 black  sofa  12     1
# 4   red  sofa   4     1
# 5   red plate   3     1
where count now stores the results from the aggregation.
A solution with plyr could be interesting to learn as well, though I would like to see how this is done with base R.
Answers (12)
Using data.table:
For a pre-data.table 1.8.2 alternative, see the edit history.
Using dplyr:
Or simply:
Using plyr:
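The three approaches named above might look roughly like the following. This is a minimal sketch rather than the answer's own code: it assumes the data frame df from the question and that data.table, dplyr, and plyr are installed, and the result names dt, df_dplyr, df_add_count, and df_plyr are illustrative.

```r
library(data.table)
library(dplyr)

df <- data.frame(name = c("black", "black", "black", "red", "red"),
                 type = c("chair", "chair", "sofa", "sofa", "plate"),
                 num  = c(4, 5, 12, 4, 3))

# data.table: .N is the number of rows in each (name, type) group;
# := adds the column by reference and keeps the original row order
dt <- as.data.table(df)
dt[, count := .N, by = .(name, type)]

# dplyr: n() returns the group size inside a grouped mutate()
df_dplyr <- df %>%
  group_by(name, type) %>%
  mutate(count = n()) %>%
  ungroup()

# dplyr shortcut: add_count() groups, counts, and ungroups in one call
df_add_count <- add_count(df, name, type, name = "count")

# plyr (called via :: so it does not mask dplyr functions):
# transform() runs once per group, so length(num) is the group size
df_plyr <- plyr::ddply(df, c("name", "type"), transform, count = length(num))
```

Note that ddply() returns the rows sorted by the grouping variables, while the data.table and dplyr variants keep the input order.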
You can use ave:
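A minimal sketch of the ave approach, assuming the data frame df from the question. Passing the two grouping columns separately and using FUN = length is one plausible form of the call, not necessarily the one the answer used.

```r
df <- data.frame(name = c("black", "black", "black", "red", "red"),
                 type = c("chair", "chair", "sofa", "sofa", "plate"),
                 num  = c(4, 5, 12, 4, 3))

# ave() applies FUN within each name/type group and returns a vector
# aligned with the original rows, so it can be assigned directly
df$count <- ave(df$num, df$name, df$type, FUN = length)
df
```

Because ave() returns one value per input row, no merge step is needed and the row order is unchanged.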
You can do this:
or perhaps more intuitively,
This should do your work:
The base R function aggregate will obtain the counts with a one-liner, but adding those counts back to the original data.frame seems to take a bit of processing.
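One way the aggregate route could go, sketched under the assumption that the counts are merged back with merge(). The intermediate names counts and df_counted are illustrative.

```r
df <- data.frame(name = c("black", "black", "black", "red", "red"),
                 type = c("chair", "chair", "sofa", "sofa", "plate"),
                 num  = c(4, 5, 12, 4, 3))

# The one-liner: count rows per name/type combination
counts <- aggregate(num ~ name + type, data = df, FUN = length)

# The extra processing: rename the count column, then join back
names(counts)[names(counts) == "num"] <- "count"
df_counted <- merge(df, counts, by = c("name", "type"))
df_counted
```

merge() sorts the result by the key columns, so the row order may differ from the original data frame.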
Using sqldf package:
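A minimal sketch of an sqldf query for this task, assuming the sqldf package is installed and the data frame df from the question exists in the calling environment. The self-join shown here is one possible formulation, and the aliases a and b are illustrative.

```r
library(sqldf)

df <- data.frame(name = c("black", "black", "black", "red", "red"),
                 type = c("chair", "chair", "sofa", "sofa", "plate"),
                 num  = c(4, 5, 12, 4, 3))

# Count rows per group in a subquery, then join the counts back
# onto every original row
df_counted <- sqldf("
  SELECT a.*, b.count
  FROM df a
  JOIN (SELECT name, type, COUNT(*) AS count
        FROM df
        GROUP BY name, type) b
    ON a.name = b.name AND a.type = b.type
")
df_counted
```

SQL does not guarantee row order without an ORDER BY clause, so the rows may come back in a different order than in df.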
Another option is add_tally from dplyr. Here is a reproducible example:
Created on 2022-09-11 with reprex v2.0.2
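A minimal sketch of the add_tally approach, assuming dplyr is installed and using the data frame from the question. It is not the answer's own reprex, and the result name df_counted is illustrative.

```r
library(dplyr)

df <- data.frame(name = c("black", "black", "black", "red", "red"),
                 type = c("chair", "chair", "sofa", "sofa", "plate"),
                 num  = c(4, 5, 12, 4, 3))

# add_tally() adds the size of each existing group as a new column,
# so the data must be grouped first
df_counted <- df %>%
  group_by(name, type) %>%
  add_tally(name = "count") %>%
  ungroup()
df_counted
```

Without the name argument the new column would be called n.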
A two-line alternative is to generate a variable of 0s and then fill it in with split<-, split, and lengths. This returns the desired result.
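The two lines described above can be sketched as follows, assuming the data frame df from the question.

```r
df <- data.frame(name = c("black", "black", "black", "red", "red"),
                 type = c("chair", "chair", "sofa", "sofa", "plate"),
                 num  = c(4, 5, 12, 4, 3))

# Line 1: create the new column, filled with 0s
df$count <- 0L

# Line 2: lengths(split(...)) gives the size of every name/type group;
# split<- writes each group's size back into that group's rows
split(df$count, df[c("name", "type")]) <-
  lengths(split(df$num, df[c("name", "type")]))
df
```

Empty combinations have no rows in df, so their zero lengths are never written anywhere.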
Essentially, the RHS calculates the lengths of each name-type combination, returning a named vector of length 6 with 0s for "red.chair" and "black.plate". This is fed to the LHS with split<-, which takes the vector and assigns the values to their matching positions. This is essentially what ave does, as you can see from the second-to-last line of its source. However, lengths is an optimized version of sapply(list, length).
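To make the explanation concrete, here is a small sketch that inspects the right-hand side and compares it with sapply and ave, assuming the data frame from the question. The names groups, rhs, via_sapply, and via_ave are illustrative.

```r
df <- data.frame(name = c("black", "black", "black", "red", "red"),
                 type = c("chair", "chair", "sofa", "sofa", "plate"),
                 num  = c(4, 5, 12, 4, 3))

groups <- df[c("name", "type")]

# Named vector with one entry per name/type combination,
# including the two combinations that never occur
rhs <- lengths(split(df$num, groups))
print(rhs)

# lengths() gives the same values as sapply(list, length)
via_sapply <- sapply(split(df$num, groups), length)

# ave() arrives at the same per-row counts
via_ave <- ave(df$num, df$name, df$type, FUN = length)

# Printing the body of ave shows the split<- call it relies on
print(body(ave))
```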
You were just one step away from incorporating the row count into the base dataset.
Using the tidy() function from the broom package, convert the frequency table into a data frame and inner join with df:
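The same idea (frequency table to data frame, then inner join) can be sketched in base R. Note that this swaps in as.data.frame() as a stand-in for broom's tidy(), since the answer's exact call is not shown here. The names freq and df_counted are illustrative.

```r
df <- data.frame(name = c("black", "black", "black", "red", "red"),
                 type = c("chair", "chair", "sofa", "sofa", "plate"),
                 num  = c(4, 5, 12, 4, 3))

# Convert the frequency table into a long data frame with one row per
# name/type combination; responseName sets the name of the count column
freq <- as.data.frame(table(df[, c("name", "type")]),
                      responseName = "count",
                      stringsAsFactors = FALSE)

# merge() performs an inner join by default, so combinations with a
# count of 0 (which have no rows in df) drop out
df_counted <- merge(df, freq, by = c("name", "type"))
df_counted
```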
One simple line in base R:
Same in two lines, for clarity/efficiency:
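One base R formulation that fits this description is to index a frequency table by the grouping factor itself. This is an assumption about what the answer intended, not a reconstruction of its code. The names grp, count_one_line, and count_two_lines are illustrative.

```r
df <- data.frame(name = c("black", "black", "black", "red", "red"),
                 type = c("chair", "chair", "sofa", "sofa", "plate"),
                 num  = c(4, 5, 12, 4, 3))

# One line: tabulate the name/type interaction, then look up each
# row's group in that table (a factor index uses its integer codes)
df$count_one_line <- as.integer(
  table(interaction(df$name, df$type))[interaction(df$name, df$type)])

# Two lines: compute the grouping factor once and reuse it
grp <- interaction(df$name, df$type)
df$count_two_lines <- as.integer(table(grp)[grp])
df
```

The two-line form avoids computing the interaction twice, which is where the efficiency gain would come from.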
In collapse, with fcount. fcount is noticeably faster than any of the other options.
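A minimal sketch of the fcount call, assuming the collapse package is installed and that add = TRUE is used to append the count to the original data. The exact arguments in the answer are not shown here, and df_counted is an illustrative name.

```r
library(collapse)

df <- data.frame(name = c("black", "black", "black", "red", "red"),
                 type = c("chair", "chair", "sofa", "sofa", "plate"),
                 num  = c(4, 5, 12, 4, 3))

# add = TRUE appends the group count to df instead of returning
# one row per group; the new column is named N by default
df_counted <- fcount(df, name, type, add = TRUE)
df_counted
```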
Another way that generalizes more: