Summarize a data frame, ignoring duplicates
I have a data frame in which one column contains repeated entries. I want to summarize the other columns based on that column, counting each unique entry once rather than counting every row.
For example, in the data frame below, if I want to answer the question "how many of the people surveyed are young, midage and old?", "RefID" 1-1 should count as 1 toward "ageclass" = young, not as 5.
RefID Altitude Sex ageclass
1-1 Low F young
1-1 Low F young
1-1 Low F young
1-1 Low F young
1-1 Low F young
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-5 Low F old
1-5 Low F old
1-5 Low F old
1-5 Low F old
1-5 Low F old
1-5 Low F old
1-5 Low F old
1-7 Low F old
1-7 Low F old
1-7 Low F old
1-7 Low F old
1-8 Low F old
1-8 Low F old
1-9 Low F old
1-9 Low F old
1-9 Low F old
Thank You.
Comments (3)
The `plyr` package is useful for this. E.g. you could do:
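
A minimal sketch of what such a `plyr` call might look like, assuming the data frame from the question is stored as `df` (a hypothetical name, rebuilt here in compact form). `ddply` splits the rows by `ageclass` and counts the distinct `RefID` values in each group; this is one way to use the package, not necessarily the call the answer had in mind.

```r
library(plyr)

# Rebuild the question's data: each RefID is repeated as in the table.
ids  <- c("1-1", "1-2", "1-3", "1-4", "1-5", "1-7", "1-8", "1-9")
reps <- c(5, 12, 18, 12, 7, 4, 2, 3)
age  <- c("young", "midage", rep("old", 6))
df <- data.frame(RefID = rep(ids, reps), Altitude = "Low", Sex = "F",
                 ageclass = rep(age, reps), stringsAsFactors = FALSE)

# Count distinct RefID values within each age class.
counts <- ddply(df, .(ageclass), summarise, n = length(unique(RefID)))
counts
```

For this data the result has one row per age class: 1 young, 1 midage and 6 old respondents.
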
To get unique entries in a data frame, see `?unique`:
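
As an illustration, a short sketch of `unique()` applied to the question's data, assuming it is stored as `df` (a hypothetical name, rebuilt here in compact form). Because every column is constant within a `RefID`, dropping duplicate rows leaves one row per respondent.

```r
# Rebuild the question's data: each RefID is repeated as in the table.
ids  <- c("1-1", "1-2", "1-3", "1-4", "1-5", "1-7", "1-8", "1-9")
reps <- c(5, 12, 18, 12, 7, 4, 2, 3)
age  <- c("young", "midage", rep("old", 6))
df <- data.frame(RefID = rep(ids, reps), Altitude = "Low", Sex = "F",
                 ageclass = rep(age, reps), stringsAsFactors = FALSE)

u <- unique(df)  # drops rows that are identical in every column
nrow(df)         # 63 rows in the raw data
nrow(u)          # 8 rows, one per RefID
```
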
You can use `by`:
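
A possible use of `by`, sketched under the assumption that the question's data is stored as `df` (a hypothetical name) and deduplicated first. The function passed to `by` is a choice made for this example: `summary` gives a per-group overview, and `nrow` gives a plain count per age class.

```r
# Rebuild the question's data: each RefID is repeated as in the table.
ids  <- c("1-1", "1-2", "1-3", "1-4", "1-5", "1-7", "1-8", "1-9")
reps <- c(5, 12, 18, 12, 7, 4, 2, 3)
age  <- c("young", "midage", rep("old", 6))
df <- data.frame(RefID = rep(ids, reps), Altitude = "Low", Sex = "F",
                 ageclass = rep(age, reps), stringsAsFactors = FALSE)

u <- unique(df)  # one row per respondent

# Apply a function to the rows of each age class.
by(u, u$ageclass, summary)         # per-group summary of every column
n_by <- by(u, u$ageclass, nrow)    # number of respondents per group
n_by
```
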
See also `?summary` to understand the outcome. If you are interested in counts, you can use `table`, or ask for a summary instead.
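
A sketch of both options on the deduplicated data, assuming the question's data frame is stored as `df` (a hypothetical name). Converting `ageclass` to a factor before calling `summary` is an assumption of this example; it makes `summary` report counts per level.

```r
# Rebuild the question's data: each RefID is repeated as in the table.
ids  <- c("1-1", "1-2", "1-3", "1-4", "1-5", "1-7", "1-8", "1-9")
reps <- c(5, 12, 18, 12, 7, 4, 2, 3)
age  <- c("young", "midage", rep("old", 6))
df <- data.frame(RefID = rep(ids, reps), Altitude = "Low", Sex = "F",
                 ageclass = rep(age, reps), stringsAsFactors = FALSE)

# Keep one row per distinct (RefID, ageclass) pair.
u <- unique(df[c("RefID", "ageclass")])

age_counts  <- table(u$ageclass)            # counts per age class
age_summary <- summary(factor(u$ageclass))  # same counts via summary()
age_counts
```
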
EDIT:
You'll have to be a bit careful, as `unique()` takes the unique rows. If you have both a male and a female with RefID 1-1, then you'll still count that ID twice. But I presume that won't be the case in your data. If you really want to make sure, you can select rows by unique RefID instead, or take the `plyr` solution mentioned here.
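
To make the caveat concrete, here is a small sketch with made-up data (`df2` and its values are hypothetical, chosen so that one `RefID` appears with two different `Sex` values). It contrasts row-wise `unique()` with keeping the first row per `RefID` via `duplicated()`, which is one way to deduplicate on the ID alone.

```r
# Hypothetical data: RefID "1-1" appears with both "F" and "M".
df2 <- data.frame(
  RefID    = c("1-1", "1-1", "1-1", "1-2"),
  Sex      = c("F", "F", "M", "F"),
  ageclass = c("young", "young", "young", "midage"),
  stringsAsFactors = FALSE
)

by_row <- unique(df2)                   # unique rows: "1-1" survives twice
by_id  <- df2[!duplicated(df2$RefID), ] # first row for each RefID only

nrow(by_row)  # 3
nrow(by_id)   # 2
```
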
With `subset` you make a subset of the data, and with `duplicated` you get a logical vector indicating whether a value has already occurred in a vector. First create a small sample dataset, then make a subset in which only the first occurrence of each ID is kept, and then summarize that subset.
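
The steps above can be sketched as follows. The sample data and the names `dat`, `first_rows` and `age_counts` are hypothetical, chosen for illustration; the summary step uses `table`, which is one reasonable choice among several.

```r
# Step 1: a small sample dataset with repeated IDs (made-up values).
dat <- data.frame(
  RefID    = c("1-1", "1-1", "1-2", "1-2", "1-2", "1-3", "1-4", "1-4"),
  ageclass = c("young", "young", "midage", "midage", "midage",
               "old", "old", "old"),
  stringsAsFactors = FALSE
)

# Step 2: keep only the first row in which each RefID appears.
# duplicated() is TRUE for every repeat of a value seen earlier.
first_rows <- subset(dat, !duplicated(RefID))

# Step 3: summarize the deduplicated rows.
age_counts <- table(first_rows$ageclass)
age_counts
```

Here four respondents remain after deduplication: 1 young, 1 midage and 2 old.
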