Summarizing a data frame while ignoring duplicates

Posted 2024-10-17 10:19:51

I have a data frame in which one column contains repeated entries. I want to summarize the other columns based on that column, counting each unique entry once rather than counting every row.
For example, in the data frame below, to answer the question "how many of the people surveyed are young, midage and old?", "RefID" 1-1 should contribute a count of 1 to "ageclass" = young, not a count of 5.

RefID   Altitude    Sex ageclass
1-1 Low F   young
1-1 Low F   young
1-1 Low F   young
1-1 Low F   young
1-1 Low F   young
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-5 Low F   old
1-5 Low F   old
1-5 Low F   old
1-5 Low F   old
1-5 Low F   old
1-5 Low F   old
1-5 Low F   old
1-7 Low F   old
1-7 Low F   old
1-7 Low F   old
1-7 Low F   old
1-8 Low F   old
1-8 Low F   old
1-9 Low F   old
1-9 Low F   old
1-9 Low F   old

Thank You.

Comments (3)

驱逐舰岛风号 2024-10-24 10:19:51

The plyr package is useful for this. For example, you could count the distinct RefID values within each ageclass:

> require(plyr)
> ddply( df, .(ageclass), summarise, Num = length(unique(RefID)))
  ageclass Num
1   midage   1
2      old   6
3    young   1
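
For readers without plyr installed, the same per-group count of distinct IDs can be sketched in base R. This is a minimal sketch, not the answerer's code: the data frame is rebuilt from the rows listed in the question, and the names `ids`, `reps`, `ages`, `n_unique` and `res` are introduced here only for illustration.

```r
# Rebuild the question's sample data; the repeat counts match the rows listed there
ids  <- c("1-1", "1-2", "1-3", "1-4", "1-5", "1-7", "1-8", "1-9")
reps <- c(5, 12, 18, 12, 7, 4, 2, 3)
ages <- c("young", "midage", rep("old", 6))

df <- data.frame(
  RefID    = rep(ids, times = reps),
  Altitude = "Low",
  Sex      = "F",
  ageclass = rep(ages, times = reps)
)

# Hypothetical helper: number of distinct values in a vector
n_unique <- function(x) length(unique(x))

# Apply the helper to RefID within each ageclass group
res <- tapply(df$RefID, df$ageclass, n_unique)
print(res)
```

`tapply` returns a named array with one entry per ageclass, so it holds the same counts as the `Num` column in the ddply output above.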

纵山崖 2024-10-24 10:19:51

To get the unique rows of a data frame, see ?unique:

Data <- unique(Mydata)

You can then use by:

by(Data,Data$ageclass,summary)

See also ?summary to understand the output. If you are interested in counts, you can use table, e.g.:

table(Data$RefID,Data$ageclass)

or, for a summary:

margin.table(table(Data$RefID,Data$ageclass),margin=2)

EDIT:
You'll have to be a bit careful, as unique() keeps the unique rows. If both a male and a female have RefID 1-1, you'll still count it twice. I presume that won't be the case in your data, but if you really want to make sure, you can do:

with(unique(Data[c(1,4)]),margin.table(table(RefID,ageclass),margin=2))

or take the plyr solution mentioned here.
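
The caveat in the edit above can be seen on a tiny made-up data set. This is only a sketch: the rows below are hypothetical (they are not from the question), and `all_cols` and `key_cols` are names chosen for illustration.

```r
# Hypothetical data: RefID "1-1" appears with two different Sex values
Data <- data.frame(
  RefID    = c("1-1", "1-1", "1-1", "1-2", "1-2"),
  Sex      = c("F", "M", "F", "F", "F"),
  ageclass = c("young", "young", "young", "old", "old")
)

# unique() on all columns keeps both the F row and the M row for "1-1"
all_cols <- table(unique(Data)$ageclass)

# Restricting to the key columns first leaves one row per RefID/ageclass pair
key_cols <- table(unique(Data[c("RefID", "ageclass")])$ageclass)

print(all_cols)  # "young" is counted twice here
print(key_cols)  # "young" is counted once here
```

Selecting the columns by name rather than by position (`c(1,4)`) is a design choice made for this sketch; it keeps the code valid if the column order changes.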

半衬遮猫 2024-10-24 10:19:51

With subset you make a subset of the data, and with duplicated you get a logical vector indicating whether a value has already occurred in a vector. First, a small sample dataset:

df <- data.frame(
   ID=rep(1:5,each=5),
   attitude="low",
   sex=c(rep("F",10),rep("M",15)),
   age=c(rep("young",5),rep("middle",10),rep("old",10)),
   stringsAsFactors=TRUE  # needed in R >= 4.0 so that summary() reports counts per level
   )

Then you can make a subset that keeps only the first row recorded for each ID:

df.sub <- subset(df,!duplicated(df$ID))

Then you can summarize:

> summary(df.sub$age)
middle    old  young 
     2      2      1 
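
To make the behaviour of duplicated concrete, here is a minimal sketch on a short made-up vector. The values and the names `ids`, `dup`, `first_seen`, `age` and `counts` are assumptions for illustration only. It also uses table() for the final count, which works for character and factor columns alike.

```r
ids <- c("1-1", "1-1", "1-2", "1-1", "1-3")

# TRUE marks a value that has already appeared earlier in the vector
dup <- duplicated(ids)
print(dup)         # FALSE TRUE FALSE TRUE FALSE

# Negating it keeps the first occurrence of each ID
first_seen <- ids[!dup]
print(first_seen)  # "1-1" "1-2" "1-3"

# Count age classes over the first occurrences only
age <- c("young", "young", "old", "young", "old")
counts <- table(age[!dup])
print(counts)
```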