Grouping rows by attribute

Posted 2024-10-05 06:46:13

I have a data frame containing data about student lateness to various classes. Each row contains data about a late student and his class: date and time of the class, name of the class, class size, number of minutes late, and the gender of the student. In order to get the total percentage of late students for all classes, I need to count the number of rows (late students) and compare that with the total number of students that attended class.

I can't simply sum the class sizes for all of the rows; that would count the students of a given class several times, once for each late student in the class. Instead, I need to count each class size only once for each meeting of the class.

Example

Key: date of class, class name, students in attendance, gender of tardy student, minutes late.

11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2

In this case, there are three different class meetings and 11 late students. How could I make sure each class meeting's class size is only counted once?
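For readers who want to follow along, here is a minimal sketch of the example as an R data frame. The column names (`Date`, `Class`, `Size`, `Sex`, `MinsLate`) are assumptions, chosen to match the ones used in the first answer below; the question itself does not name them.

```r
# Reconstruct the 11 example rows: 3 + 3 + 5 late students over three meetings.
# Column names are assumed, not taken from the question.
reps <- c(3, 3, 5)
df <- data.frame(
  Date     = as.Date(rep(c("2010-11-12", "2010-11-15", "2010-11-16"), times = reps)),
  Class    = rep(c("Stats", "Stats", "Radar"), times = reps),
  Size     = rep(c(30, 40, 22), times = reps),
  Sex      = rep(c("M", "F", "M"), times = reps),
  MinsLate = rep(c(1, 3, 2), times = reps)
)

# The naive sum over rows counts each class size once per late student:
sum(df$Size)   # 320, not the 92 students who actually attended
```

This makes the overcounting problem concrete: 30 is added three times, 40 three times, and 22 five times.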

情深缘浅 2024-10-12 06:46:14

Edit: My solution can be made a lot simpler by first computing the trivial % late on a per-row basis, then using aggregate() to sum these percentages by Date and Class:

> df2 <- within(df, pcLate <- 100 * (1 / Size)) 
> df2
         Date Class Size Sex MinsLate   pcLate
1  2010-11-12 Stats   30   M        1 3.333333
2  2010-11-12 Stats   30   M        1 3.333333
3  2010-11-12 Stats   30   M        1 3.333333
4  2010-11-15 Stats   40   F        3 2.500000
5  2010-11-15 Stats   40   F        3 2.500000
6  2010-11-15 Stats   40   F        3 2.500000
7  2010-11-16 Radar   22   M        2 4.545455
8  2010-11-16 Radar   22   M        2 4.545455
9  2010-11-16 Radar   22   M        2 4.545455
10 2010-11-16 Radar   22   M        2 4.545455
11 2010-11-16 Radar   22   M        2 4.545455
> with(df2, aggregate(pcLate, by = list(Date = Date, Class = Class), sum))
        Date Class        x
1 2010-11-16 Radar 22.72727
2 2010-11-12 Stats 10.00000
3 2010-11-15 Stats  7.50000

Original Answer:

Assuming df contains the example data you provide, we can do this in a couple of steps using aggregate():

First, grab the number of late students per class:

summ <- with(df, aggregate(MinsLate, by = list(Date = Date, Class = Class),
                           FUN = length))
names(summ)[3] <- "nLate"

Which gives us this starting point:

> head(summ)
        Date Class nLate
1 2010-11-16 Radar     5
2 2010-11-12 Stats     3
3 2010-11-15 Stats     3

Then form the class sizes:

summ$Size <- with(df, aggregate(Size, by = list(Date = Date, Class = Class),
                                FUN = unique)$x)

Which gets us to here:

> head(summ)
        Date Class nLate Size
1 2010-11-16 Radar     5   22
2 2010-11-12 Stats     3   30
3 2010-11-15 Stats     3   40

Then compute the percentage late:

summ <- within(summ, pcLate <- 100 * (nLate / Size))

Which results in:

> head(summ)
        Date Class nLate Size   pcLate
1 2010-11-16 Radar     5   22 22.72727
2 2010-11-12 Stats     3   30 10.00000
3 2010-11-15 Stats     3   40  7.50000

If you need to do this a lot, wrap it into a function:

tardiness <- function(df) {
    out <- with(df, aggregate(MinsLate, by = list(Date = Date, Class = Class),
                              FUN = length))
    names(out)[3] <- "nLate"
    out$Size <- with(df, aggregate(Size, by = list(Date = Date, Class = Class),
                                   FUN = unique)$x)
    out <- within(out, pcLate <- 100 * (nLate / Size))
    out
}

This does all the steps for us:

> tardiness(df)
        Date Class nLate Size   pcLate
1 2010-11-16 Radar     5   22 22.72727
2 2010-11-12 Stats     3   30 10.00000
3 2010-11-15 Stats     3   40  7.50000

暮年 2024-10-12 06:46:13

If I understand what you want correctly, this is easier to do with the plyr package than with tapply or by, because it understands what amounts to a multivariate grouping. For instance:

library(plyr)  # provides ddply()
ddply(df, .(DATE, CLASS), transform, PERCENT_LATE = length(MINUTES.LATE) / CLASS.SIZE)

The argument to length here can be any of the column names. ddply will split your data frame for each combination of DATE and CLASS factor levels. The number of rows in each mini data frame should then correspond to how many late students there were (since there is an entry for each late student). That is where the length(any variable) comes in. Divide it by the class size column for the fraction (multiply by 100 for a percentage).
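If plyr is not available, roughly the same per-row result can be sketched in base R with `ave()`. This is a substitute technique, not the answerer's method; the column names follow the ones assumed in this answer (`DATE`, `CLASS`, `CLASS.SIZE`, `MINUTES.LATE`), and the result is scaled to a percentage.

```r
# Example data with the column names this answer assumes.
reps <- c(3, 3, 5)
df <- data.frame(
  DATE         = rep(c("11/12/10", "11/15/10", "11/16/10"), times = reps),
  CLASS        = rep(c("Stats", "Stats", "Radar"), times = reps),
  CLASS.SIZE   = rep(c(30, 40, 22), times = reps),
  MINUTES.LATE = rep(c(1, 3, 2), times = reps)
)

# ave() returns, for every row, the number of rows in that row's DATE/CLASS group.
n_late <- ave(df$MINUTES.LATE, df$DATE, df$CLASS, FUN = length)
df$PERCENT_LATE <- 100 * n_late / df$CLASS.SIZE

# Like transform in ddply, this repeats the value on every row; collapse if needed.
unique(df[, c("DATE", "CLASS", "PERCENT_LATE")])
```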

醉生梦死 2024-10-12 06:46:13

To follow on @Gavin's comment re: the redundant output, using summarise:

library(plyr)  # provides ddply() and summarise()
# length(DATE) is the number of rows (late students) in each DATE/CLASS group
df.out <- ddply(x, .(DATE, CLASS), summarise,
    NLATE  = length(DATE),
    SIZE   = unique(CLASS.SIZE),
    PCLATE = 100 * length(DATE) / unique(CLASS.SIZE)
    )
> df.out
      DATE CLASS NLATE SIZE PCLATE
1 11/12/10 Stats     3   30  10.00
2 11/15/10 Stats     3   40   7.50
3 11/16/10 Radar     5   22  22.73

枕梦 2024-10-12 06:46:13

Use different functions for the number of late students and the class size. A "paste" strategy is needed to create unique combinations of date and class name:

>  sum_late <- tapply( tst$V5, paste(tst$V1, tst$V2, sep="_"), length)
>  csize <- tapply( tst$V3, paste(tst$V1, tst$V2, sep="_"), head,1)
> pct_late <- 100*sum_late/csize
> pct_late
11/12/10_Stats 11/15/10_Stats 11/16/10_Radar 
      10.00000        7.50000       22.72727

Or with aggregate:

>  dfcount <- aggregate( tst$V5, list(tst$V1, tst$V2), length)
> dfcount$pct <- 100*aggregate( tst$V5, list(tst$V1, tst$V2), length)$x/aggregate( tst$V3, list(tst$V1, tst$V2), head,1)$x
> dfcount
   Group.1 Group.2 x      pct
1 11/16/10   Radar 5 22.72727
2 11/12/10   Stats 3 10.00000
3 11/15/10   Stats 3  7.50000
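Both `head(, 1)` here and `unique` in the first answer rely on the class size being identical on every row of a given meeting. A minimal sketch of a check for that assumption, using the same `tst` column names (`V1` date, `V2` class, `V3` size) as above:

```r
# Example data with the V1..V3 names this answer uses (only the columns needed here).
reps <- c(3, 3, 5)
tst <- data.frame(
  V1 = rep(c("11/12/10", "11/15/10", "11/16/10"), times = reps),
  V2 = rep(c("Stats", "Stats", "Radar"), times = reps),
  V3 = rep(c(30, 40, 22), times = reps)
)

key <- paste(tst$V1, tst$V2, sep = "_")

# Number of distinct sizes recorded per meeting; anything above 1 means
# the data are inconsistent and taking the first value would hide that.
n_sizes <- tapply(tst$V3, key, function(s) length(unique(s)))
stopifnot(all(n_sizes == 1))
```

If the check fails, `FUN = unique` in aggregate() would return several values for that meeting rather than a single size, so it is worth running before either approach.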
