Grouping rows by attribute
I have a data frame containing data about student lateness to various classes. Each row contains data about a late student and his class: date and time of the class, name of the class, class size, number of minutes late, and the gender of the student. In order to get the total percentage of late students for all classes, I need to count the number of rows (late students) and compare that with the total number of students that attended class.
I can't simply sum the class sizes for all of the rows; that would count the students of a given class several times, once for each late student in the class. Instead, I need to count each class size only once for each meeting of the class.
Example
Key: date, class name, students in attendance, gender of tardy student, minutes late.
11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
In this case, there are three different class meetings and 11 late students. How could I make sure each class meeting's class size is only counted once?
Comments (4)
Edit (original answer kept below): My solution can be made a lot simpler by first computing the trivial % late on a per-row basis, then using aggregate() to sum these percentages by Date and Class.
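A minimal sketch of the simplification described in the edit. The code originally posted here is not available, so this is an illustration under assumptions: the column names (`Date`, `Class`, `Size`, `Gender`, `Late`) and the `PctLate` column are hypothetical, and `df` is rebuilt from the question's example rows.

```r
# Example data from the question; the column names are assumptions.
n <- c(3, 3, 5)  # late students (rows) per class meeting
df <- data.frame(
  Date   = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  Class  = rep(c("Stats", "Stats", "Radar"), times = n),
  Size   = rep(c(30, 40, 22), times = n),
  Gender = rep(c("M", "F", "M"), times = n),
  Late   = rep(c(1, 3, 2), times = n)
)

# Each row is one late student, so it contributes 100 / Size percent
# to the lateness of its class meeting.
df$PctLate <- 100 / df$Size

# Summing the per-row shares within each Date/Class combination
# gives the percentage late for that class meeting.
pct <- aggregate(PctLate ~ Date + Class, data = df, FUN = sum)
print(pct)
```

Because the class size is divided out row by row, no class size is ever summed more than once.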
Assuming df contains the example data you provide, we can do this in a couple of steps using aggregate().
First, grab the number of late students per class:
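One way this first step might look, as a sketch rather than the code originally posted. The column names and the `NumLate` label are assumptions.

```r
# Example data from the question; the column names are assumptions.
n <- c(3, 3, 5)
df <- data.frame(
  Date  = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  Class = rep(c("Stats", "Stats", "Radar"), times = n),
  Size  = rep(c(30, 40, 22), times = n),
  Late  = rep(c(1, 3, 2), times = n)
)

# Every row is one late student, so counting rows per Date/Class
# combination gives the number of late students per class meeting.
late <- aggregate(Late ~ Date + Class, data = df, FUN = length)
names(late)[3] <- "NumLate"  # hypothetical, more descriptive name
print(late)
```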
Which gives us this starting point
Then form the class sizes:
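A sketch of how the class sizes could be collected once per meeting. It assumes the size is constant within a Date/Class group and simply takes the first value. The column names are again assumptions.

```r
# Example data from the question; the column names are assumptions.
n <- c(3, 3, 5)
df <- data.frame(
  Date  = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  Class = rep(c("Stats", "Stats", "Radar"), times = n),
  Size  = rep(c(30, 40, 22), times = n),
  Late  = rep(c(1, 3, 2), times = n)
)

# The class size repeats on every row of a meeting, so keep only the
# first value in each Date/Class group.
sizes <- aggregate(Size ~ Date + Class, data = df, FUN = function(x) x[1])
print(sizes)
```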
Which gets us to here:
Then compute the percentage late:
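Combining the two intermediate results might look like the sketch below. The `merge()` step and the `NumLate` and `PctLate` names are assumptions made for illustration, not necessarily what the answer originally used.

```r
# Example data from the question; the column names are assumptions.
n <- c(3, 3, 5)
df <- data.frame(
  Date  = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  Class = rep(c("Stats", "Stats", "Radar"), times = n),
  Size  = rep(c(30, 40, 22), times = n),
  Late  = rep(c(1, 3, 2), times = n)
)

# Number of late students and class size, one row per class meeting.
late  <- aggregate(Late ~ Date + Class, data = df, FUN = length)
sizes <- aggregate(Size ~ Date + Class, data = df, FUN = function(x) x[1])

# Join on the grouping columns, then compute the percentage late.
res <- merge(late, sizes, by = c("Date", "Class"))
names(res)[names(res) == "Late"] <- "NumLate"
res$PctLate <- 100 * res$NumLate / res$Size
print(res)
```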
Which results in:
If you need to do this a lot, wrap it into a function that does all the steps for us:
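A possible wrapper is sketched here. The function name `pct_late` and the column names it expects are hypothetical, and the original function from the answer is not reproduced.

```r
# Hypothetical helper: expects columns Date, Class, Size and Late.
pct_late <- function(data) {
  # Count late students (rows) per class meeting.
  late <- aggregate(Late ~ Date + Class, data = data, FUN = length)
  # Take the class size once per class meeting.
  sizes <- aggregate(Size ~ Date + Class, data = data,
                     FUN = function(x) x[1])
  res <- merge(late, sizes, by = c("Date", "Class"))
  names(res)[names(res) == "Late"] <- "NumLate"
  res$PctLate <- 100 * res$NumLate / res$Size
  res
}

# Usage with the question's example data (column names assumed).
n <- c(3, 3, 5)
df <- data.frame(
  Date  = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  Class = rep(c("Stats", "Stats", "Radar"), times = n),
  Size  = rep(c(30, 40, 22), times = n),
  Late  = rep(c(1, 3, 2), times = n)
)
out <- pct_late(df)
print(out)
```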
If I understand what you want correctly, this is easier to do with the plyr package than with tapply or by, because it understands what amounts to a multivariate grouping. For instance:
The argument to length here can be any of the column names. ddply will split your dataframe for each combination of DATE and CLASS factor levels. The number of rows in each mini dataframe should then correspond to how many late students there were (since there is an entry for each late student). That is where the length(any variable) comes in. Divide it by the class size column for the fraction.
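The call described above might be sketched as follows. DATE and CLASS are the names used in the answer. SIZE, LATE and FRAC_LATE are assumptions, and using `transform` as the function applied to each group is a guess at the original call.

```r
library(plyr)

# Example data from the question; SIZE and LATE are assumed names.
n <- c(3, 3, 5)
df <- data.frame(
  DATE  = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  CLASS = rep(c("Stats", "Stats", "Radar"), times = n),
  SIZE  = rep(c(30, 40, 22), times = n),
  LATE  = rep(c(1, 3, 2), times = n)
)

# Split by DATE and CLASS; within each piece the number of rows is the
# number of late students, which is divided by the class size.
out <- ddply(df, .(DATE, CLASS), transform,
             FRAC_LATE = length(LATE) / SIZE)
print(out)
```

With `transform`, every input row is kept, so the fraction repeats once per late student.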
To follow on @Gavin's comment re: the redundant output, using summarise:
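A sketch of the `summarise` variant, which returns one row per class meeting. As before, SIZE, LATE and FRAC_LATE are assumed names, and taking `SIZE[1]` assumes the class size is constant within a meeting.

```r
library(plyr)

# Example data from the question; SIZE and LATE are assumed names.
n <- c(3, 3, 5)
df <- data.frame(
  DATE  = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  CLASS = rep(c("Stats", "Stats", "Radar"), times = n),
  SIZE  = rep(c(30, 40, 22), times = n),
  LATE  = rep(c(1, 3, 2), times = n)
)

# summarise collapses each DATE/CLASS piece to a single row.
out <- ddply(df, .(DATE, CLASS), summarise,
             FRAC_LATE = length(LATE) / SIZE[1])
print(out)
```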
Use different functions for the number late and the class size. You need a "paste" strategy to create unique combinations of date and class name:
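The paste strategy could be sketched with `tapply()` as below. The column names, the `key` variable and the final overall percentage are assumptions added for illustration.

```r
# Example data from the question; the column names are assumptions.
n <- c(3, 3, 5)
df <- data.frame(
  Date  = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  Class = rep(c("Stats", "Stats", "Radar"), times = n),
  Size  = rep(c(30, 40, 22), times = n),
  Late  = rep(c(1, 3, 2), times = n)
)

# One key per class meeting, e.g. "11/12/10 Stats".
key <- paste(df$Date, df$Class)

# length() counts late students; taking the first value gives the
# class size once per meeting.
n_late <- tapply(df$Late, key, length)
sizes  <- tapply(df$Size, key, function(x) x[1])

# Overall percentage late across all class meetings.
overall <- 100 * sum(n_late) / sum(sizes)
print(overall)
```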
Or with aggregate:
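An `aggregate()` version using the `by =` list interface might look like this sketch. The column names and the `merge()` step are assumptions, not the answer's original code.

```r
# Example data from the question; the column names are assumptions.
n <- c(3, 3, 5)
df <- data.frame(
  Date  = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  Class = rep(c("Stats", "Stats", "Radar"), times = n),
  Size  = rep(c(30, 40, 22), times = n),
  Late  = rep(c(1, 3, 2), times = n)
)

# Grouping on two columns at once removes the need for paste().
groups <- list(Date = df$Date, Class = df$Class)

counts <- aggregate(df$Late, by = groups, FUN = length)
names(counts)[3] <- "NumLate"
sizes <- aggregate(df$Size, by = groups, FUN = function(x) x[1])
names(sizes)[3] <- "Size"

# One row per class meeting, then the overall percentage late.
res <- merge(counts, sizes, by = c("Date", "Class"))
overall <- 100 * sum(res$NumLate) / sum(res$Size)
print(res)
print(overall)
```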