从 R 中的分层数据中提取特定数据
我有一个由 6 列组成的数据框。第 1 至 5 列每列都有离散名称/值,例如地区、年份、月份、年龄区间和性别。第六列是该特定组合的死亡计数。
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 -1 0
2 Eastern Female 2003 1 -2 2
3 Eastern Female 2003 1 0 2
4 Eastern Female 2003 1 01-4 1
5 Eastern Female 2003 1 05-09 0
6 Eastern Female 2003 1 10-14 1
7 Eastern Female 2003 1 15-19 0
8 Eastern Female 2003 1 20-24 4
9 Eastern Female 2003 1 25-29 9
10 Eastern Female 2003 1 30-34 3
11 Eastern Female 2003 1 35-39 7
12 Eastern Female 2003 1 40-44 5
13 Eastern Female 2003 1 45-49 5
14 Eastern Female 2003 1 50-54 8
15 Eastern Female 2003 1 55-59 5
16 Eastern Female 2003 1 60-64 4
17 Eastern Female 2003 1 65-69 7
18 Eastern Female 2003 1 70-74 8
19 Eastern Female 2003 1 75-79 5
20 Eastern Female 2003 1 80-84 10
21 Eastern Female 2003 1 85+ 11
22 Eastern Female 2003 2 -1 0
23 Eastern Female 2003 2 -2 0
24 Eastern Female 2003 2 0 4
25 Eastern Female 2003 2 01-4 1
26 Eastern Female 2003 2 05-09 2
27 Eastern Female 2003 2 10-14 2
28 Eastern Female 2003 2 15-19 0
我想从这个大数据框中过滤或提取较小的数据框。 例如,我只想有四个年龄组。这四个年龄组将分别包含:
Group 0: Consisting of Age.Group -1, -2 and 0.
Group 1-4: Consisting of Age.Group 01-4
Group 5-14: Consisting of Age.Group 05-09 and 10-14
Group 15+: Consisting of Age.Group 15-19 to 85+
Total.Deaths
将是每个组的总和。
所以我希望它看起来像这样
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 0 4
2 Eastern Female 2003 1 01-4 1
3 Eastern Female 2003 1 05-14 1
4 Eastern Female 2003 1 15+ 104
5 Eastern Female 2003 2 0 4
6 Eastern Female 2003 2 01-4 1
7 Eastern Female 2003 2 05-14 4
8 Eastern Female 2003 2 15+ ...
我有很多数据并且已经搜索了几天,但无法找到一个函数来帮助做到这一点。
I have a dataframe made up of 6 columns. Columns 1 to 5 each have discrete names/values, such as a district, year, month, age interval and gender. The sixth column is the number of death counts for that specific combination.
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 -1 0
2 Eastern Female 2003 1 -2 2
3 Eastern Female 2003 1 0 2
4 Eastern Female 2003 1 01-4 1
5 Eastern Female 2003 1 05-09 0
6 Eastern Female 2003 1 10-14 1
7 Eastern Female 2003 1 15-19 0
8 Eastern Female 2003 1 20-24 4
9 Eastern Female 2003 1 25-29 9
10 Eastern Female 2003 1 30-34 3
11 Eastern Female 2003 1 35-39 7
12 Eastern Female 2003 1 40-44 5
13 Eastern Female 2003 1 45-49 5
14 Eastern Female 2003 1 50-54 8
15 Eastern Female 2003 1 55-59 5
16 Eastern Female 2003 1 60-64 4
17 Eastern Female 2003 1 65-69 7
18 Eastern Female 2003 1 70-74 8
19 Eastern Female 2003 1 75-79 5
20 Eastern Female 2003 1 80-84 10
21 Eastern Female 2003 1 85+ 11
22 Eastern Female 2003 2 -1 0
23 Eastern Female 2003 2 -2 0
24 Eastern Female 2003 2 0 4
25 Eastern Female 2003 2 01-4 1
26 Eastern Female 2003 2 05-09 2
27 Eastern Female 2003 2 10-14 2
28 Eastern Female 2003 2 15-19 0
I would like to filter, or extract, smaller dataframes from this big dataframe.
For example, I would like to only have four age groups. These four age groups will each contain:
Group 0: Consisting of Age.Group -1, -2 and 0.
Group 1-4: Consisting of Age.Group 01-4
Group 5-14: Consisting of Age.Group 05-09 and 10-14
Group 15+: Consisting of Age.Group 15-19 to 85+
The Total.Deaths
will then be the sum for each of these groups.
So I want it to look like this
District Gender Year Month Age.Group Total.Deaths
1 Eastern Female 2003 1 0 4
2 Eastern Female 2003 1 01-4 1
3 Eastern Female 2003 1 05-14 1
4 Eastern Female 2003 1 15+ 104
5 Eastern Female 2003 2 0 4
6 Eastern Female 2003 2 01-4 1
7 Eastern Female 2003 2 05-14 4
8 Eastern Female 2003 2 15+ ...
I have a lot of data and have searched for a few days, but unable to find a function to help be do this.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
可能有一种更简洁的方法可以使用
car
包中的recode
来重新编码您的年龄变量,特别是因为您可以方便地使用排序级别对当前年龄变量进行编码很好地作为角色。但对于只有几个级别,我通常只是通过创建一个新的年龄变量来手动重新编码,并且这种方法是在 R 中“完成工作”的良好实践:然后我们可以使用 ddply 生成摘要> 和
总结
:一开始我很担心,因为我有 91 人死亡,而不是你所说的 104 人,但我用手算了一下,我认为 91 人是正确的。也许是一个错字。
There may be a pithier way of recoding your age variable using something like
recode
from thecar
package, particularly since you've conveniently got your current age variable coded with levels that sort nicely as characters. But for only a few levels, I often just recode them by hand by creating a new age variable, and this method is good practice for just 'getting stuff done' in R:Then we can generate summaries using
ddply
andsummarise
:I was worried at first because I got 91 deaths instead of 104 as you indicated, but I counted by hand and 91 is right I think. A typo, perhaps.