Extracting specific data from hierarchical data in R

Posted on 2024-11-19 09:35:22

I have a data frame with 6 columns. Columns 1 to 5 each hold discrete names/values: district, gender, year, month and age interval. The sixth column is the death count for that specific combination.

               District Gender Year Month Age.Group Total.Deaths
1              Eastern  Female 2003     1        -1            0
2              Eastern  Female 2003     1        -2            2
3              Eastern  Female 2003     1         0            2
4              Eastern  Female 2003     1      01-4            1
5              Eastern  Female 2003     1     05-09            0
6              Eastern  Female 2003     1     10-14            1
7              Eastern  Female 2003     1     15-19            0
8              Eastern  Female 2003     1     20-24            4
9              Eastern  Female 2003     1     25-29            9
10             Eastern  Female 2003     1     30-34            3
11             Eastern  Female 2003     1     35-39            7
12             Eastern  Female 2003     1     40-44            5
13             Eastern  Female 2003     1     45-49            5
14             Eastern  Female 2003     1     50-54            8
15             Eastern  Female 2003     1     55-59            5
16             Eastern  Female 2003     1     60-64            4
17             Eastern  Female 2003     1     65-69            7
18             Eastern  Female 2003     1     70-74            8
19             Eastern  Female 2003     1     75-79            5
20             Eastern  Female 2003     1     80-84           10
21             Eastern  Female 2003     1       85+           11
22             Eastern  Female 2003     2        -1            0
23             Eastern  Female 2003     2        -2            0
24             Eastern  Female 2003     2         0            4
25             Eastern  Female 2003     2      01-4            1
26             Eastern  Female 2003     2     05-09            2
27             Eastern  Female 2003     2     10-14            2
28             Eastern  Female 2003     2     15-19            0

I would like to filter, or extract, smaller dataframes from this big dataframe.
For example, I would like to only have four age groups. These four age groups will each contain:

Group 0: Consisting of Age.Group -1, -2 and 0.
Group 1-4: Consisting of Age.Group 01-4
Group 5-14: Consisting of Age.Group 05-09 and 10-14
Group 15+: Consisting of Age.Group 15-19 to 85+

The Total.Deaths will then be the sum for each of these groups.

So I want it to look like this

               District Gender Year Month Age.Group Total.Deaths
1              Eastern  Female 2003     1         0            4
2              Eastern  Female 2003     1      01-4            1
3              Eastern  Female 2003     1     05-14            1
4              Eastern  Female 2003     1       15+          104
5              Eastern  Female 2003     2         0            4
6              Eastern  Female 2003     2      01-4            1
7              Eastern  Female 2003     2     05-14            4
8              Eastern  Female 2003     2       15+            ...

I have a lot of data and have searched for a few days, but have been unable to find a function to help me do this.

Comments (1)

神魇的王 2024-11-26 09:35:22

There may be a pithier way of recoding your age variable using something like recode from the car package, particularly since you've conveniently got your current age variable coded with levels that sort nicely as characters. But for only a few levels, I often just recode them by hand by creating a new age variable, and this method is good practice for just 'getting stuff done' in R:

#Reading your data in from a text file that I made via copy/paste
dat <- read.table("~/Desktop/soEx.txt",sep="",header=TRUE)

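#Make sure Age.Group is ordered and init new age variable
#Note: factor() sorts the levels as character strings, and that ordering is
#locale-dependent, so check levels(dat$Age.Group) before relying on <= and >=
dat$Age.Group <- factor(dat$Age.Group,ordered=TRUE)
dat$AgeGroupNew <- rep(NA_character_,nrow(dat))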

#The recoding
dat$AgeGroupNew[dat$Age.Group <= "0"] <- "0"
dat$AgeGroupNew[dat$Age.Group == "01-4"] <- "01-4"
dat$AgeGroupNew[dat$Age.Group >= "05-09" & dat$Age.Group <= "10-14" ] <- "05-14"
dat$AgeGroupNew[dat$Age.Group > "10-14" ] <- "15+"
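
If you would rather not depend on how the factor levels happen to sort, a named lookup vector should give the same recoding. This is only a sketch in base R: the small `dat` below is retyped from the first rows of the question, and `age_map` and `recoded` are names made up for illustration. It assumes Age.Group has no missing values, since anything not found in the map is sent to "15+".

```r
# First seven rows of the question's data (other columns left out for brevity)
dat <- data.frame(
  Month = c(1, 1, 1, 1, 1, 1, 1),
  Age.Group = c("-1", "-2", "0", "01-4", "05-09", "10-14", "15-19"),
  Total.Deaths = c(0, 2, 2, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Named lookup: old age group -> new age group
age_map <- c("-1" = "0", "-2" = "0", "0" = "0",
             "01-4" = "01-4",
             "05-09" = "05-14", "10-14" = "05-14")

# as.character() matters if Age.Group is a factor: indexing by a factor
# would use its integer codes instead of its labels
recoded <- unname(age_map[as.character(dat$Age.Group)])

# Everything not listed in the map (15-19 up to 85+) becomes "15+"
recoded[is.na(recoded)] <- "15+"
dat$AgeGroupNew <- recoded
```

The mapping is spelled out level by level, so it does not matter whether "-1" sorts before or after "0" on your machine.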

Then we can generate summaries using ddply and summarise from the plyr package:

library(plyr)  #provides ddply(), summarise() and .()
datNew <- ddply(dat,.(District,Gender,Year,Month,AgeGroupNew),summarise,
            TotalDeaths = sum(Total.Deaths))
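
For anyone without plyr, a similar summary can probably be had with base R's aggregate(). A minimal sketch, assuming the 28 sample rows from the question (retyped by hand below) and using a nested ifelse() for the recoding in place of the ordered-factor comparisons above:

```r
# The 21 age groups in the question, in the order they appear
ages <- c("-1", "-2", "0", "01-4", "05-09", "10-14", "15-19", "20-24", "25-29",
          "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64",
          "65-69", "70-74", "75-79", "80-84", "85+")

# The 28 sample rows: all of month 1, plus the first 7 age groups of month 2
dat <- data.frame(
  District = "Eastern", Gender = "Female", Year = 2003,
  Month = c(rep(1, 21), rep(2, 7)),
  Age.Group = c(ages, ages[1:7]),
  Total.Deaths = c(0, 2, 2, 1, 0, 1, 0, 4, 9, 3, 7, 5, 5, 8, 5, 4, 7, 8, 5, 10, 11,
                   0, 0, 4, 1, 2, 2, 0),
  stringsAsFactors = FALSE
)

# Recode by explicit membership; anything not listed falls into "15+"
dat$AgeGroupNew <- ifelse(dat$Age.Group %in% c("-1", "-2", "0"), "0",
                   ifelse(dat$Age.Group == "01-4", "01-4",
                   ifelse(dat$Age.Group %in% c("05-09", "10-14"), "05-14", "15+")))

# Sum Total.Deaths within each District/Gender/Year/Month/AgeGroupNew combination
datNew <- aggregate(Total.Deaths ~ District + Gender + Year + Month + AgeGroupNew,
                    data = dat, FUN = sum)

# Sort by month, then age group, to match the layout asked for in the question
datNew <- datNew[order(datNew$Month, datNew$AgeGroupNew), ]
```

On these sample rows the month 1 totals come out as 4, 1, 1 and 91 for the four groups. The month 2 rows only reflect the seven age groups that were pasted into the question.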

I was worried at first because I got 91 deaths for the 15+ group in month 1 instead of the 104 you indicated, but I added up those rows by hand and I think 91 is right. Perhaps the 104 is a typo.
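
The hand count is easy to reproduce. The vector below is just the fifteen Total.Deaths values for month 1 from 15-19 through 85+, copied from the question; `deaths_15_plus` is a name chosen here for illustration.

```r
# Total.Deaths for month 1, age groups 15-19 through 85+ (rows 7 to 21 of the question)
deaths_15_plus <- c(0, 4, 9, 3, 7, 5, 5, 8, 5, 4, 7, 8, 5, 10, 11)

sum(deaths_15_plus)  # 91
```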
