Removing NAs when mutating a new variable by group (dplyr)
I am working on a university project with EU-SILC data.
I want to create a new variable that assigns each household to its corresponding housing cost group, so that I can draw a stacked density plot of the income distribution in relation to housing cost.
I encountered two problems:
- I cannot create the variable hcost_group because my housing cost variable, which is the basis for assigning households to the groups, has 47 NAs (out of nearly 70,000 observations). I tried many different ways to remove the NAs when creating the new variable, but I keep getting an error message.
- As I don't want to remove the households without a housing cost from the data altogether, the hcost_group variable will be shorter than my income variable. How can I exclude, just for the plot, the income of the households for which I don't have a housing cost?
Thanks a lot in advance!
Here is my code (incl. error messages) for creating the variable and the plot:
library(dplyr)    # %>%, filter(), group_by(), mutate()
library(gtools)   # quantcut()
library(ggplot2)

data <- data %>% filter(!is.na(hcost)) %>% group_by(country) %>%
  mutate(hcost_group = quantcut(share_hc, q=c(0.1, 0.2, 0.3, 0.4)))

ggplot(data=data, aes(x=decile, group=hcost_group, fill=hcost_group)) +
  geom_density(adjust=1.5, position="fill") +
  facet_wrap(~country)+
  xlab("Einkommensdezil")+
  ylab("Anteil der Gruppen nach Wohnkostenbelastung")+
  scale_fill_discrete(name = "Wohnkostenbelastung (Anteil der Wohnkosten am EK)",
                      labels =
                        c("0-10%", "10-20%","20-30%",
                          "30-40%", "40-100%"))
I also tried "na.rm = TRUE", "na.omit()" and also "complete.cases".
EDIT:
I realized that I had used a wrong variable name (I updated the code above), and mutate no longer gives me an error. Nonetheless, the new variable contains weird numbers, and the plot then contains a lot of NAs.
Here is code to reproduce my data:
reproduced_data <-
structure(
list(
country = c("AT",
"IT", "DE"),
income_y = c(9235.28, 29867, 31975),
hcost = c(558.16,
105, 466.33),
tenure = structure(
c(3L, 5L, 3L),
.Label = c("1",
"2", "3", "4", "5"),
class = "factor"
),
rooms = structure(
2:4,
.Label = c("1",
"2", "3", "4", "5", "6"),
class = "factor"
),
dwelling = structure(
c(4L,
2L, 3L),
.Label = c("1", "2", "3", "4"),
class = "factor"
),
leak = structure(c(2L,
1L, 2L), .Label = c("1", "2"), class = "factor"),
warm = structure(c(1L,
1L, 1L), .Label = c("1", "2"), class = "factor"),
bath = structure(
c(1L,
1L, 1L),
.Label = c("1", "2", "3"),
class = "factor"
),
toilet = structure(
c(1L,
1L, 1L),
.Label = c("1", "2", "3"),
class = "factor"
),
light = structure(c(2L,
1L, 2L), .Label = c("1", "2"), class = "factor"),
noise = structure(c(1L,
1L, 1L), .Label = c("1", "2"), class = "factor"),
pollution = structure(c(2L,
1L, 1L), .Label = c("1", "2"), class = "factor"),
crime = structure(c(2L,
1L, 2L), .Label = c("1", "2"), class = "factor"),
share_hc = c(72.5253592744345,
4.2187029162621, 17.5010476935106),
high_hcost = c("1", "0",
"0"),
decile = c(1L, 6L, 6L)
),
row.names = c(NA,-3L),
groups = structure(
list(
country = c("AT", "DE", "IT"),
.rows = structure(
list(1L,
3L, 2L),
ptype = integer(0),
class = c("vctrs_list_of",
"vctrs_vctr", "list")
)
),
row.names = c(NA,-3L),
class = c("tbl_df",
"tbl", "data.frame"),
.drop = TRUE
),
class = c("grouped_df",
"tbl_df", "tbl", "data.frame")
)
Comments (2)
For the first problem, I believe the issue is not the NAs (although it is hard to say without seeing the data). It seems that your quantcut call is not getting a suitable q argument: q expects an integer number of quantile groups (or a vector of quantile probabilities that covers the whole range).
For the second problem, make a separate data frame with the filtered data.
It would also not be possible to make your mutate before the group_by.
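A minimal sketch of both points, written in base R only so that it runs without extra packages. The data frame households and its values are made up for illustration. The first part cuts at the 10% to 40% quantiles, which is what I understand quantcut to do when q is a vector of probabilities, so treat that as an assumption. The second part uses cut() with fixed breaks, which seems to match the labels in the plot. The last part filters a copy for plotting only.

```r
# Hypothetical toy data: housing cost share in percent, one value missing
households <- data.frame(
  country  = c("AT", "AT", "AT", "DE", "DE", "DE"),
  decile   = c(1L, 3L, 6L, 2L, 6L, 9L),
  share_hc = c(72.5, 4.2, 17.5, 35.0, NA, 25.1)
)

# Part 1: cut at the 10%..40% quantiles only (assumed to mirror
# quantcut(share_hc, q = c(0.1, 0.2, 0.3, 0.4))). Every value below the
# 10% quantile or above the 40% quantile lies outside all intervals,
# so cut() returns NA for it.
probs <- c(0.1, 0.2, 0.3, 0.4)
qbreaks <- quantile(households$share_hc, probs, na.rm = TRUE)
by_quantile <- cut(households$share_hc, qbreaks, include.lowest = TRUE)
sum(is.na(by_quantile))   # most of the values end up as NA

# Part 2: fixed breaks on the share itself. Shares above 100 would still
# become NA with these breaks, and a missing share stays NA.
households$hcost_group <- cut(
  households$share_hc,
  breaks = c(0, 10, 20, 30, 40, 100),
  labels = c("0-10%", "10-20%", "20-30%", "30-40%", "40-100%"),
  include.lowest = TRUE
)

# Part 3: drop the rows without a group in a copy used only for plotting;
# the full data frame keeps all households.
plot_data <- households[!is.na(households$hcost_group), ]
```

In a dplyr pipeline the same cut() call can go inside mutate(), and plot_data can be passed to ggplot(data = plot_data, ...). Fixed breaks do not depend on the country, so group_by(country) is not needed for this step.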
Does this have something to do with the random + before the mutate() call? I would also ensure that hcost is stored as a numeric vector.
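A short sketch for checking the last point, again in base R with hypothetical values. If a column was read in as a factor, as.numeric() returns the internal level codes rather than the values, which is one possible source of odd-looking numbers. Whether that is what happened in the question is an assumption.

```r
# Hypothetical column that was read in as a factor instead of numeric
hcost_raw <- factor(c("558.16", "105", "466.33", NA))

is.numeric(hcost_raw)   # FALSE: the column is a factor

# as.numeric() on a factor gives the level codes, not the original values
codes <- as.numeric(hcost_raw)

# Going through as.character() recovers the actual numbers; NA stays NA
hcost <- as.numeric(as.character(hcost_raw))
```

If is.numeric() already returns TRUE for the real column, no conversion is needed.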