Removing NAs when mutating a new variable by group (dplyr)

Posted on 2025-01-31 03:38:36

I am working on a university project with EU-SILC data.
I want to create a new variable that assigns each household to its housing cost group, so that I can draw a stacked density plot of the income distribution in relation to housing cost.

I encountered two problems:

  1. I cannot create the variable hcost_group because my housing cost variable, which is the basis for assigning households to groups, has 47 NAs (out of nearly 70,000 observations). I tried many different ways of removing the NAs when creating the new variable, but I keep getting an error message.
  2. Since I don't want to drop the households without a housing cost from the data in general, the hcost_group variable will be shorter than my income variable. How can I exclude the income of those households for the plot only?

Thanks a lot in advance!

Here is my code (incl. error messages) for creating the variable and the plot:

data <- data %>% filter(!is.na(hcost)) %>% group_by(country) %>% 
   mutate(hcost_group = quantcut(share_hc, q=c(0.1, 0.2, 0.3, 0.4)))

ggplot(data=data, aes(x=decile, group=hcost_group, fill=hcost_group)) +
   geom_density(adjust=1.5, position="fill") +
   facet_wrap(~country)+
   xlab("Einkommensdezil")+
   ylab("Anteil der Gruppen nach Wohnkostenbelastung")+
   scale_fill_discrete(name = "Wohnkostenbelastung (Anteil der Wohnkosten am EK)",
                       labels = 
                         c("0-10%", "10-20%","20-30%",
                           "30-40%", "40-100%"))

I also tried "na.rm = TRUE", "na.omit()" and "complete.cases()".

EDIT:

  • I realized that I had used the wrong variable name (I updated the code above), and mutate no longer gives me an error. Nonetheless, the new variable contains weird numbers, and the plot then contains a lot of NAs.

  • Here is code to reproduce my data:

reproduced_data <- 
  structure(
    list(
      country = c("AT",
                  "IT", "DE"),
      income_y = c(9235.28, 29867, 31975),
      hcost = c(558.16,
                105, 466.33),
      tenure = structure(
        c(3L, 5L, 3L),
        .Label = c("1",
                   "2", "3", "4", "5"),
        class = "factor"
      ),
      rooms = structure(
        2:4,
        .Label = c("1",
                   "2", "3", "4", "5", "6"),
        class = "factor"
      ),
      dwelling = structure(
        c(4L,
          2L, 3L),
        .Label = c("1", "2", "3", "4"),
        class = "factor"
      ),
      leak = structure(c(2L,
                         1L, 2L), .Label = c("1", "2"), class = "factor"),
      warm = structure(c(1L,
                         1L, 1L), .Label = c("1", "2"), class = "factor"),
      bath = structure(
        c(1L,
          1L, 1L),
        .Label = c("1", "2", "3"),
        class = "factor"
      ),
      toilet = structure(
        c(1L,
          1L, 1L),
        .Label = c("1", "2", "3"),
        class = "factor"
      ),
      light = structure(c(2L,
                          1L, 2L), .Label = c("1", "2"), class = "factor"),
      noise = structure(c(1L,
                          1L, 1L), .Label = c("1", "2"), class = "factor"),
      pollution = structure(c(2L,
                              1L, 1L), .Label = c("1", "2"), class = "factor"),
      crime = structure(c(2L,
                          1L, 2L), .Label = c("1", "2"), class = "factor"),
      share_hc = c(72.5253592744345,
                   4.2187029162621, 17.5010476935106),
      high_hcost = c("1", "0",
                     "0"),
      decile = c(1L, 6L, 6L)
    ),
    row.names = c(NA,-3L),
    groups = structure(
      list(
        country = c("AT", "DE", "IT"),
        .rows = structure(
          list(1L,
               3L, 2L),
          ptype = integer(0),
          class = c("vctrs_list_of",
                    "vctrs_vctr", "list")
        )
      ),
      row.names = c(NA,-3L),
      class = c("tbl_df",
                "tbl", "data.frame"),
      .drop = TRUE
    ),
    class = c("grouped_df",
              "tbl_df", "tbl", "data.frame")
  )

Comments (2)

若言繁花未落 2025-02-07 03:38:36

For the first problem, I believe the issue is not the NAs (although it is hard to say without seeing the data). It seems that your quantcut call is not getting the right q argument: q expects an integer...

For the second problem, make a separate data frame with the filtered data.

It would also not be possible to run your mutate before the group_by.
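
A note on the "weird numbers": as far as I understand gtools::quantcut, a vector passed to q is treated as quantile probabilities, so q = c(0.1, 0.2, 0.3, 0.4) would cut at the 10th to 40th percentiles within each country rather than at cost shares of 10% to 40%, and values outside that range would come out as NA. If fixed thresholds are what the plot labels imply, a minimal sketch with base R cut() could look like the following. The toy data frame and the break points are assumptions for illustration, and share_hc is assumed to be a percentage between 0 and 100.

```r
# Toy data standing in for the EU-SILC extract (values are made up)
toy <- data.frame(
  country  = c("AT", "AT", "IT", "IT", "DE", "DE"),
  share_hc = c(72.5, 8.0, 4.2, 35.0, 17.5, NA),
  decile   = c(1L, 7L, 6L, 3L, 6L, 2L)
)

# Fixed thresholds instead of quantiles; a missing share_hc simply stays NA
# Intervals are [0,10), [10,20), [20,30), [30,40), [40,100]
toy$hcost_group <- cut(
  toy$share_hc,
  breaks = c(0, 10, 20, 30, 40, 100),
  labels = c("0-10%", "10-20%", "20-30%", "30-40%", "40-100%"),
  include.lowest = TRUE,
  right = FALSE
)

print(table(toy$hcost_group, useNA = "ifany"))
```

Because the thresholds are the same in every country, no group_by is needed for this step. Shares above 100 would fall outside the breaks and become NA, so the last break may need to be raised if such values occur.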

沫离伤花 2025-02-07 03:38:36

Does this have something to do with the random + before the mutate() call?

data <- data %>%
    drop_na(hcost) %>%
    group_by(country) %>%
    mutate(
        hcost_group = quantcut(hcost, q = c(.1, .2, .3, .4))
    )

I would also ensure that hcost is stored as a numeric vector.
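
On the second question (excluding households only for the plot), one option is to leave the main data frame untouched and pass a filtered copy to ggplot. The sketch below uses a made-up toy data frame and the hypothetical name plot_data; it is an illustration of the idea, not a drop-in replacement for the code above.

```r
# Toy data standing in for the real survey data (values are made up)
toy <- data.frame(
  country     = c("AT", "AT", "IT", "IT", "DE", "DE"),
  income_y    = c(9235, 41000, 29867, 18000, 31975, 22000),
  decile      = c(1L, 7L, 6L, 3L, 6L, 4L),
  hcost_group = factor(c("40-100%", "0-10%", "0-10%", NA, "10-20%", NA))
)

# Subset used for plotting only; `toy` itself keeps all six households
plot_data <- toy[!is.na(toy$hcost_group), ]

# Build the plot object only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(plot_data, aes(x = decile, fill = hcost_group)) +
    geom_density(adjust = 1.5, position = "fill") +
    facet_wrap(~country)
}
```

This way the income variable in the full data keeps its original length, and only the rows without a group are left out of the figure.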
