Removing NAs when mutating a new variable by group (dplyr)

Posted on 2025-01-31 03:38:36


I am working on a university project with EU-SILC data. I want to create a new variable that assigns every household to its housing cost group, so that I can draw a stacked density plot of the income distribution in relation to housing cost.

I encountered two problems:

1. I cannot create the variable hcost_group because my housing cost variable, which is the basis for assigning households to groups, has 47 NAs (out of nearly 70,000 observations). I tried many different ways of removing the NAs when creating the new variable, but I keep getting an error message.
2. I don't want to remove the households without housing cost from the data in general, so the hcost_group variable will be shorter than my income variable. How can I exclude the income of households without a housing cost just for the plot?

Thanks a lot in advance!

Here is my code (including error messages) for creating the variable and the plot:

data <- data %>% filter(!is.na(hcost)) %>% group_by(country) %>% 
   mutate(hcost_group = quantcut(share_hc, q=c(0.1, 0.2, 0.3, 0.4)))

ggplot(data=data, aes(x=decile, group=hcost_group, fill=hcost_group)) +
   geom_density(adjust=1.5, position="fill") +
   facet_wrap(~country)+
   xlab("Einkommensdezil")+
   ylab("Anteil der Gruppen nach Wohnkostenbelastung")+
   scale_fill_discrete(name = "Wohnkostenbelastung (Anteil der Wohnkosten am EK)",
                       labels = 
                         c("0-10%", "10-20%","20-30%",
                           "30-40%", "40-100%"))

I also tried "na.rm = TRUE", "na.omit()" and "complete.cases()".

EDIT:

I realized that I had used the wrong variable name (the code above is updated), and mutate() no longer gives me an error. However, the new variable contains odd values, and the plot then contains a lot of NAs.
Here is code to reproduce my data:

reproduced_data <- 
  structure(
    list(
      country = c("AT",
                  "IT", "DE"),
      income_y = c(9235.28, 29867, 31975),
      hcost = c(558.16,
                105, 466.33),
      tenure = structure(
        c(3L, 5L, 3L),
        .Label = c("1",
                   "2", "3", "4", "5"),
        class = "factor"
      ),
      rooms = structure(
        2:4,
        .Label = c("1",
                   "2", "3", "4", "5", "6"),
        class = "factor"
      ),
      dwelling = structure(
        c(4L,
          2L, 3L),
        .Label = c("1", "2", "3", "4"),
        class = "factor"
      ),
      leak = structure(c(2L,
                         1L, 2L), .Label = c("1", "2"), class = "factor"),
      warm = structure(c(1L,
                         1L, 1L), .Label = c("1", "2"), class = "factor"),
      bath = structure(
        c(1L,
          1L, 1L),
        .Label = c("1", "2", "3"),
        class = "factor"
      ),
      toilet = structure(
        c(1L,
          1L, 1L),
        .Label = c("1", "2", "3"),
        class = "factor"
      ),
      light = structure(c(2L,
                          1L, 2L), .Label = c("1", "2"), class = "factor"),
      noise = structure(c(1L,
                          1L, 1L), .Label = c("1", "2"), class = "factor"),
      pollution = structure(c(2L,
                              1L, 1L), .Label = c("1", "2"), class = "factor"),
      crime = structure(c(2L,
                          1L, 2L), .Label = c("1", "2"), class = "factor"),
      share_hc = c(72.5253592744345,
                   4.2187029162621, 17.5010476935106),
      high_hcost = c("1", "0",
                     "0"),
      decile = c(1L, 6L, 6L)
    ),
    row.names = c(NA,-3L),
    groups = structure(
      list(
        country = c("AT", "DE", "IT"),
        .rows = structure(
          list(1L,
               3L, 2L),
          ptype = integer(0),
          class = c("vctrs_list_of",
                    "vctrs_vctr", "list")
        )
      ),
      row.names = c(NA,-3L),
      class = c("tbl_df",
                "tbl", "data.frame"),
      .drop = TRUE
    ),
    class = c("grouped_df",
              "tbl_df", "tbl", "data.frame")
  )


若言繁花未落 2025-02-07 03:38:36

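Regarding the first problem, I believe the issue is not the NAs (hard to say without seeing the data). It looks like your quantcut call is not getting a suitable q argument. q expects an integer...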
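As a rough illustration of why the q argument matters here: as far as I understand gtools::quantcut, a vector q is read as quantile probabilities, so q = c(0.1, 0.2, 0.3, 0.4) would place the cut points at the 10% to 40% quantiles of the data and leave everything outside them as NA. The sketch below mimics that with base R only (quantile() plus cut()) and compares it with fixed breaks on the percentage scale, which seems closer to the "0-10%", "10-20%" labels used in the plot. The share_hc values are made up for the example.

```r
# Hypothetical toy values standing in for share_hc (housing cost as % of income)
share_hc <- c(4.2, 17.5, 72.5, 25.0, 33.3, 8.9, 41.0, 12.1)

# Quantile-based breaks: the cut points are the 10%..40% quantiles of the data,
# so values below the first or above the last cut point fall in no interval (NA)
q_breaks <- quantile(share_hc, probs = c(0.1, 0.2, 0.3, 0.4))
by_quantile <- cut(share_hc, breaks = q_breaks, include.lowest = TRUE)

# Fixed breaks on the percentage scale: every value between 0 and 100 gets a group
by_share <- cut(
  share_hc,
  breaks = c(0, 10, 20, 30, 40, 100),
  labels = c("0-10%", "10-20%", "20-30%", "30-40%", "40-100%"),
  include.lowest = TRUE
)

sum(is.na(by_quantile))            # number of values left without a group
table(by_share, useNA = "ifany")   # group sizes with fixed breaks
```

If fixed shares are what is wanted, the cut() call could go inside mutate() in place of quantcut(); grouping by country would then make no difference to the result, because the breaks no longer depend on the data.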

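Regarding the second problem, make a separate data frame with the filtered data.

Also, it is not possible to do the mutate before the group_by.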
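One way to read the suggestion about a filtered data frame is sketched below: the full data stays untouched, and a filtered copy is used only for the plot. The toy tibble and the name plot_data are assumptions made for the example, and cut() with fixed breaks stands in for whatever grouping function is finally used.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; one household has no housing cost
data <- tibble(
  country  = c("AT", "AT", "DE", "DE"),
  decile   = c(1L, 6L, 6L, 3L),
  hcost    = c(558.16, NA, 466.33, 105),
  share_hc = c(72.5, NA, 17.5, 4.2)
)

# Filtered copy used only for plotting; `data` itself keeps all rows
plot_data <- data %>%
  filter(!is.na(hcost)) %>%
  mutate(hcost_group = cut(share_hc,
                           breaks = c(0, 10, 20, 30, 40, 100),
                           include.lowest = TRUE))

# Build the plot from the filtered copy (not printed here)
p <- ggplot(plot_data, aes(x = decile, group = hcost_group, fill = hcost_group)) +
  geom_density(adjust = 1.5, position = "fill") +
  facet_wrap(~country)
```

Because the filter is applied to a copy, the income variable in data keeps its full length for any other analysis.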


沫离伤花 2025-02-07 03:38:36


Does this have something to do with the random + before the mutate() call?

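library(dplyr)   # %>%, group_by(), mutate()
library(tidyr)   # drop_na()
library(gtools)  # quantcut()

data <- data %>%
    drop_na(hcost) %>%       # drop rows where hcost is NA
    group_by(country) %>%
    mutate(
        hcost_group = quantcut(hcost, q = c(.1, .2, .3, .4))
    )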

I would also ensure that hcost is stored as a numeric vector.
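A quick way to check the storage type, and to convert if needed, might look like the sketch below. The vector hcost_raw is a made-up example of a cost column that was read in as a factor; with real data the check would be run on data$hcost.

```r
# Hypothetical example: housing cost accidentally stored as a factor
hcost_raw <- factor(c("558.16", "105", NA, "466.33"))

is.numeric(hcost_raw)   # FALSE for a factor or character column

# as.numeric() on a factor returns the internal level codes, not the values,
# so convert to character first
hcost <- as.numeric(as.character(hcost_raw))

is.numeric(hcost)       # TRUE after the conversion
sum(is.na(hcost))       # NAs are preserved and can still be dropped later
```

If the column is already numeric, as in the reproduced data above, no conversion is needed.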

About the author: 雄赳赳气昂昂
