仅使用 Mutate 和 Case_When 来填充 NA 行

发布于 2025-01-15 16:59:04 字数 2077 浏览 2 评论 0原文

我已经盯着看几个小时了,我不知道在哪里可以找到像这样简单的问题的答案,所以我希望这不是一个重复的问题。

我有一个大数据框(936848 x 12),其中一列是一个编码名称,我可以从中导出其他列的值,在本例中,制造年份是根据列代码的第一个字符得出的。

数据框的小样本:

df <- data.frame(Code = c("AX123", "CL199", "GH679"), 
                 Year = c(NA, "2014", "2018")) 

我只想仅在值丢失时才根据列 Code 更改“年份”列。我不想覆盖年份列中的现有值。

由于这还涉及识别代码中字符串中的第一个字母,因此我使用 case_whenstartsWith

df <- df %>%
  filter(is.na(Year)) %>%
  mutate(Year = case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                          TRUE ~ NA_real_
                          ))

这将给出以下结果:

   Code Year
1 AX123 2013

我的问题是我编写此过滤器的方式输出数据框中所有非 NA 行。我想保持数据框不变,只填充 NA 行。

我正在考虑将其嵌套到 ifelse 函数中,仅当列为 NA 时才进行变异,但我对如何编写它感到困惑。

df <- df %>%
  mutate(ifelse(is.na(Year),
                case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                          TRUE ~ NA_real_
  )), "")

这显然会给出这个错误

Error: Problem with `mutate()` input `..1`.
i `..1 = ifelse(...)`.
x argument "no" is missing, with no default

我有很多类似的任务,我需要使用 ifelsegreplsubstring 等来检测中的字符代码列并在其他列中填写缺失的 NA。但是,由于许多已填充值的行是由于不遵循编码名称约定的规则的异常所致,因此我不想覆盖它们。

I've been goggling for hours and I'm not sure where to find the answer for something simple like this, so I hope this is not a duplicate question.

I have a large data frame (936848 x 12) with one column is a coded name from which I can derive the value of other column, in this case the year of manufactured based on the first character of column Code.

Small sample of the data frame:

df <- data.frame(Code = c("AX123", "CL199", "GH679"), 
                 Year = c(NA, "2014", "2018")) 

I just want to mutate the column Year based on column Code only if the value is missing. I did not want to overwrite existing value in column Year.

Since this also involve identifying the first alphabet in the string in Code, I use case_when and startsWith:

df <- df %>%
  filter(is.na(Year)) %>%
  mutate(Year = case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                          TRUE ~ NA_real_
                          ))

This will gives out this result:

   Code Year
1 AX123 2013

My problem is the way I write this filter out all non NA row in the dataframe. I want to keep the dataframe as it is, only to fill the NA row.

I'm thinking of nesting this to ifelse function, to mutate only if the column is NA, but I'm confusing myself on how to write it.

df <- df %>%
  mutate(ifelse(is.na(Year),
                case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                          TRUE ~ NA_real_
  )), "")

which will obviously give this error

Error: Problem with `mutate()` input `..1`.
i `..1 = ifelse(...)`.
x argument "no" is missing, with no default

I have a lot of similar tasks where I need to use ifelse, grepl, substring etc to detect the character in the code column and fill missing NA in other column. But because a lot of the row that are already fill with values are due to the exception from rules that does not follow the coded name convention, I did not want to overwrite them.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

巴黎盛开的樱花 2025-01-22 16:59:04

基本 R 替代方案:

# option 1: readable version
ix <- match(substr(df$Code[is.na(df$Year)],1,1), LETTERS[c(1,3:8,10)])
df$Year[is.na(df$Year)] <- ix + 2012

# option 2: direct version
df$Year[is.na(df$Year)] <- match(substr(df$Code[is.na(df$Year)],1,1), LETTERS[c(1,3:8,10)]) + 2012

给出以下结果:

<前><代码>> df
代码年份
1 AX123 2013
2 CL199 2014
3 GH679 2018

A base R alternative:

# option 1: readable version
ix <- match(substr(df$Code[is.na(df$Year)],1,1), LETTERS[c(1,3:8,10)])
df$Year[is.na(df$Year)] <- ix + 2012

# option 2: direct version
df$Year[is.na(df$Year)] <- match(substr(df$Code[is.na(df$Year)],1,1), LETTERS[c(1,3:8,10)]) + 2012

which gives the following result:

> df
   Code Year
1 AX123 2013
2 CL199 2014
3 GH679 2018
乖乖 2025-01-22 16:59:04

这是一种替代方法:

  1. 创建一个命名向量replacement
  2. 创建一个pattern来匹配
  3. 使用ifelse语句与str_detect匹配

replacement <- 2013:2020
names(replacement) <- LETTERS[c(1, 3:9)]
pattern <- paste(names(replacement), collapse = '|')

library(dplyr)
library(stringr)

df %>% 
  mutate(helper = substring(Code, 1, 1),
         Year = ifelse(is.na(Year) & str_detect(helper, pattern), 
                       replacement[match(helper, names(replacement))], Year)) %>% 
  select(-helper)
  Code Year
1 AX123 2013
2 CL199 2014
3 GH679 2018

Here is an alternative approach:

  1. Create a named vector replacement
  2. create a pattern to match
  3. Use an ifelse statement with str_detect and match

replacement <- 2013:2020
names(replacement) <- LETTERS[c(1, 3:9)]
pattern <- paste(names(replacement), collapse = '|')

library(dplyr)
library(stringr)

df %>% 
  mutate(helper = substring(Code, 1, 1),
         Year = ifelse(is.na(Year) & str_detect(helper, pattern), 
                       replacement[match(helper, names(replacement))], Year)) %>% 
  select(-helper)
  Code Year
1 AX123 2013
2 CL199 2014
3 GH679 2018
↘紸啶 2025-01-22 16:59:04

你几乎明白了。 ifelse 需要 3 个参数:

  • test (在您的情况下:is.na()
  • yes (在您的情况下:根据起始字符替换为 Year)
  • no (在您的情况下 ) case: copy Year)

df %>%
  mutate(Year1 = ifelse(is.na(Year),
                case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                ), Year))

输出:

   Code Year Year1
1 AX123 <NA>  2013
2 CL199 2014  2014
3 GH679 2018  2018

没有匹配字母的示例,如注释中所要求:

df <- data.frame(Code = c("AX123", "CL199", "GH679", "XX485"), 
                 Year = c(NA, "2014", "2018", NA))

df %>%
  mutate(Year1 = ifelse(is.na(Year),
                case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                ), Year))

输出

   Code Year Year1
1 AX123 <NA>  2013
2 CL199 2014  2014
3 GH679 2018  2018
4 XX485 <NA>  <NA>

You almost got it. ifelse requires 3 arguments:

  • test (in your case: is.na())
  • yes (in your case: replace with Year, as per starting character)
  • no (in your case: copy Year)

df %>%
  mutate(Year1 = ifelse(is.na(Year),
                case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                ), Year))

Output:

   Code Year Year1
1 AX123 <NA>  2013
2 CL199 2014  2014
3 GH679 2018  2018

Example for no matching letter, as asked for in the comments:

df <- data.frame(Code = c("AX123", "CL199", "GH679", "XX485"), 
                 Year = c(NA, "2014", "2018", NA))

df %>%
  mutate(Year1 = ifelse(is.na(Year),
                case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                ), Year))

Output

   Code Year Year1
1 AX123 <NA>  2013
2 CL199 2014  2014
3 GH679 2018  2018
4 XX485 <NA>  <NA>
瀟灑尐姊 2025-01-22 16:59:04

如果您只想操作数据帧的一部分,则可以在任何“<-”赋值的左侧部分索引其部分。

您可以通过数据框后面的括号 [] 来定义这些部分:

df[rows,columns]

有关索引的更多信息:
https://stats.oarc.ucla.edu/r/modules/subsetting -data/

在你的情况下,这可能是:

df[is.na(df$Year),] <- df %>%
  filter(is.na(Year)) %>%
  mutate(Year = case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                          TRUE ~ NA_real_))

If you only want to manipulate a part of a dataframe, you can index parts of it in the left part of any "<-" assignment.

you can define these parts by brackets [] behind the dataframe:

df[rows,columns]

more on indexing:
https://stats.oarc.ucla.edu/r/modules/subsetting-data/

In your case this could be:

df[is.na(df$Year),] <- df %>%
  filter(is.na(Year)) %>%
  mutate(Year = case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                          TRUE ~ NA_real_))
心作怪 2025-01-22 16:59:04

这是一种不同的方法,使用查找表和更新连接。应该执行得相当快。

df <- data.frame(Code = c("AX123", "CL199", "GH679"), 
                 Year = c(NA, 2014, 2018)) 

library(data.table)
# Create lookup table with regexes and years
lookup <- data.table(id = LETTERS[c(1,3:8,10)], newYear = 2013:2020)
# Make df a data.table  
setDT(df)
# Get the first letter of Code-column, to join on
df[, temp := substr(Code, 1, 1)]
# perform by-reference update join
df[is.na(Year), Year := lookup[df[is.na(Year), ], newYear, on = .(id = temp)]][]
# remove temp
df[, temp := NULL]
# Code Year
# 1: AX123 2013
# 2: CL199 2014
# 3: GH679 2018

Here is a different approach, using a lookup-table and an update join. Should perform pretty fast.

df <- data.frame(Code = c("AX123", "CL199", "GH679"), 
                 Year = c(NA, 2014, 2018)) 

library(data.table)
# Create lookup table with regexes and years
lookup <- data.table(id = LETTERS[c(1,3:8,10)], newYear = 2013:2020)
# Make df a data.table  
setDT(df)
# Get the first letter of Code-column, to join on
df[, temp := substr(Code, 1, 1)]
# perform by-reference update join
df[is.na(Year), Year := lookup[df[is.na(Year), ], newYear, on = .(id = temp)]][]
# remove temp
df[, temp := NULL]
# Code Year
# 1: AX123 2013
# 2: CL199 2014
# 3: GH679 2018
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文