仅使用 Mutate 和 Case_When 来填充 NA 行
我已经盯着看几个小时了,我不知道在哪里可以找到像这样简单的问题的答案,所以我希望这不是一个重复的问题。
我有一个大数据框(936848 x 12),其中一列是一个编码名称,我可以从中导出其他列的值,在本例中,制造年份是根据列代码的第一个字符得出的。
数据框的小样本:
df <- data.frame(Code = c("AX123", "CL199", "GH679"),
Year = c(NA, "2014", "2018"))
我只想仅在值丢失时才根据列 Code 更改“年份”列。我不想覆盖年份列中的现有值。
由于这还涉及识别代码中字符串中的第一个字母,因此我使用 case_when
和 startsWith
:
df <- df %>%
filter(is.na(Year)) %>%
mutate(Year = case_when(startsWith(Code, "A") ~ 2013,
startsWith(Code, "C") ~ 2014,
startsWith(Code, "D") ~ 2015,
startsWith(Code, "E") ~ 2016,
startsWith(Code, "F") ~ 2017,
startsWith(Code, "G") ~ 2018,
startsWith(Code, "H") ~ 2019,
startsWith(Code, "J") ~ 2020,
TRUE ~ NA_real_
))
这将给出以下结果:
Code Year
1 AX123 2013
我的问题是我编写此过滤器的方式输出数据框中所有非 NA 行。我想保持数据框不变,只填充 NA 行。
我正在考虑将其嵌套到 ifelse 函数中,仅当列为 NA 时才进行变异,但我对如何编写它感到困惑。
df <- df %>%
mutate(ifelse(is.na(Year),
case_when(startsWith(Code, "A") ~ 2013,
startsWith(Code, "C") ~ 2014,
startsWith(Code, "D") ~ 2015,
startsWith(Code, "E") ~ 2016,
startsWith(Code, "F") ~ 2017,
startsWith(Code, "G") ~ 2018,
startsWith(Code, "H") ~ 2019,
startsWith(Code, "J") ~ 2020,
TRUE ~ NA_real_
)), "")
这显然会给出这个错误
Error: Problem with `mutate()` input `..1`.
i `..1 = ifelse(...)`.
x argument "no" is missing, with no default
我有很多类似的任务,我需要使用 ifelse
、grepl
、substring
等来检测中的字符代码列并在其他列中填写缺失的 NA。但是,由于许多已填充值的行是由于不遵循编码名称约定的规则的异常所致,因此我不想覆盖它们。
I've been goggling for hours and I'm not sure where to find the answer for something simple like this, so I hope this is not a duplicate question.
I have a large data frame (936848 x 12) with one column is a coded name from which I can derive the value of other column, in this case the year of manufactured based on the first character of column Code.
Small sample of the data frame:
df <- data.frame(Code = c("AX123", "CL199", "GH679"),
Year = c(NA, "2014", "2018"))
I just want to mutate the column Year based on column Code only if the value is missing. I did not want to overwrite existing value in column Year.
Since this also involve identifying the first alphabet in the string in Code, I use case_when
and startsWith
:
df <- df %>%
filter(is.na(Year)) %>%
mutate(Year = case_when(startsWith(Code, "A") ~ 2013,
startsWith(Code, "C") ~ 2014,
startsWith(Code, "D") ~ 2015,
startsWith(Code, "E") ~ 2016,
startsWith(Code, "F") ~ 2017,
startsWith(Code, "G") ~ 2018,
startsWith(Code, "H") ~ 2019,
startsWith(Code, "J") ~ 2020,
TRUE ~ NA_real_
))
This will gives out this result:
Code Year
1 AX123 2013
My problem is the way I write this filter out all non NA row in the dataframe. I want to keep the dataframe as it is, only to fill the NA row.
I'm thinking of nesting this to ifelse function, to mutate only if the column is NA, but I'm confusing myself on how to write it.
df <- df %>%
mutate(ifelse(is.na(Year),
case_when(startsWith(Code, "A") ~ 2013,
startsWith(Code, "C") ~ 2014,
startsWith(Code, "D") ~ 2015,
startsWith(Code, "E") ~ 2016,
startsWith(Code, "F") ~ 2017,
startsWith(Code, "G") ~ 2018,
startsWith(Code, "H") ~ 2019,
startsWith(Code, "J") ~ 2020,
TRUE ~ NA_real_
)), "")
which will obviously give this error
Error: Problem with `mutate()` input `..1`.
i `..1 = ifelse(...)`.
x argument "no" is missing, with no default
I have a lot of similar tasks where I need to use ifelse
, grepl
, substring
etc to detect the character in the code column and fill missing NA in other column. But because a lot of the row that are already fill with values are due to the exception from rules that does not follow the coded name convention, I did not want to overwrite them.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
基本 R 替代方案:
给出以下结果:
A base R alternative:
which gives the following result:
这是一种替代方法:
replacement
pattern
来匹配ifelse
语句与str_detect
和匹配
Here is an alternative approach:
replacement
pattern
to matchifelse
statement withstr_detect
andmatch
你几乎明白了。
ifelse
需要 3 个参数:is.na()
)Year
)输出:
没有匹配字母的示例,如注释中所要求:
输出
You almost got it.
ifelse
requires 3 arguments:is.na()
)Year
)Output:
Example for no matching letter, as asked for in the comments:
Output
如果您只想操作数据帧的一部分,则可以在任何“<-”赋值的左侧部分索引其部分。
您可以通过数据框后面的括号
[]
来定义这些部分:有关索引的更多信息:
https://stats.oarc.ucla.edu/r/modules/subsetting -data/
在你的情况下,这可能是:
If you only want to manipulate a part of a dataframe, you can index parts of it in the left part of any "<-" assignment.
you can define these parts by brackets
[]
behind the dataframe:more on indexing:
https://stats.oarc.ucla.edu/r/modules/subsetting-data/
In your case this could be:
这是一种不同的方法,使用查找表和更新连接。应该执行得相当快。
Here is a different approach, using a lookup-table and an update join. Should perform pretty fast.