How do I replace NA values with zeros in an R data frame?
I have a data frame and some columns have `NA` values.

How do I replace these `NA` values with zeroes?
See my comment on @gsk3's answer. A simple example:
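The original snippet is not preserved here, so the following is only a minimal sketch of such an example. The data frame `d` is made up for illustration and assumed to be all numeric.

```r
# Illustrative data: a 10 x 10 data frame containing some NA values
set.seed(1)
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d <- as.data.frame(m)

# is.na(d) returns a logical matrix that can index the data frame directly
d[is.na(d)] <- 0

anyNA(d)  # FALSE
```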
There's no need to apply `apply`. =)

EDIT

You should also take a look at the `norm` package. It has a lot of nice features for missing data analysis. =)
The dplyr hybridized options are now around 30% faster than the Base R subset reassigns. On a 100M datapoint dataframe, `mutate_all(~replace(., is.na(.), 0))` runs half a second faster than the base R `d[is.na(d)] <- 0` option. What one wants to avoid specifically is using an `ifelse()` or an `if_else()`. (The complete 600 trial analysis ran to over 4.5 hours, mostly due to including these approaches.) Please see the benchmark analyses below for the complete results.

If you are struggling with massive dataframes, `data.table` is the fastest option of all: 40% faster than the standard Base R approach. It also modifies the data in place, effectively allowing you to work with nearly twice as much of the data at once.

A clustering of other helpful tidyverse replacement approaches
Locationally:
`mutate_at(c(5:10), ~replace(., is.na(.), 0))`
`mutate_at(vars(var5:var10), ~replace(., is.na(.), 0))`
`mutate_at(vars(contains("1")), ~replace(., is.na(.), 0))` (besides `contains()`, try `ends_with()`, `starts_with()`)
`mutate_at(vars(matches("\\d{2}")), ~replace(., is.na(.), 0))`

Conditionally (change just a single type and leave other types alone):
`mutate_if(is.integer, ~replace(., is.na(.), 0))`
`mutate_if(is.numeric, ~replace(., is.na(.), 0))`
`mutate_if(is.character, ~replace(., is.na(.), 0))`
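For readers on dplyr 1.0.0 or later, where the scoped `mutate_at()`, `mutate_if()` and `mutate_all()` verbs are superseded by `across()`, the same selections could be sketched as below. The tibble `df`, its column names and the helper `zero_na()` are made up for illustration and were not part of the benchmark.

```r
library(dplyr)

# Hypothetical data for illustration only
df <- tibble(
  id   = c("a", "b", "c"),
  var1 = c(1, NA, 3),
  var2 = c(NA, 5, 6),
  n    = c(1L, NA, 3L)
)

# Hypothetical helper: same replace() idiom as in the list above
zero_na <- function(x) replace(x, is.na(x), 0)

# By position or by name (in place of mutate_at)
by_pos  <- df %>% mutate(across(2:3, zero_na))
by_name <- df %>% mutate(across(c(var1, var2), zero_na))

# By partial match on the column name
by_match <- df %>% mutate(across(contains("var"), zero_na))

# By type (in place of mutate_if); the integer column becomes double
# because the replacement value 0 is a double
by_type <- df %>% mutate(across(where(is.integer), zero_na))
```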
## The Complete Analysis

Updated for dplyr 0.8.0: functions use purrr-format `~` symbols, replacing the deprecated `funs()` arguments.

Approaches tested:
The code for this analysis:
Summary of Results
Boxplot of Results
Color-coded Scatterplot of Trials (with y-axis on a log scale)
A note on the other high performers

When the datasets get larger, tidyr's `replace_na` had historically pulled out in front. With the current collection of 100M data points to run through, it performs almost exactly as well as a Base R For Loop. I am curious to see what happens for different sized dataframes.

Additional examples for the `mutate` and `summarize` `_at` and `_all` function variants can be found here: https://rdrr.io/cran/dplyr/man/summarise_all.html

Additionally, I found helpful demonstrations and collections of examples here: https://blog.exploratory.io/dplyr-0-5-is-awesome-heres-why-be095fd4eb8a
Attributions and Appreciations

With special thanks to:
… `local()`, and (with Frank's patient help, too) the role that silent coercion plays in speeding up many of these approaches.
… `coalesce()` function in and update the analysis.
… `data.table` functions well enough to finally include them in the lineup.
… `is.numeric()` really tests.

(Of course, please reach over and give them upvotes, too, if you find those approaches useful.)
Note on my use of Numerics: If you do have a pure integer dataset, all of your functions will run faster. Please see alexiz_laz's work for more information. IRL, I can't recall encountering a data set containing more than 10-15% integers, so I am running these tests on fully numeric dataframes.
Hardware Used
3.9 GHz CPU with 24 GB RAM
For a single vector:
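The snippet itself is missing from this copy; a minimal sketch of the usual base R idiom for a vector, with a made-up `x`, would be:

```r
# Hypothetical example vector
x <- c(1, NA, 3, NA)

# Logical indexing: assign 0 wherever is.na(x) is TRUE
x[is.na(x)] <- 0

x  # 1 0 3 0
```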
For a data.frame, make a function out of the above, then `apply` it to the columns.

Please provide a reproducible example next time as detailed here: How to make a great R reproducible example?
dplyr example:
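The code for this answer did not survive extraction. One way such a per-column replacement might look is sketched below, where `df` and `myCol1` are placeholder names and `if_else()` is an assumed choice of function:

```r
library(dplyr)

# Placeholder data; only myCol1 is treated
df <- data.frame(myCol1 = c(1, NA, 3), myCol2 = c(NA, "b", "c"))

# if_else() is type-strict, so the replacement 0 must match the column type (double)
df <- df %>%
  mutate(myCol1 = if_else(is.na(myCol1), 0, myCol1))
```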
Note: This works per selected column. If we need to do this for all columns, see @reidjax's answer using `mutate_each`.
It is also possible to use `tidyr::replace_na`.

Edit (dplyr > 1.0.0):
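A sketch of how `replace_na()` can be combined with `across()` in dplyr 1.0.0 or later, assuming an all-numeric tibble `df` invented for this example:

```r
library(dplyr)
library(tidyr)

# Hypothetical all-numeric data
df <- tibble(a = c(1, NA, 3), b = c(NA, 2, NA))

# Apply replace_na() to every column
out <- df %>%
  mutate(across(everything(), ~ replace_na(.x, 0)))
```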
If we are trying to replace `NA`s when exporting, for example when writing to csv, then we can use:
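The original call is not preserved. A sketch using the `na` argument of `write.csv()`, with a made-up data frame and a temporary file path:

```r
# Hypothetical data
df <- data.frame(a = c(1, NA, 3), b = c(NA, 2, NA))

# Write NA as the string "0" in the output file; df itself is unchanged
path <- tempfile(fileext = ".csv")
write.csv(df, path, na = "0", row.names = FALSE)

readLines(path)
```

Note that this only affects the exported file, so the data frame in memory still contains its `NA` values.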
I know the question is already answered, but doing it this way might be more useful to some:

Define this function:

Now whenever you need to convert NAs in a vector to zeros you can do:
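Neither snippet survived in this copy. A sketch of what such a helper and its use might look like follows; the name `na.zero` is an assumption chosen for illustration.

```r
# Hypothetical helper: return a copy of x with NA replaced by 0
na.zero <- function(x) {
  x[is.na(x)] <- 0
  x
}

some.vector <- c(4, NA, 7)
na.zero(some.vector)  # 4 0 7
```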
With `dplyr` 0.5.0, you can use the `coalesce` function, which can be easily integrated into a `%>%` pipeline by doing `coalesce(vec, 0)`. This replaces all NAs in `vec` with 0:

Say we have a data frame with `NA`s:
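As a sketch with made-up data (the original example is not preserved), `coalesce()` can be applied to a vector or inside `mutate()`:

```r
library(dplyr)

# On a plain vector
vec <- c(1, NA, 3)
coalesce(vec, 0)  # 1 0 3

# Inside a pipeline, column by column (hypothetical columns x and y)
df <- tibble(x = c(1, NA, 3), y = c(NA, NA, 6))
out <- df %>%
  mutate(x = coalesce(x, 0), y = coalesce(y, 0))
```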
A more general approach is to use `replace()` on a matrix or vector to replace `NA` with `0`.

For example:
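A minimal sketch with made-up inputs, since the original example is missing:

```r
# Vector
x <- c(2, NA, 5)
replace(x, is.na(x), 0)  # 2 0 5

# Matrix: dimensions are preserved
m <- matrix(c(1, NA, 3, NA), nrow = 2)
m0 <- replace(m, is.na(m), 0)
```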
This is also an alternative to using `ifelse()` in `dplyr`.
To replace all NAs in a dataframe you can use:
df %>% replace(is.na(.), 0)
Would've commented on @ianmunoz's post but I don't have enough reputation. You can combine `dplyr`'s `mutate_each` and `replace` to take care of the `NA` to `0` replacement. Using the dataframe from @aL3xa's answer...

We're using standard evaluation (SE) here, which is why we need the underscore on `funs_`. We also use `lazyeval`'s `interp`/`~`, and the `.` references "everything we are working with", i.e. the data frame. Now there are zeros!
Another example using the imputeTS package:
If you want to replace NAs in factor variables, this might be useful:

It transforms a factor vector into a numeric vector and adds another artificial numeric factor level, which is then transformed back to a factor vector with one extra "NA level" of your choice.
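The steps described above could be sketched as follows. The factor `f` and the label `"missing"` are invented for illustration, and integer codes stand in for the numeric vector.

```r
# Hypothetical factor with one NA
f <- factor(c("low", "high", NA, "low"))
old_levels <- levels(f)  # "high" "low"

# Factor -> integer codes; NA stays NA
codes <- as.integer(f)

# Give the NAs an extra, artificial code
codes[is.na(codes)] <- length(old_levels) + 1L

# Codes -> factor again, with one extra level for the former NAs
f_filled <- factor(codes,
                   levels = seq_len(length(old_levels) + 1L),
                   labels = c(old_levels, "missing"))
```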
Dedicated functions for that purpose, `nafill` and `setnafill`, are in `data.table`. Whenever available, they distribute columns to be computed on multiple threads.
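A sketch of both functions on made-up numeric data. Note that `setnafill()` modifies the table by reference, and this example assumes all columns are of type double.

```r
library(data.table)

# nafill() returns a filled copy of a vector
x <- c(1, NA, 3)
nafill(x, type = "const", fill = 0)  # 1 0 3

# setnafill() fills the columns of a data.table in place
dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, NA))
setnafill(dt, type = "const", fill = 0)
dt
```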
No need to use any library.
dplyr >= 1.0.0

In newer versions of `dplyr`:

This code will coerce `0` to be character in the first column. To replace `NA` based on column type, you can use a purrr-like formula in `where`:
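A sketch of the type-based variant, assuming a tibble `df` (invented here) whose first column is character:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one character column, two numeric columns
df <- tibble(name = c("a", NA, "c"), x = c(1, NA, 3), y = c(NA, 2, 3))

# Only numeric columns are touched; the character column keeps its NA
out <- df %>%
  mutate(across(where(~ is.numeric(.x)), ~ replace_na(.x, 0)))
```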
You can use `replace()`. For example:
The `cleaner` package has an `na_replace()` generic that by default replaces numeric values with zeroes, logicals with `FALSE`, dates with today, etc.:

It even supports vectorised replacements:

Documentation: https://msberends.github.io/cleaner/reference/na_replace.html
Another `dplyr` pipe-compatible option, using the `tidyr` function `replace_na`, that works for several columns:

You can easily restrict to e.g. numeric columns:
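Both variants might be sketched as below. The data frame `df` and its column names are made up, and building the replacement list with `setNames()` is just one possible way to restrict it to numeric columns.

```r
library(tidyr)

# Hypothetical data with two numeric columns and one character column
df <- data.frame(a = c(1, NA), b = c(NA, 2), c = c("x", NA))

# Several columns at once: a named list of replacement values
out1 <- replace_na(df, list(a = 0, b = 0))

# Restricting to numeric columns by building that list programmatically
num_cols <- names(df)[sapply(df, is.numeric)]
out2 <- replace_na(df, setNames(as.list(rep(0, length(num_cols))), num_cols))
```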
This simple function extracted from Datacamp could help:
Then
An easy way to write it is with `if_na` from `hablar`:

which returns:
Another option is to use `collapse::replace_NA`. By default, `replace_NA` replaces NAs with 0s.

For only some columns:

It's also faster than any other answer (see this answer for a comparison):
If you want to assign a new name after changing the NAs in a specific column (in this case, column V3), you can also do it like this:
I want to add another solution, which uses the popular `Hmisc` package.

Note that all imputation metadata are stored as attributes, so they can be used later.
This is not exactly a new solution, but I like to write inline lambdas that handle things that I can't quite get packages to do. In this case,

Because R does not ever "pass by object" like you might see in Python, this solution does not modify the original variable `df`, and so will do quite the same as most of the other solutions, but with much less need for intricate knowledge of particular packages.

Note the parens around the function definition! Though it seems a bit redundant to me, since the function definition is surrounded in curly braces, it is required that inline functions are defined within parens for `magrittr`.
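The inline lambda described above is not preserved in this copy. A sketch of what it might look like, with a made-up `df`:

```r
library(magrittr)

# Hypothetical data
df <- data.frame(a = c(1, NA), b = c(NA, 2))

# The anonymous function is wrapped in parentheses so that %>% calls it on df
result <- df %>% (function(x) {
  x[is.na(x)] <- 0
  x
})

anyNA(df)      # TRUE: the original is untouched
anyNA(result)  # FALSE
```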
This is a more flexible solution. It works no matter how large your data frame is, or whether zero is indicated by `0` or `zero` or whatsoever.
Another option is using `sapply` to replace all `NA` with zeros. Here is some reproducible code (data from @aL3xa):

Created on 2023-01-15 with reprex v2.0.2

Please note: Since R 4.1.0 you can use `\(x)` instead of `function(x)`.
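The reprex output is missing here. A sketch of the `sapply` approach on similar made-up data, assuming an all-numeric data frame:

```r
# Illustrative data in the style of @aL3xa's example
set.seed(7)
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d <- as.data.frame(m)

# sapply() returns a matrix here, so convert back to a data frame
d0 <- as.data.frame(sapply(d, function(x) replace(x, is.na(x), 0)))
```

On R 4.1.0 or later, `function(x)` in the call above can be written as `\(x)`.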
In a data.frame, it is not necessary to create a new column with `mutate`.

Result:
I used this personally and it works fine: