How do I replace NA values with zeros in an R dataframe?

Posted 2024-12-16 17:24:37 | 75 words | 2 views | 0 comments

I have a data frame and some columns have NA values.

How do I replace these NA values with zeroes?

Comments (29)

青衫负雪 2024-12-23 17:24:37

See my comment on @gsk3's answer. A simple example:

> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
> d
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   4  3 NA  3  7  6  6 10  6   5
2   9  8  9  5 10 NA  2  1  7   2
3   1  1  6  3  6 NA  1  4  1   6
4  NA  4 NA  7 10  2 NA  4  1   8
5   1  2  4 NA  2  6  2  6  7   4
6  NA  3 NA NA 10  2  1 10  8   4
7   4  4  9 10  9  8  9  4 10  NA
8   5  8  3  2  1  4  5  9  4   7
9   3  9 10  1  9  9 10  5  3   3
10  4  2  2  5 NA  9  7  2  5   5

> d[is.na(d)] <- 0

> d
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   4  3  0  3  7  6  6 10  6   5
2   9  8  9  5 10  0  2  1  7   2
3   1  1  6  3  6  0  1  4  1   6
4   0  4  0  7 10  2  0  4  1   8
5   1  2  4  0  2  6  2  6  7   4
6   0  3  0  0 10  2  1 10  8   4
7   4  4  9 10  9  8  9  4 10   0
8   5  8  3  2  1  4  5  9  4   7
9   3  9 10  1  9  9 10  5  3   3
10  4  2  2  5  0  9  7  2  5   5

There's no need to apply apply. =)

EDIT

You should also take a look at the norm package. It has a lot of nice features for missing data analysis. =)
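
A minimal sketch of the same one-line replacement restricted to numeric columns, for data frames that also hold character columns (where filling with 0 is rarely what you want). The data frame and the name num_cols are made up for illustration and do not come from the answer.

```r
# Hypothetical mixed-type data frame
d <- data.frame(
  id = c("a", "b", "c", "d"),
  x  = c(1, NA, 3, NA),
  y  = c(NA, 2, NA, 4),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric
num_cols <- vapply(d, is.numeric, logical(1))

# Same idea as d[is.na(d)] <- 0, applied to the numeric subset only
d[num_cols][is.na(d[num_cols])] <- 0

print(d)
```

The id column is left untouched, so any NA in non-numeric columns would still need separate handling.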

ゝ偶尔ゞ 2024-12-23 17:24:37

The dplyr hybridized options are now around 30% faster than the Base R subset reassigns. On a 100M datapoint dataframe mutate_all(~replace(., is.na(.), 0)) runs a half a second faster than the base R d[is.na(d)] <- 0 option. What one wants to avoid specifically is using an ifelse() or an if_else(). (The complete 600 trial analysis ran to over 4.5 hours mostly due to including these approaches.) Please see benchmark analyses below for the complete results.

If you are struggling with massive dataframes, data.table is the fastest option of all: 40% faster than the standard Base R approach. It also modifies the data in place, effectively allowing you to work with nearly twice as much of the data at once.


A clustering of other helpful tidyverse replacement approaches

Locationally:

  • index mutate_at(c(5:10), ~replace(., is.na(.), 0))
  • direct reference mutate_at(vars(var5:var10), ~replace(., is.na(.), 0))
  • fixed match mutate_at(vars(contains("1")), ~replace(., is.na(.), 0))
  • or in place of contains(), try ends_with() or starts_with()
  • pattern match mutate_at(vars(matches("\\d{2}")), ~replace(., is.na(.), 0))

Conditionally:
(change just single type and leave other types alone.)

  • integers mutate_if(is.integer, ~replace(., is.na(.), 0))
  • numbers mutate_if(is.numeric, ~replace(., is.na(.), 0))
  • strings mutate_if(is.character, ~replace(., is.na(.), 0))
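
For readers on dplyr 1.0.0 or later, the scoped variants listed above can be written with across(). This is a sketch under that assumption; the data frame and the result names by_type and by_name are hypothetical.

```r
library(dplyr)

# Hypothetical data: two numeric columns and one character column
df <- data.frame(var1 = c(1, NA, 3), var2 = c(NA, 5, 6), label = c("a", NA, "c"))

# Conditional selection: counterpart of mutate_if(is.numeric, ...)
by_type <- df %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), 0)))

# Name-based selection: counterpart of mutate_at(vars(starts_with("var")), ...)
by_name <- df %>%
  mutate(across(starts_with("var"), ~ replace(.x, is.na(.x), 0)))

print(by_type)
```

In both cases the character column label keeps its NA, because it is not selected.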

The Complete Analysis
Updated for dplyr 0.8.0: functions use the purrr-style ~ formula notation, replacing the deprecated funs() arguments.

Approaches tested:

# Base R: 
baseR.sbst.rssgn   <- function(x) { x[is.na(x)] <- 0; x }
baseR.replace      <- function(x) { replace(x, is.na(x), 0) }
baseR.for          <- function(x) {
    for (j in seq_len(ncol(x))) x[[j]][is.na(x[[j]])] <- 0
    x  # return the filled data frame
}

# tidyverse
## dplyr
dplyr_if_else      <- function(x) { mutate_all(x, ~if_else(is.na(.), 0, .)) }
dplyr_coalesce     <- function(x) { mutate_all(x, ~coalesce(., 0)) }

## tidyr
tidyr_replace_na   <- function(x) { replace_na(x, as.list(setNames(rep(0, 10), as.list(c(paste0("var", 1:10)))))) }

## hybrid 
hybrd.ifelse     <- function(x) { mutate_all(x, ~ifelse(is.na(.), 0, .)) }
hybrd.replace_na <- function(x) { mutate_all(x, ~replace_na(., 0)) }
hybrd.replace    <- function(x) { mutate_all(x, ~replace(., is.na(.), 0)) }
hybrd.rplc_at.idx<- function(x) { mutate_at(x, c(1:10), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.nse<- function(x) { mutate_at(x, vars(var1:var10), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.stw<- function(x) { mutate_at(x, vars(starts_with("var")), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.ctn<- function(x) { mutate_at(x, vars(contains("var")), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.mtc<- function(x) { mutate_at(x, vars(matches("\\d+")), ~replace(., is.na(.), 0)) }
hybrd.rplc_if    <- function(x) { mutate_if(x, is.numeric, ~replace(., is.na(.), 0)) }

# data.table   
library(data.table)
DT.for.set.nms   <- function(x) { for (j in names(x))
    set(x,which(is.na(x[[j]])),j,0) }
DT.for.set.sqln  <- function(x) { for (j in seq_len(ncol(x)))
    set(x,which(is.na(x[[j]])),j,0) }
DT.nafill        <- function(x) { nafill(x, fill=0)}
DT.setnafill     <- function(x) { setnafill(x, fill=0)}

The code for this analysis:

library(microbenchmark)
# 20% NA filled dataframe of 10 Million rows and 10 columns
set.seed(42) # to recreate the exact dataframe
dfN <- as.data.frame(matrix(sample(c(NA, as.numeric(1:4)), 1e7*10, replace = TRUE),
                            dimnames = list(NULL, paste0("var", 1:10)), 
                            ncol = 10))
# Running 600 trials with each replacement method 
# (the functions are executed locally - so that the original dataframe remains unmodified in all cases)
perf_results <- microbenchmark(
    hybrd.ifelse     = hybrd.ifelse(copy(dfN)),
    dplyr_if_else    = dplyr_if_else(copy(dfN)),
    hybrd.replace_na = hybrd.replace_na(copy(dfN)),
    baseR.sbst.rssgn = baseR.sbst.rssgn(copy(dfN)),
    baseR.replace    = baseR.replace(copy(dfN)),
    dplyr_coalesce   = dplyr_coalesce(copy(dfN)),
    tidyr_replace_na = tidyr_replace_na(copy(dfN)),
    hybrd.replace    = hybrd.replace(copy(dfN)),
    hybrd.rplc_at.ctn= hybrd.rplc_at.ctn(copy(dfN)),
    hybrd.rplc_at.nse= hybrd.rplc_at.nse(copy(dfN)),
    baseR.for        = baseR.for(copy(dfN)),
    hybrd.rplc_at.idx= hybrd.rplc_at.idx(copy(dfN)),
    DT.for.set.nms   = DT.for.set.nms(copy(dfN)),
    DT.for.set.sqln  = DT.for.set.sqln(copy(dfN)),
    times = 600L
)
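
Before timing anything, it can help to confirm that the competing approaches agree on the result. The sketch below does that for three base R variants on a much smaller frame; the function names fill_subset, fill_replace and fill_loop and the 1e4-row size are assumptions made for this illustration, not part of the benchmark.

```r
# Three base R variants, renamed to avoid clashing with the benchmark functions
fill_subset  <- function(x) { x[is.na(x)] <- 0; x }
fill_replace <- function(x) replace(x, is.na(x), 0)
fill_loop    <- function(x) {
  for (j in seq_along(x)) x[[j]][is.na(x[[j]])] <- 0
  x
}

# Same construction as dfN, but with 1e4 rows instead of 1e7 to keep it quick
set.seed(42)
small <- as.data.frame(matrix(sample(c(NA, as.numeric(1:4)), 1e4 * 10, replace = TRUE),
                              dimnames = list(NULL, paste0("var", 1:10)),
                              ncol = 10))

res_subset  <- fill_subset(small)
res_replace <- fill_replace(small)
res_loop    <- fill_loop(small)

# None of the results should contain NA, and all should agree with each other
no_na     <- !any(is.na(res_subset)) && !any(is.na(res_replace)) && !any(is.na(res_loop))
all_agree <- isTRUE(all.equal(res_subset, res_replace)) &&
             isTRUE(all.equal(res_subset, res_loop))
print(c(no_na = no_na, all_agree = all_agree))
```

Timings on a frame this small say little about the 100M-point case, so this is only a correctness check.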

Summary of Results

> print(perf_results)
Unit: milliseconds
              expr       min        lq     mean   median       uq      max neval
      hybrd.ifelse 6171.0439 6339.7046 6425.221 6407.397 6496.992 7052.851   600
     dplyr_if_else 3737.4954 3877.0983 3953.857 3946.024 4023.301 4539.428   600
  hybrd.replace_na 1497.8653 1706.1119 1748.464 1745.282 1789.804 2127.166   600
  baseR.sbst.rssgn 1480.5098 1686.1581 1730.006 1728.477 1772.951 2010.215   600
     baseR.replace 1457.4016 1681.5583 1725.481 1722.069 1766.916 2089.627   600
    dplyr_coalesce 1227.6150 1483.3520 1524.245 1519.454 1561.488 1996.859   600
  tidyr_replace_na 1248.3292 1473.1707 1521.889 1520.108 1570.382 1995.768   600
     hybrd.replace  913.1865 1197.3133 1233.336 1238.747 1276.141 1438.646   600
 hybrd.rplc_at.ctn  916.9339 1192.9885 1224.733 1227.628 1268.644 1466.085   600
 hybrd.rplc_at.nse  919.0270 1191.0541 1228.749 1228.635 1275.103 2882.040   600
         baseR.for  869.3169 1180.8311 1216.958 1224.407 1264.737 1459.726   600
 hybrd.rplc_at.idx  839.8915 1189.7465 1223.326 1228.329 1266.375 1565.794   600
    DT.for.set.nms  761.6086  915.8166 1015.457 1001.772 1106.315 1363.044   600
   DT.for.set.sqln  787.3535  918.8733 1017.812 1002.042 1122.474 1321.860   600

Boxplot of Results

library(ggplot2)
ggplot(perf_results, aes(x=expr, y=time/10^9)) +
    geom_boxplot() +
    xlab('Expression') +
    ylab('Elapsed Time (Seconds)') +
    scale_y_continuous(breaks = seq(0,7,1)) +
    coord_flip()

[Figure: Boxplot Comparison of Elapsed Time]

Color-coded Scatterplot of Trials (with y-axis on a log scale)

qplot(y=time/10^9, data=perf_results, colour=expr) + 
    labs(y = "log10 Scaled Elapsed Time per Trial (secs)", x = "Trial Number") +
    coord_cartesian(ylim = c(0.75, 7.5)) +
    scale_y_log10(breaks=c(0.75, 0.875, 1, 1.25, 1.5, 1.75, seq(2, 7.5)))

[Figure: Scatterplot of All Trial Times]

A note on the other high performers

When the datasets get larger, tidyr's replace_na has historically pulled out in front. With the current collection of 100M data points to run through, it performs almost exactly as well as a Base R for loop. I am curious to see what happens for different sized dataframes.

Additional examples for the mutate and summarize _at and _all function variants can be found here: https://rdrr.io/cran/dplyr/man/summarise_all.html
Additionally, I found helpful demonstrations and collections of examples here: https://blog.exploratory.io/dplyr-0-5-is-awesome-heres-why-be095fd4eb8a

Attributions and Appreciations

With special thanks to:

  • Tyler Rinker and Akrun for demonstrating microbenchmark.
  • alexis_laz for working on helping me understand the use of local(), and (with Frank's patient help, too) the role that silent coercion plays in speeding up many of these approaches.
  • ArthurYip for the poke to add the newer coalesce() function in and update the analysis.
  • Gregor for the nudge to figure out the data.table functions well enough to finally include them in the lineup.
  • Base R For loop: alexis_laz
  • data.table For Loops: Matt_Dowle
  • Roman for explaining what is.numeric() really tests.

(Of course, please reach over and give them upvotes, too if you find those approaches useful.)

Note on my use of numerics: if you do have a pure integer dataset, all of your functions will run faster. Please see alexis_laz's work for more information. IRL, I can't recall encountering a data set containing more than 10-15% integers, so I am running these tests on fully numeric dataframes.

Hardware Used
3.9 GHz CPU with 24 GB RAM

抚笙 2024-12-23 17:24:37

For a single vector:

x <- c(1,2,NA,4,5)
x[is.na(x)] <- 0

For a data.frame, make a function out of the above, then apply it to the columns.

Please provide a reproducible example next time as detailed here:

How to make a great R reproducible example?
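
The data.frame step described above ("make a function, then apply it to the columns") could be sketched as follows. The function name na_to_zero and the example data are hypothetical.

```r
# The single-vector replacement, wrapped in a function
na_to_zero <- function(x) {
  x[is.na(x)] <- 0
  x
}

# Hypothetical example data
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Assigning into df[] replaces every column while keeping the data.frame structure
df[] <- lapply(df, na_to_zero)
print(df)
```

Using df[] on the left-hand side matters: a plain df <- lapply(...) would return a list rather than a data frame.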

べ映画 2024-12-23 17:24:37

dplyr example:

library(dplyr)

df1 <- df1 %>%
    mutate(myCol1 = if_else(is.na(myCol1), 0, myCol1))

Note: This works per selected column. If we need to do this for all columns, see @reidjax's answer using mutate_each.

甜是你 2024-12-23 17:24:37

It is also possible to use tidyr::replace_na.

    library(dplyr)
    library(tidyr)
    df <- df %>% mutate_all(~replace_na(., 0))  # funs() is deprecated as of dplyr 0.8.0

Edit (dplyr > 1.0.0):

df %>% mutate(across(everything(), .fns = ~replace_na(.,0))) 
暖风昔人 2024-12-23 17:24:37

If we are trying to replace NAs when exporting, for example when writing to csv, then we can use:

  write.csv(data, "data.csv", na = "0")
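
A small round-trip sketch of the export approach, writing to a temporary file rather than "data.csv". The data frame and the names path and exported are made up for the example.

```r
# Hypothetical data with one missing score
data <- data.frame(id = 1:3, score = c(10.5, NA, 7))

# Write NA as "0" in the file only; the in-memory data frame is unchanged
path <- tempfile(fileext = ".csv")
write.csv(data, path, na = "0", row.names = FALSE)

# Read the file back to inspect what was exported
exported <- read.csv(path)
print(exported)
print(data)
```

Note that the na argument applies to every column, so missing values in character columns are also written as "0".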
童话里做英雄 2024-12-23 17:24:37

I know the question is already answered, but doing it this way might be more useful to some:

Define this function:

na.zero <- function (x) {
    x[is.na(x)] <- 0
    return(x)
}

Now whenever you need to convert NAs in a vector to zeros you can do:

na.zero(some.vector)
失而复得 2024-12-23 17:24:37

With dplyr 0.5.0, you can use the coalesce function, which can be easily integrated into a %>% pipeline by doing coalesce(vec, 0). This replaces all NAs in vec with 0:

Say we have a data frame with NAs:

library(dplyr)
df <- data.frame(v = c(1, 2, 3, NA, 5, 6, 8))

df
#    v
# 1  1
# 2  2
# 3  3
# 4 NA
# 5  5
# 6  6
# 7  8

df %>% mutate(v = coalesce(v, 0))
#   v
# 1 1
# 2 2
# 3 3
# 4 0
# 5 5
# 6 6
# 7 8
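
coalesce() also accepts more than one fallback, taking the first non-missing value at each position. A sketch with hypothetical column names (primary, backup, value):

```r
library(dplyr)

# Hypothetical columns: a primary measurement and a backup measurement
df <- data.frame(primary = c(1, NA, NA, 4), backup = c(NA, 2, NA, 40))

# Take primary where present, else backup, else 0
result <- df %>% mutate(value = coalesce(primary, backup, 0))
print(result)
```

With a single fallback of 0 this reduces to the coalesce(v, 0) call shown above.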
许久 2024-12-23 17:24:37

A more general approach is to use replace() on a matrix or vector to replace NA with 0.

For example:

> x <- c(1,2,NA,NA,1,1)
> x1 <- replace(x,is.na(x),0)
> x1
[1] 1 2 0 0 1 1

This is also an alternative to using ifelse() in dplyr:

library(dplyr)
df = data.frame(col = c(1,2,NA,NA,1,1))
df <- df %>%
   mutate(col = replace(col,is.na(col),0))
肥爪爪 2024-12-23 17:24:37

To replace all NAs in a dataframe you can use:

df %>% replace(is.na(.), 0)

酒几许 2024-12-23 17:24:37

Would've commented on @ianmunoz's post but I don't have enough reputation. You can combine dplyr's mutate_each and replace to take care of the NA to 0 replacement. Using the dataframe from @aL3xa's answer...

> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
> d

    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   4  8  1  9  6  9 NA  8  9   8
2   8  3  6  8  2  1 NA NA  6   3
3   6  6  3 NA  2 NA NA  5  7   7
4  10  6  1  1  7  9  1 10  3  10
5  10  6  7 10 10  3  2  5  4   6
6   2  4  1  5  7 NA NA  8  4   4
7   7  2  3  1  4 10 NA  8  7   7
8   9  5  8 10  5  3  5  8  3   2
9   9  1  8  7  6  5 NA NA  6   7
10  6 10  8  7  1  1  2  2  5   7

> d %>% mutate_each( funs_( interp( ~replace(., is.na(.),0) ) ) )

    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   4  8  1  9  6  9  0  8  9   8
2   8  3  6  8  2  1  0  0  6   3
3   6  6  3  0  2  0  0  5  7   7
4  10  6  1  1  7  9  1 10  3  10
5  10  6  7 10 10  3  2  5  4   6
6   2  4  1  5  7  0  0  8  4   4
7   7  2  3  1  4 10  0  8  7   7
8   9  5  8 10  5  3  5  8  3   2
9   9  1  8  7  6  5  0  0  6   7
10  6 10  8  7  1  1  2  2  5   7

We're using standard evaluation (SE) here which is why we need the underscore on "funs_." We also use lazyeval's interp/~ and the . references "everything we are working with", i.e. the data frame. Now there are zeros!

触ぅ动初心 2024-12-23 17:24:37

Another example using imputeTS package:

library(imputeTS)
na.replace(yourDataframe, 0)
疧_╮線 2024-12-23 17:24:37

If you want to replace NAs in factor variables, this might be useful:

n <- length(levels(data.vector))+1

data.vector <- as.numeric(data.vector)
data.vector[is.na(data.vector)] <- n
data.vector <- as.factor(data.vector)
levels(data.vector) <- c("level1","level2",...,"leveln", "NAlevel") 

It transforms a factor vector into a numeric vector and adds another artificial numeric factor level, which is then transformed back to a factor vector with one extra "NA level" of your choice.
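
An alternative sketch that avoids the numeric round trip and keeps the existing level labels: add the new level first, then assign it to the NA positions. The example factor and the label "NAlevel" are placeholders.

```r
# Hypothetical factor with missing values
data.vector <- factor(c("low", NA, "high", "low", NA))

# Add the new level first; assigning an unknown level would produce NA with a warning
levels(data.vector) <- c(levels(data.vector), "NAlevel")

# Now the NA positions can take the new level
data.vector[is.na(data.vector)] <- "NAlevel"

print(table(data.vector))
```

The original labels are preserved, so there is no need to retype them in a levels() call afterwards.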

临走之时 2024-12-23 17:24:37

data.table provides dedicated functions for this purpose: nafill and setnafill.
Whenever available, they distribute columns to be computed on multiple threads.

library(data.table)

ans_df <- nafill(df, fill=0)

# or even faster, in-place
setnafill(df, fill=0)
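
A sketch of the in-place variant on a table that also contains a character column. It assumes nafill/setnafill are meant for numeric columns, so the character column is excluded through the cols argument; the table and column names are made up.

```r
library(data.table)

# Hypothetical table: two numeric columns and one character column
dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6), id = c("x", "y", "z"))

# Fill only the numeric columns, by reference (dt itself is modified)
setnafill(dt, type = "const", fill = 0, cols = c("a", "b"))
print(dt)
```

Because the update happens by reference, there is no need to assign the result back to dt.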
逆光飞翔i 2024-12-23 17:24:37

No need to use any library.

df <- data.frame(a=c(1,3,5,NA))

df$a[is.na(df$a)] <- 0

df
隔岸观火 2024-12-23 17:24:37

dplyr >= 1.0.0

In newer versions of dplyr:

across() supersedes the family of "scoped variants" like summarise_at(), summarise_if(), and summarise_all().

df <- data.frame(a = c(LETTERS[1:3], NA), b = c(NA, 1:3))

library(tidyverse)

df %>% 
  mutate(across(where(anyNA), ~ replace_na(., 0)))

  a b
1 A 0
2 B 1
3 C 2
4 0 3

This code will coerce 0 to be character in the first column. To replace NA based on column type you can use a purrr-like formula in where:

df %>% 
  mutate(across(where(~ anyNA(.) & is.character(.)), ~ replace_na(., "0")))
半透明的墙 2024-12-23 17:24:37

You can use replace()

For example:

> x <- c(-1,0,1,0,NA,0,1,1)
> x1 <- replace(x,5,1)
> x1
[1] -1  0  1  0  1  0  1  1

> x1 <- replace(x,5,mean(x,na.rm=T))
> x1
[1] -1.00  0.00  1.00  0.00  0.29  0.00 1.00  1.00
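
The examples above pass the position of the NA (5) by hand. A sketch of the same calls with is.na() locating the missing values instead, so it works wherever the NAs are; the names x_zero and x_mean are hypothetical.

```r
x <- c(-1, 0, 1, 0, NA, 0, 1, 1)

# Locate the NA positions with is.na() instead of writing the index by hand
x_zero <- replace(x, is.na(x), 0)

# The same pattern with the mean of the observed values as the replacement
x_mean <- replace(x, is.na(x), mean(x, na.rm = TRUE))

print(x_zero)
print(round(x_mean, 2))
```

replace() returns a modified copy, so x itself still contains the NA afterwards.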
浴红衣 2024-12-23 17:24:37

The cleaner package has an na_replace() generic that by default replaces NA in numeric values with zeroes, in logicals with FALSE, in dates with today's date, etc.:

library(dplyr)
library(cleaner)

starwars %>% na_replace()
na_replace(starwars)

It even supports vectorised replacements:

mtcars[1:6, c("mpg", "hp")] <- NA
na_replace(mtcars, mpg, hp, replacement = c(999, 123))

Documentation: https://msberends.github.io/cleaner/reference/na_replace.html

好多鱼好多余 2024-12-23 17:24:37

Another dplyr pipe-compatible option, using the tidyr method replace_na, that works for several columns:

require(dplyr)
require(tidyr)

m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d <- as.data.frame(m)

myList <- setNames(lapply(vector("list", ncol(d)), function(x) x <- 0), names(d))

df <- d %>% replace_na(myList)

You can easily restrict to e.g. numeric columns:

d$str <- c("string", NA)

myList <- myList[sapply(d, is.numeric)]

df <- d %>% replace_na(myList)
与风相奔跑 2024-12-23 17:24:37

This simple function extracted from Datacamp could help:

replace_missings <- function(x, replacement) {
  is_miss <- is.na(x)
  x[is_miss] <- replacement

  message(sum(is_miss), " missings replaced by the value ", replacement)
  x
}

Then

replace_missings(df, replacement = 0)
满栀 2024-12-23 17:24:37

An easy way to write it is with if_na from hablar:

library(dplyr)
library(hablar)

df <- tibble(a = c(1, 2, 3, NA, 5, 6, 8))

df %>% 
  mutate(a = if_na(a, 0))

which returns:

      a
  <dbl>
1     1
2     2
3     3
4     0
5     5
6     6
7     8
十级心震 2024-12-23 17:24:37

Another option is to use collapse::replace_NA. By default, replace_NA replaces NAs with 0s.

library(collapse)
replace_NA(df)

For only some columns:

replace_NA(df, cols = c("V1", "V5")) 
#Alternatively, one can use a function, indices or a logical vector to select the columns

It's also faster than any other answer (see this answer for a comparison):

set.seed(42) # to recreate the exact dataframe
dfN <- as.data.frame(matrix(sample(c(NA, as.numeric(1:4)), 1e7*10, replace = TRUE),
                            dimnames = list(NULL, paste0("var", 1:10)), 
                            ncol = 10))

library(microbenchmark)
microbenchmark(collapse = replace_NA(dfN))

# Unit: milliseconds
#      expr      min      lq     mean  median       uq     max neval
#  collapse 508.9198 621.405 751.3413 714.835 859.5437 1298.69   100
过期情话 2024-12-23 17:24:37

If you want to store the result under a new column name after replacing the NAs in a specific column (column V3 in this case), you can also do it like this:

my.data.frame$the.new.column.name <- ifelse(is.na(my.data.frame$V3), 0, my.data.frame$V3)  # 0 where V3 is NA, the V3 value otherwise
南汐寒笙箫 2024-12-23 17:24:37

I want to add another solution, which uses the popular Hmisc package.

library(Hmisc)
data(airquality)
# imputing with 0 - all columns
# although my favorite one for simple imputations is Hmisc::impute(x, "random")
> dd <- data.frame(Map(function(x) Hmisc::impute(x, 0), airquality))
> str(dd[[1]])
 'impute' Named num [1:153] 41 36 12 18 0 28 23 19 8 0 ...
 - attr(*, "names")= chr [1:153] "1" "2" "3" "4" ...
 - attr(*, "imputed")= int [1:37] 5 10 25 26 27 32 33 34 35 36 ...
> dd[[1]][1:10]
  1   2   3   4   5   6   7   8   9  10 
 41  36  12  18  0*  28  23  19   8  0* 

As can be seen, all imputation metadata are stored as attributes, so they can be used later.

初相遇 2024-12-23 17:24:37

This is not exactly a new solution, but I like to write inline lambdas that handle things that I can't quite get packages to do. In this case,

df %>%
   (function(x) { x[is.na(x)] <- 0; return(x) })

Because R never "passes by object" the way you might see in Python, this solution does not modify the original variable df; assign the result (for example, df <- df %>% ...) if you want to keep it. It does much the same as most of the other solutions, but with far less need for intricate knowledge of particular packages.

Note the parens around the function definition! Although they seem a bit redundant to me, since the function body is already wrapped in curly braces, magrittr requires inline functions to be defined within parens.
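
As a minimal sketch of the point about df staying unmodified: the data frame and the name replace_na_zero below are made up for illustration, and a plain function call is used instead of the pipe so that it runs in base R without magrittr.

```r
# Hypothetical example data, not taken from the answer above
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Same body as the inline lambda, given an illustrative name
replace_na_zero <- function(x) {
  x[is.na(x)] <- 0  # changes only the function's local copy of x
  x
}

cleaned <- replace_na_zero(df)

anyNA(df)       # the original still contains NA values
anyNA(cleaned)  # the returned copy does not
```

The result has to be captured in a variable (here cleaned), otherwise the replacement is simply printed and lost.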

秋叶绚丽 2024-12-23 17:24:37

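This is a more flexible solution. It works no matter how large your data frame is, and the replacement does not have to be 0: it can be "zero" or anything else.

library(dplyr) # make sure the dplyr version is >= 1.0.0, which introduced across()

df %>%
    # replace() fills the NA values in every column; na_if(x, 0) would do the opposite (0 -> NA)
    # to fill with something else, e.g. "zero", swap `0` for that value
    mutate(across(everything(), ~ replace(.x, is.na(.x), 0)))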
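
For comparison, here is a dependency-free sketch of column-wise replacement in base R. The data frame, the helper name fill_na, and the choice of "zero" as the fill for character columns are all assumptions made for illustration; they are not part of the answer above.

```r
# Hypothetical example data with one numeric and one character column
df <- data.frame(n = c(1, NA, 3), s = c("a", NA, "c"),
                 stringsAsFactors = FALSE)

# Illustrative helper: replace NA in a single vector with `value`
fill_na <- function(x, value) replace(x, is.na(x), value)

# df[] <- keeps the data frame structure while every column is rewritten;
# numeric columns get 0, everything else gets the string "zero"
df[] <- lapply(df, function(col) {
  fill_na(col, if (is.numeric(col)) 0 else "zero")
})

df
```

Picking the fill value per column type avoids silently coercing a numeric column to character.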
伤痕我心 2024-12-23 17:24:37

Another option is to use sapply to replace all NA values with zeros. Here is some reproducible code (data from @aL3xa):

set.seed(7) # for reproducibility
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d <- as.data.frame(m)
d
#>    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> 1   9  7  5  5  7  7  4  6  6   7
#> 2   2  5 10  7  8  9  8  8  1   8
#> 3   6  7  4 10  4  9  6  8 NA  10
#> 4   1 10  3  7  5  7  7  7 NA   8
#> 5   9  9 10 NA  7 10  1  5 NA   5
#> 6   5  2  5 10  8  1  1  5 10   3
#> 7   7  3  9  3  1  6  7  3  1  10
#> 8   7  7  6  8  4  4  5 NA  8   7
#> 9   2  1  1  2  7  5  9 10  9   3
#> 10  7  5  3  4  9  2  7  6 NA   5
d[sapply(d, \(x) is.na(x))] <- 0
d
#>    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> 1   9  7  5  5  7  7  4  6  6   7
#> 2   2  5 10  7  8  9  8  8  1   8
#> 3   6  7  4 10  4  9  6  8  0  10
#> 4   1 10  3  7  5  7  7  7  0   8
#> 5   9  9 10  0  7 10  1  5  0   5
#> 6   5  2  5 10  8  1  1  5 10   3
#> 7   7  3  9  3  1  6  7  3  1  10
#> 8   7  7  6  8  4  4  5  0  8   7
#> 9   2  1  1  2  7  5  9 10  9   3
#> 10  7  5  3  4  9  2  7  6  0   5

Created on 2023-01-15 with reprex v2.0.2


Please note: Since R 4.1.0 you can use \(x) instead of function(x).
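
To see why the sapply version works, here is a small sketch on made-up data (the data frame below is an assumption for illustration, not the one from the reprex). With more than one row, sapply returns a logical matrix of the same shape as the data frame, which marks the same cells as is.na(d) does.

```r
# Hypothetical example data
d <- data.frame(V1 = c(1, NA, 3), V2 = c(NA, 5, 6))

# One logical vector per column, simplified by sapply into a matrix
mask <- sapply(d, function(x) is.na(x))

dim(mask)               # same shape as d
all(mask == is.na(d))   # same cells flagged as by is.na() on the data frame

# Indexing with the logical matrix replaces exactly the flagged cells
d[mask] <- 0
d
```

function(x) is used here instead of \(x) so that the sketch also runs on R versions older than 4.1.0.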

陪我终i 2024-12-23 17:24:37

With a data.frame, it is not necessary to create a new column with mutate.

library(tidyverse)
k <- c(1, 2, 80, NA, NA, 51)
j <- c(NA, NA, 3, 31, 12, NA)

df <- data.frame(k, j) %>%
   replace_na(list(j = 0)) # convert only column j, for example

Result:

   k  j
1  1  0
2  2  0
3 80  3
4 NA 31
5 NA 12
6 51  0
烟织青萝梦 2024-12-23 17:24:37

I have used this personally and it works fine:

players_wd$APPROVED_WD[is.na(players_wd$APPROVED_WD)] <- 0
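
The same base R idiom extends to several columns at once. The sketch below uses a made-up data frame and column names (df, a, b, c, and cols are all assumptions for illustration) and leaves the columns that are not listed untouched.

```r
# Hypothetical example data
df <- data.frame(a = c(1, NA), b = c(NA, 2), c = c(NA, 3))

# Columns in which NA should become 0
cols <- c("a", "b")

# Select the columns, then overwrite only their NA cells
df[cols][is.na(df[cols])] <- 0

df
```

Column c keeps its NA, which is handy when a missing value means something different there.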