Replacing NAs with the latest non-NA value
In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using vectors (instead of a data.frame), is the following:
> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
I would like a function fill.NAs() that allows me to construct yy such that:
> yy
[1] NA  2  2  2  2  3  3  4  4  4
I need to repeat this operation for many (total ~1 TB) small data.frames (~30-50 MB each), where a row counts as NA if all of its entries are NA. What is a good way to approach the problem?
The ugly solution I cooked up uses this function:
last <- function (x){
  x[length(x)]
}
fill.NAs <- function(isNA){
  if (isNA[1] == 1) {
    isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs
                                              # can't be forward filled
  }
  isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
  isNA.pos[isNA.diff < 0] <- 0
  isNA.neg[isNA.diff > 0] <- 0
  which.isNA.neg <- which(as.logical(isNA.neg))
  if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
  which.isNA.pos <- which(as.logical(isNA.pos))
  which.isNA <- which(as.logical(isNA))
  if (length(which.isNA.neg)==length(which.isNA.pos)){
    replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)],
                       which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] -
                         which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
  } else {
    replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
  }
  replacement
}
The function fill.NAs is used as follows:
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
isNA <- as.numeric(is.na(y))
replacement <- fill.NAs(isNA)
if (length(replacement)){
  which.isNA <- which(as.logical(isNA))
  to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
  y[to.replace] <- y[replacement]
}
Output
> y
[1] NA 2 2 2 2 3 3 3 4 4 4
... which seems to work. But, man, is it ugly! Any suggestions?
I personally use this function. I do not know how fast or slow it is, but it does its job without having to use libraries.
If you want to apply this function to a data frame called df, then simply:
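A minimal sketch of what such a library-free function might look like. This is an assumption, not the answerer's own code: fill_forward and the sample df are made-up names, and a plain loop is just one way to do it.

```r
# Hypothetical helper: carry the last non-NA value forward with a plain loop.
# Leading NAs have nothing to copy from, so they stay NA.
fill_forward <- function(x) {
  if (length(x) < 2) return(x)
  for (i in 2:length(x)) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
fill_forward(y)

# Apply it to every column of a data frame called df (made-up sample data).
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c(NA, "x", NA, "y"),
                 stringsAsFactors = FALSE)
df[] <- lapply(df, fill_forward)
df
```

Assigning into df[] keeps the result a data frame instead of turning it into a plain list.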
I'm posting this here as it might be helpful for others with problems similar to the one asked.
The most recent tidyverse solution, using the vctrs package, can be combined with mutate to create a new column.
The 'filling direction' can also be changed to 'up'; you might also want to try "downup" or "updown".
Please note that this solution is still in the experimental life cycle stage, so the syntax might change.
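A short sketch of how this could look, assuming the function meant here is vctrs::vec_fill_missing() (it is named in a later answer in this thread). The data frame and the column names y_down and y_up are made up for illustration.

```r
library(dplyr)
library(vctrs)

df <- data.frame(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))

df <- df %>%
  mutate(
    # carry the last non-NA value forward; the leading NA stays NA
    y_down = vec_fill_missing(y, direction = "down"),
    # carry the next non-NA value backward; trailing NAs stay NA
    y_up   = vec_fill_missing(y, direction = "up")
  )
df
```

With "downup" the vector is filled down first and then up, so leading NAs get filled as well.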
You can use my function roll_na_fill(), which is optimised for data consisting of many groups.
Example benchmark
Created on 2023-11-12 with reprex v2.0.2
I tried the below:
nullIdx holds the indices at which masterData$RequiredColumn has a NULL/NA value.
In the next line we replace each of those entries with the value at the corresponding index minus 1, i.e. the last good value before each NULL/NA.
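A sketch of the idea described above, using made-up sample data. One caveat worth showing: a single vectorized pass only fixes an NA whose immediate predecessor is not NA, so the loop below (my assumption, not part of the answer) repeats the pass until runs of NAs are filled too.

```r
masterData <- data.frame(RequiredColumn = c(5, NA, 7, NA, NA, 9))  # made-up data

repeat {
  nullIdx <- which(is.na(masterData$RequiredColumn))
  nullIdx <- nullIdx[nullIdx > 1]   # a leading NA has no predecessor
  # keep only the NAs whose previous value is already known
  fillable <- nullIdx[!is.na(masterData$RequiredColumn[nullIdx - 1])]
  if (length(fillable) == 0) break
  masterData$RequiredColumn[fillable] <- masterData$RequiredColumn[fillable - 1]
}
masterData$RequiredColumn
```

The number of passes equals the length of the longest NA run, so this is only sensible when such runs are short.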
This worked for me, although I'm not sure whether it is more efficient than other suggestions.
Too late to the party, but here is a very concise and expandable answer for use with library(data.table), and therefore usable as dt[, SomeVariable := FunctionBellow, by = list(group)].
Another base R solution could be:
OUTPUT:
You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.
Here is the beginning of its usage example from the help page:
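For the vector from the question, a minimal illustration (not the help-page example itself) could look like this. The point to watch is the na.rm argument, which decides what happens to leading NAs.

```r
library(zoo)

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# By default (na.rm = TRUE) a leading NA is dropped, so the result is shorter.
na.locf(y)

# na.rm = FALSE keeps the leading NA and therefore the original length,
# which is what you want when filling a data.frame column.
na.locf(y, na.rm = FALSE)
```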
Sorry for digging up an old question.
I couldn't look up the function to do this job on the train, so I wrote one myself.
I was proud to find out that it's a tiny bit faster.
It's less flexible though.
But it plays nice with ave, which is what I needed.
Edit
As this became my most upvoted answer, I was reminded often that I don't use my own function, because I often need zoo's maxgap argument. Because zoo has some weird problems in edge cases when I use dplyr + dates that I couldn't debug, I came back to this today to improve my old function.
I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill is fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it's inflexible regarding the input's type (he tested edge cases incorrectly due to a misunderstanding of all.equal).
If you need maxgap, my function below is faster than zoo (and doesn't have the weird problems with dates).
I put up the documentation of my tests.
new function
I've also put the function in my formr package (Github only).
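A rough sketch of one way a base R helper of this kind can be written and combined with ave for group-wise filling. This is an assumption about the general approach (repeat each non-NA value up to the next one), not the author's function: fill_rep is a made-up name and there is no maxgap support here.

```r
# Hypothetical helper: repeat each non-NA value until the next non-NA value.
fill_rep <- function(x) {
  if (length(x) == 0) return(x)
  ind <- which(!is.na(x))                 # positions of the known values
  if (is.na(x[1])) ind <- c(1, ind)       # keep a leading NA as its own "value"
  # the gaps between consecutive positions say how often each value repeats
  rep(x[ind], times = diff(c(ind, length(x) + 1)))
}

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
fill_rep(y)

# Group-wise fill with ave(): values never leak from one group into the next.
g <- c("a", "a", "a", "b", "b", "b")
v <- c(1, NA, NA, NA, 5, NA)
ave(v, g, FUN = fill_rep)
```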
A data.table solution:
This approach could work with forward filling zeros as well:
This method becomes very useful on data at scale and where you would want to perform a forward fill by group(s), which is trivial with data.table. Just add the group(s) to the by clause prior to the cumsum logic.
The tidyr package (part of the tidyverse suite of packages) has a simple way to do that:
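A small sketch of the tidyr approach, assuming the function meant is tidyr::fill() (it is mentioned by name in another answer here). The data frame is made up for illustration.

```r
library(tidyr)

df <- data.frame(id = 1:10, y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))

filled    <- fill(df, y)                     # default .direction = "down"
filled_up <- fill(df, y, .direction = "up")  # fill backward instead
filled
```

fill() also respects dplyr::group_by(), so a group-wise fill needs no extra logic.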
You can use the data.table function nafill, available from data.table >= 1.12.3.
If your vector is a column in a data.table, you can also update it by reference with setnafill:
If you have NA in several columns...
...you can fill them by reference in one go:
Note that:
The functionality will most likely soon be extended; see the open issue "nafill, setnafill for character, factor and other types", where you also find a temporary workaround.
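A brief sketch of both calls with made-up data (the table dt and its columns a and b are only for illustration):

```r
library(data.table)

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
nafill(y, type = "locf")   # returns a new, forward-filled vector

dt <- data.table(a = c(1, NA, 3), b = c(NA, 2, NA))
# setnafill() modifies dt in place, so no assignment is needed.
setnafill(dt, type = "locf", cols = c("a", "b"))
dt
```

type = "nocb" fills in the other direction (next observation carried backward).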
Throwing my hat in:
Set up a basic sample and a benchmark:
And run some benchmarks:
Just in case:
Update
For a numeric vector, the function is a bit different:
Dealing with a big data volume, in order to be more efficient, we can use the data.table package.
This has worked for me:
Speed is reasonable too:
Having a leading NA is a bit of a wrinkle, but I find a very readable (and vectorized) way of doing LOCF when the leading term is not missing is:
na.omit(y)[cumsum(!is.na(y))]
A slightly less readable modification works in general:
c(NA, na.omit(y))[cumsum(!is.na(y))+1]
gives the desired output:
c(NA, 2, 2, 2, 2, 3, 3, 4, 4, 4)
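To see why the one-liner works, it may help to look at the intermediate values for the vector from the question. The names idx, vals and filled are just for this walk-through.

```r
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# Running count of non-NA values seen so far: 0 1 2 2 2 3 3 4 4 4.
# Each position now holds the rank of the last non-NA value before it.
idx <- cumsum(!is.na(y))

# The non-NA values, with an NA prepended for positions where idx is 0.
vals <- c(NA, na.omit(y))   # NA 2 2 3 4

# Shift by one to account for the prepended NA and look the values up.
filled <- vals[idx + 1]
filled
```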
Try this function. It does not require the zoo package:
Example:
There are a bunch of packages offering na.locf (NA Last Observation Carried Forward) functions:
xts - xts::na.locf
zoo - zoo::na.locf
imputeTS - imputeTS::na.locf
spacetime - spacetime::na.locf
And also other packages where this function is named differently.
Following up on Brandon Bertelsen's Rcpp contributions. For me, the NumericVector version didn't work: it only replaced the first NA. This is because the ina vector is only evaluated once, at the beginning of the function.
Instead, one can take the exact same approach as for the IntegerVector function. The following worked for me:
In case you need a CharacterVector version, the same basic approach also works:
Here is a modification of @AdamO's solution. This one runs faster, because it bypasses the na.omit function. This will overwrite the NA values in vector y (except for leading NAs).
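One way such a na.omit-free variant can be written, as a sketch (the name keep is made up, and this is not necessarily the exact code of the answer): a logical mask does the selection, and leading NAs are added to the mask so that they map onto themselves.

```r
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

keep <- !is.na(y)
# Also keep leading NAs: before the first non-NA value the running count is 0.
keep <- keep | (cumsum(keep) == 0)

# y[keep] holds the values to repeat; cumsum(keep) says which one applies where.
y <- y[keep][cumsum(keep)]
y
```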
I want to add another solution, which uses the runner R package from CRAN. The whole package is optimized and most of it is written in C++, so it offers great efficiency.
An option in base, derived from the answers of @Montgomery-Clift and @AdamO, replacing NAs with the latest non-NA value could be:
When only a few NAs exist, they could be overwritten with the latest non-NA value instead of creating a new vector.
When speed is important, a loop propagating the last non-NA value could be written using Rcpp. To be flexible on the input type, this can be done using a template.
Those functions can be used inside lapply to apply them to all columns of a data.frame.
Other answers using Rcpp, specialized on a data type, look like the following, but they also update the input vector.
Benchmark
Result
Depending on how many NAs are filled, either data.table::nafill or vctrs::vec_fill_missing is the fastest.
Reduce is a nice functional programming concept that may be useful for similar tasks. Unfortunately in R it is ~70 times slower than repeat.before in the above answer.
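A minimal sketch of what a Reduce-based forward fill can look like (an illustration of the idea, not necessarily the code that was benchmarked): the accumulator carries the last value seen, and accumulate = TRUE keeps every intermediate result.

```r
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# prev is the filled value so far, cur is the next raw element.
filled <- Reduce(function(prev, cur) if (is.na(cur)) prev else cur,
                 y, accumulate = TRUE)
filled
```

The function is called once per element at R level, which is the likely reason for the slowdown compared with the vectorized answers.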