Last observation carried forward in a data frame?
I wish to implement "Last Observation Carried Forward" (LOCF) for a data set I am working on, which has missing values at the end of it.
Here is some simple code to do it (my question follows):
    LOCF <- function(x) {
        # Last Observation Carried Forward (for a left-to-right series)
        observed <- which(!is.na(x))
        if (length(observed) == 0) return(x)  # all NA: nothing to carry forward
        last <- max(observed)  # position of the last observation to carry forward
        x[last:length(x)] <- x[last]
        x
    }
    # examples:
    LOCF(c(1,2,3,4,NA,NA))   # 1 2 3 4 4 4
    LOCF(c(1,NA,3,4,NA,NA))  # 1 NA 3 4 4 4 (only the trailing NAs are filled)
Now this works great for simple vectors. But if I were to try to use it on a data frame:
    a <- data.frame(rep("a",4), 1:4, 1:4, c(1,NA,NA,NA))
    a
    t(apply(a, 1, LOCF)) # will make a mess
It will turn my data frame into a character matrix.
Can you think of a way to do LOCF on a data.frame without turning it into a matrix? (I could use loops and such to correct the mess, but would love a more elegant solution.)
Comments (7)
This already exists:
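The code that went with this answer did not survive on this page. Judging by the next reply, the function meant here is presumably na.locf() from the zoo package. A minimal sketch under that assumption, with made-up data and column names:

```r
library(zoo)

# Vector case: each NA is replaced by the most recent non-NA value
na.locf(c(1, NA, 3, 4, NA, NA))

# Leading NAs are dropped by default; na.rm = FALSE keeps them in place
na.locf(c(NA, 1, NA, 2), na.rm = FALSE)

# Data frame case (illustrative data): each column is filled downwards
# and the result is still a data frame
a <- data.frame(id = rep("a", 4), x = c(1, NA, NA, 4), y = c(1, 2, NA, NA),
                stringsAsFactors = FALSE)
filled <- na.locf(a)
filled
```

Note that this fills down each column, whereas the question carries values across each row.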
If you do not want to load a big package like zoo just for the na.locf function, here is a short solution which also works if there are some leading NAs in the input vector.
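The snippet itself is missing from this page. One base R way to get this behaviour, offered only as a sketch (locf_base is a hypothetical name, and this is not necessarily the code the answer originally showed), is to index the observed values by a running count of observations:

```r
# locf_base is a hypothetical helper name, not part of any package
locf_base <- function(x) {
  observed <- !is.na(x)
  # cumsum(observed) counts the observations seen so far; slot 1 of the
  # lookup vector holds NA, so positions before the first observation stay NA
  c(NA, x[observed])[cumsum(observed) + 1]
}

locf_base(c(NA, 1, NA, 3, NA))
```

This is meant for plain atomic vectors (numeric, character, logical); factors would need extra care.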
Adding the new tidyr::fill() function for carrying forward the last observation in a column to fill in NAs:
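The example that followed is not preserved here. A minimal sketch of how tidyr::fill() is typically called, using an illustrative data frame whose column names are assumptions:

```r
library(tidyr)

# Illustrative data, not taken from the original answer
df <- data.frame(id = c("a", "b", "c", "d"), value = c(1, NA, NA, 4))

# fill() carries the last non-missing value of the selected column downwards
filled <- fill(df, value)
filled
```

The result is still a data frame, so the other columns keep their types.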
There are a bunch of packages implementing exactly this functionality.
(with same basic functionality, but some differences in additional options)
Added a benchmark of these methods for @Alex:
I used the microbenchmark package and the tsNH4 time series, which has 4552 observations.
These are the results:
So for this case na_locf from imputeTS was the fastest, closely followed by na.locf0 from zoo. The other methods were significantly slower. Keep in mind, though, that this benchmark was run on one specific time series, so the ranking may differ for your data.
Results as a plot:
Here is the code, if you want to recreate the benchmark with a time series of your own choosing:
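The benchmark code is not preserved on this page. The sketch below is a reconstruction under assumptions, not the answer's original script: it compares only the two functions named above and uses 100 repetitions, a number chosen arbitrarily here.

```r
library(microbenchmark)
library(imputeTS)  # provides na_locf() and the tsNH4 series
library(zoo)       # provides na.locf0()

# Swap in your own numeric series here to test your specific use case
x <- as.numeric(tsNH4)

bench <- microbenchmark(
  imputeTS_na_locf = imputeTS::na_locf(x),
  zoo_na.locf0     = zoo::na.locf0(x),
  times = 100
)
print(bench)
```

Other candidates can be added as further named expressions inside the microbenchmark() call.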
This question is old, but for posterity: a good solution is to use the data.table package with roll = TRUE.
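No example is shown for this answer. A minimal sketch of a rolling join that carries the last observation forward, assuming an illustrative table with a time column t and a value column (both names invented here):

```r
library(data.table)

# Observations exist only at t = 1, 3 and 4 (illustrative data)
obs <- data.table(t = c(1L, 3L, 4L), value = c(10, 30, 40))

# Rolling join: each requested t takes the value of the latest
# observation at or before it
filled <- obs[.(t = 1:6), on = "t", roll = TRUE]
filled
```

Here the gaps at t = 2, 5 and 6 are filled from the preceding observations.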
I ended up solving this using a loop:
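The loop itself is missing from this page. One way such a loop could look, as a sketch only (locf_rows and the column names are hypothetical), is to go row by row and copy the last observed value into the trailing NA cells one cell at a time, so the data frame is never coerced to a matrix:

```r
# locf_rows is a hypothetical helper, not the original author's code
locf_rows <- function(df) {
  for (i in seq_len(nrow(df))) {
    observed <- which(!is.na(df[i, ]))
    if (length(observed) == 0) next        # row is all NA: nothing to carry
    last <- max(observed)                  # last observed column in this row
    if (last < ncol(df)) {
      for (j in (last + 1):ncol(df)) df[i, j] <- df[i, last]
    }
  }
  df
}

a <- data.frame(id = rep("a", 4), x = 1:4, y = 1:4, z = c(1, NA, NA, NA),
                stringsAsFactors = FALSE)
res <- locf_rows(a)
res
```

Because values are assigned cell by cell, each column keeps its own type as long as the carried value is compatible with the target column.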
Instead of apply() you can use lapply() and then transform the resulting list to a data.frame.
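A minimal sketch of that idea, assuming a trailing-NA fill function like the one in the question and an illustrative data frame. lapply() visits the columns one at a time, so each column keeps its type, but the fill runs down each column rather than across each row:

```r
# Same idea as the question's function: fill only the trailing NAs
LOCF <- function(x) {
  observed <- which(!is.na(x))
  if (length(observed) == 0) return(x)
  last <- max(observed)
  x[last:length(x)] <- x[last]
  x
}

# Illustrative data; the column names are assumptions
a <- data.frame(id = rep("a", 4), x = c(1, 2, NA, NA), y = c(1, NA, 3, NA),
                stringsAsFactors = FALSE)

# lapply() returns a list of filled columns, which is turned back into a data frame
filled <- as.data.frame(lapply(a, LOCF), stringsAsFactors = FALSE)
filled
```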