Last Observation Carried Forward in a data frame?

Posted 2024-08-31 03:16:38

I wish to implement a "Last Observation Carried Forward" for a data set I am working on which has missing values at the end of it.

Here is some simple code to do it (the question follows):

LOCF <- function(x)
{
    # Last Observation Carried Forward (for a left to right series)
    LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
    x[LOCF:length(x)] <- x[LOCF]
    return(x)
}


# example:
LOCF(c(1,2,3,4,NA,NA))
LOCF(c(1,NA,3,4,NA,NA))

Now this works great for simple vectors. But if I were to try to use it on a data frame:

a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
a
t(apply(a, 1, LOCF)) # will make a mess

It will turn my data frame into a character matrix.

Can you think of a way to do LOCF on a data.frame without turning it into a matrix? (I could use loops and such to correct the mess, but would love a more elegant solution.)
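
For context, the coercion comes from apply() itself: it converts its input with as.matrix(), and a matrix can hold only one type, so a single character column turns every cell into a string. A minimal base R sketch of that behaviour (the column names id, x1, x2, x3 are made up for illustration; the data frame above uses default names):

```r
# Hypothetical column names, same data as in the question
a <- data.frame(id = rep("a", 4), x1 = 1:4, x2 = 1:4, x3 = c(1, NA, NA, NA))

# apply() operates on as.matrix(a); the character column forces
# every cell of the matrix to character
m <- as.matrix(a)
typeof(m)

# Column classes of the original data frame, for comparison
sapply(a, class)
```

This is why any approach that goes through apply() has to restore the column types afterwards.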


烟酒忠诚 2024-09-07 03:16:38

This already exists:

library(zoo)
na.locf(data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA)))
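
One detail that may matter here, shown as a small sketch on a plain vector: by default (na.rm = TRUE) na.locf() drops leading NAs, so the result can be shorter than the input. Passing na.rm = FALSE keeps the original length. As far as I know, on a data frame the function fills down each column rather than across each row.

```r
library(zoo)

x <- c(NA, 1, NA, 2, NA)

# Default: the leading NA is removed, so the result has 4 elements
na.locf(x)

# Keep the leading NA and the original length of 5
na.locf(x, na.rm = FALSE)
```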
辞取 2024-09-07 03:16:38

If you do not want to load a big package like zoo just for the na.locf function, here is a short solution which also works if there are some leading NAs in the input vector.

na.locf <- function(x) {
  v <- !is.na(x)
  c(NA, x[v])[cumsum(v)+1]
}
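
To see why the indexing trick works, here is the same computation unrolled step by step on a small vector (the variable names idx, lookup and filled are only for illustration):

```r
x <- c(NA, 5, NA, NA, 7, NA)

v <- !is.na(x)          # TRUE where a value was observed
idx <- cumsum(v) + 1    # 1 2 2 2 3 3: position in the lookup vector
lookup <- c(NA, x[v])   # NA 5 7: slot 1 serves the leading NAs

filled <- lookup[idx]   # NA 5 5 5 7 7
filled
```

cumsum(v) counts how many observed values have been seen so far, so each position picks up the most recent one, and positions before the first observation fall back to the NA in slot 1.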
故事还在继续 2024-09-07 03:16:38

Adding the new tidyr::fill() function for carrying forward the last observation in a column to fill in NAs:

a <- data.frame(col1 = rep("a",4), col2 = 1:4, 
                col3 = 1:4, col4 = c(1,NA,NA,NA))
a
#   col1 col2 col3 col4
# 1    a    1    1    1
# 2    a    2    2   NA
# 3    a    3    3   NA
# 4    a    4    4   NA

tidyr::fill(a, col4)  # direct call, so no pipe operator has to be loaded
#   col1 col2 col3 col4
# 1    a    1    1    1
# 2    a    2    2    1
# 3    a    3    3    1
# 4    a    4    4    1
隐诗 2024-09-07 03:16:38

Several packages implement exactly this functionality.
(The basic behaviour is the same; they differ in their additional options.)

  • spacetime::na.locf
  • imputeTS::na_locf
  • zoo::na.locf
  • xts::na.locf
  • tidyr::fill

Added a benchmark of these methods for @Alex:

I used the microbenchmark package and the tsNH4 time series, which has 4552 observations.
These are the results:
[Image: benchmark results]

In this case na_locf from imputeTS was the fastest, closely followed by na.locf0 from zoo. The other methods were significantly slower. Bear in mind that this benchmark uses just one specific time series; the code below lets you test your own use case.

Results as a plot:
[Image: plot of the benchmark results]

Here is the code, if you want to recreate the benchmark with a time series of your own choosing:

library(microbenchmark)
library(imputeTS)
library(zoo)
library(xts)
library(spacetime)
library(tidyr)

# Create a data.frame from the tsNH4 series (shipped with imputeTS)
df <- as.data.frame(tsNH4)

res <- microbenchmark(imputeTS::na_locf(tsNH4),
                    zoo::na.locf0(tsNH4),
                    zoo::na.locf(tsNH4), 
                    tidyr::fill(df, everything()), 
                    spacetime::na.locf(tsNH4), 
                    times = 100)
ggplot2::autoplot(res)

plot(res)

# code just to show that each method produces the correct output
spacetime::na.locf(tsNH4)
imputeTS::na_locf(tsNH4)
zoo::na.locf(tsNH4)
zoo::na.locf0(tsNH4)
tidyr::fill(df, everything())
烟─花易冷 2024-09-07 03:16:38

This question is old, but for posterity: the best solution is to use the data.table package with roll = TRUE.
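
No code was given with this answer, so the following is only one possible reading of it, as a sketch: a rolling join against the full set of time points carries each observation forward to the rows that follow it. The table and column names (obs, time, value) are invented for illustration. For plain numeric vectors, data.table also provides nafill() with type = "locf".

```r
library(data.table)

# Sparse observations: values are known only at times 1 and 4
obs <- data.table(time = c(1L, 4L), value = c(10, 20))

# Rolling join: each requested time takes the most recent earlier observation
filled <- obs[.(time = 1:6), on = "time", roll = TRUE]
filled

# Vector alternative for numeric data
nafill(c(1, NA, NA, 4, NA), type = "locf")
```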

无名指的心愿 2024-09-07 03:16:38

I ended up solving this using a loop:

fillInTheBlanks <- function(S) {
  L <- !is.na(S)
  c(S[L][1], S[L])[cumsum(L)+1]
}


LOCF.DF <- function(xx)
{
    # won't work well if the first observation is NA

    orig.class <- lapply(xx, class)

    new.xx <- data.frame(t( apply(xx,1, fillInTheBlanks) ))

    for(i in seq_along(orig.class))
    {
        if(orig.class[[i]] == "factor") new.xx[,i] <- as.factor(new.xx[,i])
        if(orig.class[[i]] == "numeric") new.xx[,i] <- as.numeric(new.xx[,i])
        if(orig.class[[i]] == "integer") new.xx[,i] <- as.integer(new.xx[,i])   
    }

    #t(na.locf(t(a)))

    return(new.xx)
}

a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
LOCF.DF(a)
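
A possible alternative that avoids the round trip through a character matrix, as a rough sketch: walk across the chosen columns from left to right and fill each column's NAs from the column before it. The function name locf_rows, the cols argument and the column names are made up here, and the sketch assumes the selected columns hold compatible types.

```r
# Hypothetical helper: row-wise LOCF over the columns named in `cols`,
# processed in the given (left to right) order
locf_rows <- function(df, cols) {
  for (j in seq_along(cols)[-1]) {
    cur <- cols[j]
    prev <- cols[j - 1]
    miss <- is.na(df[[cur]])
    # prev has already been filled, so values cascade across several NAs
    df[[cur]][miss] <- df[[prev]][miss]
  }
  df
}

a <- data.frame(id = rep("a", 4), x1 = 1:4, x2 = 1:4, x3 = c(1, NA, NA, NA))
res <- locf_rows(a, c("x1", "x2", "x3"))
res
```

Because the id column is never touched, it keeps its original class, and the numeric columns are never converted to text.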
翻身的咸鱼 2024-09-07 03:16:38

Instead of apply() you can use lapply() and then transform the resulting list to data.frame.

LOCF <- function(x) {
    # Last Observation Carried Forward (for a left to right series)
    LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
    x[LOCF:length(x)] <- x[LOCF]
    return(x)
}

a <- data.frame(rep("a",4), 1:4, 1:4, c(1, NA, NA, NA))
a
data.frame(lapply(a, LOCF))
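
Note that this works column by column, and that LOCF() as written fails on a column with no observed values, because max() of an empty set is -Inf. A guarded variant, as a sketch (locf_tail and the column names are illustrative), which also assigns into a[] so the result stays a data frame with its original names:

```r
# Hypothetical helper: carry the last non-NA value to the end of the vector,
# returning the input unchanged when it has no observed values
locf_tail <- function(x) {
  observed <- which(!is.na(x))
  if (length(observed) == 0) return(x)
  last <- max(observed)
  x[last:length(x)] <- x[last]
  x
}

a <- data.frame(id = rep("a", 4), x1 = 1:4, x2 = 1:4,
                x3 = c(1, NA, NA, NA), x4 = NA_real_)

# Assigning into a[] keeps the data frame's names and row names
a[] <- lapply(a, locf_tail)
a
```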