Carrying the last observation forward in a data frame?

Posted 2024-08-31 03:16:38 · 662 words · 6 views · 0 comments

I wish to implement a "Last Observation Carried Forward" for a data set I am working on which has missing values at the end of it.

Here is some simple code to do it (my question follows):

LOCF <- function(x)
{
    # Last Observation Carried Forward (for a left to right series)
    LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
    x[LOCF:length(x)] <- x[LOCF]
    return(x)
}


# example:
LOCF(c(1,2,3,4,NA,NA))
LOCF(c(1,NA,3,4,NA,NA))

Now this works great for simple vectors. But if I were to try to use it on a data frame:

a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
a
t(apply(a, 1, LOCF)) # will make a mess

It will turn my data frame into a character matrix.
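
The coercion happens because apply() first converts the data frame with as.matrix(), and a matrix can hold only one type. A minimal sketch illustrating this (the column names id, x, y and z are added here for readability and are not in the original):

```r
a <- data.frame(id = rep("a", 4), x = 1:4, y = 1:4, z = c(1, NA, NA, NA))

# apply() works on a matrix, so the data frame is converted first.
m <- as.matrix(a)

# One non-numeric column is enough to force every cell to character.
typeof(m)

# The data frame itself still has one type per column.
col_classes <- sapply(a, class)
col_classes
```

Any approach that goes through apply() inherits this conversion, which is why the answers below work on columns or on the data frame directly.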

Can you think of a way to do LOCF on a data.frame without turning it into a matrix? (I could use loops and such to correct the mess, but would love a more elegant solution.)

烟酒忠诚 2024-09-07 03:16:38

This already exists:

library(zoo)
na.locf(data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA)))
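
One detail that may matter here is how zoo::na.locf() treats leading NAs. By default (na.rm = TRUE) they are dropped, so the result can be shorter than the input; na.rm = FALSE keeps them. A small sketch, with a vector x made up for illustration:

```r
library(zoo)

x <- c(NA, 1, NA, 3, NA)

# Default: the leading NA is removed, so the result has 4 elements.
dropped <- na.locf(x)

# na.rm = FALSE keeps the leading NA and preserves the length.
kept <- na.locf(x, na.rm = FALSE)

dropped
kept
```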

辞取 2024-09-07 03:16:38

If you do not want to load a big package like zoo just for the na.locf function, here is a short solution which also works if there are some leading NAs in the input vector.

na.locf <- function(x) {
  v <- !is.na(x)
  c(NA, x[v])[cumsum(v)+1]
}
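
As a usage sketch, the same idea can be applied column by column while keeping the data.frame structure by assigning into df[]. The function is renamed locf_vec here (a hypothetical name) to avoid masking zoo::na.locf, and the data frame df is invented for illustration. One caveat worth checking on your own data: c(NA, x[v]) may not preserve factor columns, so this sketch uses character and numeric columns only.

```r
# Same logic as above, under a name that does not clash with zoo.
locf_vec <- function(x) {
  v <- !is.na(x)                 # TRUE where a value is observed
  c(NA, x[v])[cumsum(v) + 1]     # index 1 (NA) is used before the first value
}

# A leading NA stays NA; later gaps take the previous observed value.
filled_vec <- locf_vec(c(NA, 1, NA, 3, NA, NA))

# Assigning into df[] replaces the columns but keeps df a data.frame.
df <- data.frame(id = c("a", NA, "b"), x = c(NA, 2, NA),
                 stringsAsFactors = FALSE)
df[] <- lapply(df, locf_vec)
df
```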

故事还在继续 2024-09-07 03:16:38

Adding the new tidyr::fill() function for carrying forward the last observation in a column to fill in NAs:

a <- data.frame(col1 = rep("a",4), col2 = 1:4, 
                col3 = 1:4, col4 = c(1,NA,NA,NA))
a
#   col1 col2 col3 col4
# 1    a    1    1    1
# 2    a    2    2   NA
# 3    a    3    3   NA
# 4    a    4    4   NA

library(tidyr)  # attaches fill() and the %>% pipe
a %>% fill(col4)
#   col1 col2 col3 col4
# 1    a    1    1    1
# 2    a    2    2    1
# 3    a    3    3    1
# 4    a    4    4    1

隐诗 2024-09-07 03:16:38

There are a bunch of packages implementing exactly this functionality.
(They share the same basic functionality but differ in their additional options.)

spacetime::na.locf
imputeTS::na_locf
zoo::na.locf
xts::na.locf
tidyr::fill

Added a benchmark of these methods for @Alex:

I used the microbenchmark package and the tsNH4 time series, which has 4552 observations.
These are the results:

So for this case na_locf from imputeTS was the fastest, closely followed by na.locf0 from zoo. The other methods were significantly slower. Be careful, though: this benchmark uses only one specific time series. (I added the code so that you can test your own use case.)

Results as a plot: [figure not included]

Here is the code, if you want to recreate the benchmark with a time series of your own choosing:

library(microbenchmark)
library(imputeTS)
library(zoo)
library(xts)
library(spacetime)
library(tidyr)

# Create a data.frame from the tsNH4 series
df <- as.data.frame(tsNH4)

res <- microbenchmark(imputeTS::na_locf(tsNH4),
                    zoo::na.locf0(tsNH4),
                    zoo::na.locf(tsNH4), 
                    tidyr::fill(df, everything()), 
                    spacetime::na.locf(tsNH4), 
                    times = 100)
ggplot2::autoplot(res)

plot(res)

# code just to show that each method produces correct output
spacetime::na.locf(tsNH4)
imputeTS::na_locf(tsNH4)
zoo::na.locf(tsNH4)
zoo::na.locf0(tsNH4)
tidyr::fill(df, everything())

烟─花易冷 2024-09-07 03:16:38

This question is old, but for posterity... the best solution is to use the data.table package with roll=T.
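
A minimal sketch of what a rolling join with roll = TRUE might look like. The table obs and its columns t and value are made up for illustration, and this is one reading of the suggestion rather than the commenter's actual code. Each requested time point is matched to the most recent observation at or before it.

```r
library(data.table)

# Observations exist only at t = 1 and t = 4.
obs <- data.table(t = c(1L, 4L), value = c(10, 40), key = "t")

# Look up every time point from 1 to 6; roll = TRUE carries the last
# observation forward for time points that have no exact match.
filled <- obs[.(1:6), roll = TRUE]
filled$value
```

This fits data that is keyed by a time or sequence column; for a plain vector, the shorter solutions in the other answers are probably simpler.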

无名指的心愿 2024-09-07 03:16:38

I ended up solving this using a loop:

fillInTheBlanks <- function(S) {
  L <- !is.na(S)
  c(S[L][1], S[L])[cumsum(L)+1]
}


LOCF.DF <- function(xx)
{
    # won't work well if the first observation is NA

    orig.class <- lapply(xx, class)

    new.xx <- data.frame(t( apply(xx,1, fillInTheBlanks) ))

    for(i in seq_along(orig.class))
    {
        if(orig.class[[i]] == "factor") new.xx[,i] <- as.factor(new.xx[,i])
        if(orig.class[[i]] == "numeric") new.xx[,i] <- as.numeric(new.xx[,i])
        if(orig.class[[i]] == "integer") new.xx[,i] <- as.integer(new.xx[,i])   
    }

    #t(na.locf(t(a)))

    return(new.xx)
}

a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
LOCF.DF(a)

翻身的咸鱼 2024-09-07 03:16:38

Instead of apply() you can use lapply() and then transform the resulting list to data.frame.

LOCF <- function(x) {
    # Last Observation Carried Forward (for a left to right series)
    LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
    x[LOCF:length(x)] <- x[LOCF]
    return(x)
}

a <- data.frame(rep("a",4), 1:4, 1:4, c(1, NA, NA, NA))
a
data.frame(lapply(a, LOCF))
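
Note that lapply() processes the data frame column by column, while the apply(a, 1, ...) call in the question works along each row. A minimal sketch of a row-wise alternative that never builds a matrix: walk the chosen columns from left to right and fill each NA from the column to its left. The function name locf_rows, its cols argument, and the column names are hypothetical, and the sketch assumes the selected columns share a compatible type.

```r
locf_rows <- function(df, cols) {
  # Visit the selected columns left to right, starting at the second one.
  for (j in seq_along(cols)[-1]) {
    prev <- df[[cols[j - 1]]]   # already-filled column to the left
    cur  <- df[[cols[j]]]
    miss <- is.na(cur)
    cur[miss] <- prev[miss]     # carry the left-hand value into the gaps
    df[[cols[j]]] <- cur
  }
  df
}

a <- data.frame(id = rep("a", 4), x = 1:4,
                y = c(1L, NA, 3L, NA), z = c(1, NA, NA, NA))
res <- locf_rows(a, c("x", "y", "z"))
res
```

Because only the listed columns are touched, the id column keeps its original type and the result is still a data.frame.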
