Interpolating NA values in a data frame with na.approx

Posted on 2024-12-03 04:02:24

I am trying to remove NAs from my data frame by interpolation with na.approx() but can't remove all of the NAs.

My data frame is 4096x4096, with 270.15 used as the flag for invalid values. I need the data to be continuous at all points to feed a meteorological model. Yesterday I asked, and got an answer, on how to replace values in one data frame based on another data frame. After that I came across na.approx(), so I decided to replace the 270.15 values with NA and use na.approx() to interpolate the data. My question is why na.approx() does not replace all of the NAs.

This is what I am doing:

  • Read the original hdf file with hdf5load
  • Subset the data frame (4094x4096)
  • Substitute flag value with NA

    > sst4[sst4 == 270.15 ] = NA
    
  • Check first column (or any other)

    > summary(sst4[,1])
    
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    271.3   276.4   285.9   285.5   292.3   302.8  1345.0
    
  • Run na.approx

    > sst4=na.approx(sst4,na.rm="FALSE")
    
  • Check first column

    > summary(sst4[,1]) 
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    271.3   276.5   286.3   285.9   292.6   302.8   411.0
    

As you can see 411 NA's have not been removed. Why? Do they all correspond to leading/ending column values?

head(sst4[,1])
[1] NA NA NA NA NA NA
tail(sst4[,1])
[1] NA NA NA NA NA NA

Does na.approx need valid values both before and after an NA in order to interpolate it? Do I need to set any other na.approx option?

Thank you very much

Comments (3)

年华零落成诗 2024-12-10 04:02:24

na.approx() follows the approx() function in only interpolating values, not extrapolating them, by default. However, as described in the help page for approx(), you can specify rule = 2 to extrapolate as a constant value of the nearest extreme. Following on from Richie Cotton's example:

na.approx(m, rule = 2)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592  6.178627 38.41037
[4,] 90.82078 66.07978  6.178627 38.41037

Equivalently, you can use "last observation carry forward" explicitly.

na.locf(na.approx(m))
## "first observation carry backwards" too:
na.locf(na.locf(na.approx(m)), fromLast = TRUE)
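
To make the edge behaviour easier to see, here is a minimal sketch on a single made-up vector rather than the matrix m. The values in x are purely illustrative; na.rm = FALSE is used so that the result keeps the original length.

```r
library(zoo)

# Toy vector (made-up values) with leading, interior and trailing NAs
x <- c(NA, NA, 1, NA, 3, NA, NA)

# Default rule: only the interior gap is interpolated, edge NAs are kept
interior_only <- na.approx(x, na.rm = FALSE)

# rule = 2 is passed on to approx(): edges take the nearest observed value
extended <- na.approx(x, rule = 2)

# Same result via carrying observations forward, then backward
carried <- na.locf(na.locf(interior_only, na.rm = FALSE), fromLast = TRUE)

print(interior_only)
print(extended)
print(carried)
```

Here extended and carried should agree, since both fill each edge with the closest non-missing value.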

ら栖息 2024-12-10 04:02:24

A small, reproducible example:

library(zoo)
set.seed(1)
m <- matrix(runif(16, 0, 100), nrow = 4)
missing_values <- sample(16, 7)
m[missing_values] <- NA
m
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239       NA  6.178627 38.41037
[3,]       NA       NA        NA       NA
[4,] 90.82078 66.07978        NA       NA

na.approx(m)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592        NA       NA
[4,] 90.82078 66.07978        NA       NA

m[4, 4] <- 50
na.approx(m)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592        NA 44.20519
[4,] 90.82078 66.07978        NA 50.00000

Yup, looks like you do need the start/end values of columns to be known or the interpolation doesn't work. Can you guess values for your boundaries?

ANOTHER EDIT: So by default, you need the start and end values of columns to be known. However, it is possible to get na.approx to always fill in the blanks by passing rule = 2. See Felix's answer. You can also use na.fill to provide a default value, as per Gabor's comment. Finally, you can interpolate boundary conditions in two directions (see below) or guess boundary conditions.
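
As a rough illustration of the na.fill option, the sketch below applies it to a made-up vector. The fill value 50 is an arbitrary assumption, and "extend" is the keyword described in the zoo help page for na.fill (interpolate inside, repeat the nearest value at the edges).

```r
library(zoo)

# Toy vector (made-up values) with NAs at both edges and in the middle
x <- c(NA, 1, NA, 3, NA)

# Replace every NA with one arbitrary default value
constant_fill <- na.fill(x, 50)

# "extend": interpolate interior gaps, repeat the nearest value at the edges
extended_fill <- na.fill(x, "extend")

print(constant_fill)
print(extended_fill)
```

For a matrix such as sst4, one option would be to apply this column by column, but that is left out here to keep the sketch small.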


EDIT: A further thought. Since na.approx only interpolates within columns, and your data is spatial, perhaps interpolating within rows would be useful too. Then you could take the average.

na.approx fails when whole columns are NA, so we create a bigger dataset.

set.seed(1)
m <- matrix(runif(64, 0, 100), nrow = 8)
missing_values <- sample(64, 15)
m[missing_values] <- NA

Run na.approx both ways.

by_col <- na.approx(m)
by_row <- t(na.approx(t(m)))

Find out the best guess.

default <- 50
best_guess <- ifelse(is.na(by_row), 
  ifelse(
    is.na(by_col), 
    default,              #neither known
    by_col                #only by_col known
  ), 
  ifelse(
    is.na(by_col), 
    by_row,               #only by_row known
    (by_row + by_col) / 2 #both known
  )
)
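
The same idea can be checked by hand on a tiny grid. This is only a sketch: the 3x3 values and the default of 50 are made up, and the nested ifelse is swapped for an na.rm-aware mean over the two estimates, which should give the same result.

```r
library(zoo)

# Hand-made 3x3 grid (hypothetical values): missing corner and missing centre
m_small <- matrix(c(NA,  2, 3,
                     4, NA, 6,
                     7, 10, 9), nrow = 3, byrow = TRUE)

# Interpolate down the columns and along the rows, keeping the dimensions
by_col_small <- na.approx(m_small, na.rm = FALSE)
by_row_small <- t(na.approx(t(m_small), na.rm = FALSE))

# Average whichever estimates exist; cells with no estimate become NaN
both <- array(c(by_col_small, by_row_small), dim = c(dim(m_small), 2))
best_small <- apply(both, c(1, 2), mean, na.rm = TRUE)

# Fall back to an arbitrary default where neither direction could interpolate
best_small[is.nan(best_small)] <- 50

print(best_small)
```

The centre cell gets 5.5 (the mean of 6 from the column and 5 from the row), while the corner cell has no neighbours on one side in either direction and so falls back to the default.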

慕巷 2024-12-10 04:02:24

I think you should try setting na.rm = TRUE.

From the docs:

na.rm logical. Should leading NAs be removed?

http://www.oga-lab.net/RGM2/func.php?rd_id=zoo:na.approx
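
A quick way to see what na.rm does is to try it on a plain vector. The sketch below uses made-up values. As far as I understand the zoo documentation, na.rm = TRUE trims leading and trailing NAs that remain after interpolation rather than filling them, so the result is shorter than the input.

```r
library(zoo)

# Toy vector (made-up values) with NAs at both ends
x <- c(NA, 1, NA, 3, NA)

# na.rm = TRUE (the default): edge NAs are dropped, not interpolated
trimmed <- na.approx(x, na.rm = TRUE)

# na.rm = FALSE: edge NAs are kept and the length is preserved
kept <- na.approx(x, na.rm = FALSE)

print(length(trimmed))
print(kept)
```

If the grid has to keep its shape for a downstream model, dropping values this way may not be what is wanted.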
