Interpolate missing values in a time series with a seasonal cycle
I have a time series for which I want to intelligently interpolate the missing values. The value at a particular time is influenced by a multi-day trend, as well as its position in the daily cycle.
Here is an example in which the tenth observation is missing from myzoo:
library(zoo)
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- zoo(obs, index)
myzoo[10] <- NA
If I had to implement this, I'd use some kind of weighted mean of close times on nearby days, or add a value for the day to a line fitted to the larger trend, but I hope a package or function already exists for this situation?
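One way to read the second idea above (an offset for the position in the daily cycle added to a fitted trend line) is an ordinary regression on a time index plus a time-of-day factor. The following is a minimal sketch under that assumption, not an established method from any package. The names `dat`, `slot`, and `filled` are made up for illustration, and the example is rebuilt as a plain vector so it runs without zoo.

```r
# Rebuild the example data from the question as a plain numeric vector.
dayvals  <- (1:4) * 10
timevals <- c(3, 1, 2, 4)
obs <- rep(dayvals, each = 4) + rep(timevals, times = 4)
obs[10] <- NA

dat <- data.frame(
  obs  = obs,
  t    = seq_along(obs),               # position in the series (trend term)
  slot = factor(rep(1:4, times = 4))   # position within the daily cycle
)

# lm() drops the row with the missing response by default.
fit <- lm(obs ~ t + slot, data = dat)

# Fill only the missing positions with the model's prediction.
filled <- obs
filled[is.na(obs)] <- predict(fit, newdata = dat[is.na(obs), ])
filled[10]
```

For this particular toy series the trend and the daily offsets are exactly linear and additive, so the prediction comes out at about 31, the value that was removed. Real data will not fit this cleanly, and a straight-line trend is an assumption that may need replacing.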
EDIT: Modified the code slightly to clarify my problem. There are na.* methods that interpolate from nearest neighbors, but in this case they do not recognize that the missing value is at the time that is the lowest value of the day. Maybe the solution is to reshape the data to wide format and then interpolate, but I wouldn't like to completely disregard the contiguous values from the same day. It is worth noting that diff(myzoo, lag = 4) returns a vector of 10's. The solution may lie with some combination of reshape, na.spline, and diff.inv, but I just can't figure it out.
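The "reshape to wide format, then interpolate" idea can be sketched in base R as follows. This is only an illustration of that one idea: it interpolates each time-of-day slot across days and, as the question worries, ignores the neighbouring values within the same day. The helper `interp_row` and the names `wide` and `filled` are hypothetical.

```r
# Rebuild the example data from the question as a plain numeric vector.
dayvals  <- (1:4) * 10
timevals <- c(3, 1, 2, 4)
obs <- rep(dayvals, each = 4) + rep(timevals, times = 4)
obs[10] <- NA

# Wide format: one row per time-of-day slot, one column per day.
wide <- matrix(obs, nrow = 4)

# Linearly interpolate one slot across days; rule = 2 carries the nearest
# observed value outward if the gap is at either end.
interp_row <- function(y) {
  x  <- seq_along(y)
  ok <- !is.na(y)
  approx(x[ok], y[ok], xout = x, rule = 2)$y
}

# apply() returns one column per processed row, so transpose back.
wide_filled <- t(apply(wide, 1, interp_row))

# Back to the original long ordering (column-major matches day order).
filled <- as.vector(wide_filled)
filled[10]
```

Here the missing point sits between 21 and 41 in its own slot, so linear interpolation across days gives 31. Because diff(myzoo, lag = 4) is constant in this example, any across-day interpolation recovers the value; on real data the result depends on how stable the day-to-day change is.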
Here are three approaches that don't work:
EDIT2. Image produced using the following code.
myzoo <- zoo(obs, index)
myzoo[10] <- NA # knock out the missing point
plot(myzoo, type="o", pch=16) # plot solid line
points(na.approx(myzoo)[10], col = "red")
points(na.locf(myzoo)[10], col = "blue")
points(na.spline(myzoo)[10], col = "green")
myzoo[10] <- 31 # replace the missing point
lines(myzoo, type = "o", lty=3, pch=16) # dashed line over the gap
legend(x = "topleft",
legend = c("na.spline", "na.locf", "na.approx"),
col=c("green","blue","red"), pch = 1)
Answers (4)
Try this:
The idea is to use a basic structural model for the time series, which handles the missing value fine using a Kalman filter. Then a Kalman smoother is used to estimate each point in the time series, including any that are missing.
I had to convert your zoo object to a ts object with frequency 4 in order to use StructTS. You may want to change the fitted values back to zoo again.
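The code that accompanied this answer is not preserved here. The following is a sketch of what the description suggests, using only functions from the stats package (`StructTS` and `tsSmooth`); it is an assumption about the approach rather than the answerer's original code. The example data is rebuilt as a plain vector, and the name `est` is made up for illustration.

```r
# Rebuild the example data from the question as a plain numeric vector.
dayvals  <- (1:4) * 10
timevals <- c(3, 1, 2, 4)
obs <- rep(dayvals, each = 4) + rep(timevals, times = 4)
obs[10] <- NA

# StructTS needs a ts object; frequency = 4 encodes four observations per day.
x <- ts(obs, frequency = 4)

# Basic structural model: level + slope + seasonal component.
fit <- StructTS(x, type = "BSM")

# Kalman smoothing returns one column per state (level, slope, seasonal).
sm <- tsSmooth(fit)

# The smoothed estimate of the series is level + seasonal, so drop the slope.
est <- rowSums(sm[, -2])

# Replace only the missing positions with the smoothed estimate.
filled <- obs
filled[is.na(obs)] <- est[is.na(obs)]
filled[10]
```

With only four days of data the variance estimates are fragile, so the result should be checked against a plot of `x` and `est` before it is trusted. To get back to zoo, something like `zoo(filled, index)` with the original index should work.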
In this case, I think you want a seasonality correction in the ARIMA model. There's not enough data here to fit the seasonal model, but this should get you started.
In my tests the ARMA(3, 3) is really close, but that's just luck. With a longer time series you should be able to calibrate the seasonal correction to give you good predictions. It would help to have a good prior on the underlying mechanisms for both the signal and the seasonal correction, to get better out-of-sample performance.
forecast::na.interp is a good approach. From the documentation:
This paper evaluates several interpolation methods against real time series, and finds that na.interp is both accurate and efficient:
Also worth noting that Rob Hyndman wrote the forecast package, and included na.interp after providing his answer to this question. It's likely that na.interp is an improvement upon this approach, even though it performed worse in this instance (probably due to specifying the period in StructTS, where na.interp figures it out).
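A minimal usage sketch for `na.interp`, assuming the forecast package is installed. The data is rebuilt as a ts object with frequency 4 so that the function can treat the series as seasonal; passing the period this way is a choice made for this sketch.

```r
library(forecast)

# Rebuild the example data from the question as a plain numeric vector.
dayvals  <- (1:4) * 10
timevals <- c(3, 1, 2, 4)
obs <- rep(dayvals, each = 4) + rep(timevals, times = 4)
obs[10] <- NA

# Four observations per day, so the seasonal period is 4.
x <- ts(obs, frequency = 4)

# Interpolate the missing value; the result is a ts of the same length.
filled <- na.interp(x)
filled[10]
```

Observed values are returned unchanged, so only position 10 differs from `x`.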
The imputeTS package has a method for Kalman smoothing on the state space representation of an ARIMA model, which might be a good solution for this problem.
It also works directly with zoo time series objects. You could also use your own ARIMA model in this function, if you think you can do better than "auto.arima". This would be done this way:
But in this case you have to convert the zoo object back to ts, since arima() only accepts ts.
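The code from this answer is not preserved here, so the following is a sketch of the kind of call it describes, assuming a recent imputeTS release in which the function is named `na_kalman` (older releases used `na.kalman`). The series is built as a ts object with frequency 4; that choice, and the variable names, are assumptions of this sketch.

```r
library(imputeTS)

# Rebuild the example data from the question as a plain numeric vector.
dayvals  <- (1:4) * 10
timevals <- c(3, 1, 2, 4)
obs <- rep(dayvals, each = 4) + rep(timevals, times = 4)
obs[10] <- NA

# Four observations per day, so the seasonal period is 4.
x <- ts(obs, frequency = 4)

# Kalman smoothing on the state space form of an automatically chosen ARIMA model.
filled_arima <- na_kalman(x, model = "auto.arima")

# Kalman smoothing on a structural time series model instead.
filled_struct <- na_kalman(x, model = "StructTS")

c(arima = filled_arima[10], struct = filled_struct[10])
```

To use your own model as the answer suggests, the `model` argument also accepts a user-created state space model, such as the `model` component of an `arima()` fit on the ts version of the data.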