Why does the number of observations change the prediction of a SARIMAX model with fixed coefficients?
After training a SARIMAX model, I had hoped to be able to perform forecasts with it on new observations in the future without having to retrain it. However, I noticed that the number of observations I use in the newly applied forecast changes the predictions.
From my understanding, provided that enough observations are given to allow the autoregressive and moving-average terms to be calculated correctly, the model would not even use the earlier historic observations to inform itself, since the coefficients are not being retrained. In a (3,0,1) example I would have thought it would need at least 3 observations to apply its trained coefficients. However, this does not seem to be the case, and I am questioning whether I have understood the model correctly.
As an example and test, I have applied a trained SARIMAX to the exact same data with the initial few observations removed, to test the effect of the number of rows on the prediction, using the following code:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = [348, 363, 435, 491, 505, 404, 359, 310, 337, 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405, 417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432]
ynew = y[10:]  # the same series with the first 10 observations dropped
print(ynew)

# Fit once on the full series
model = SARIMAX(endog=y, order=(3, 0, 1))
model = model.fit()
print(model.params)

# Forecast 8 steps ahead from the end of the full series
pred1 = model.predict(start=len(y), end=len(y) + 7)

# Apply the fitted coefficients to the shortened series without refitting
model2 = model.apply(ynew)
print(model.params)  # parameters are unchanged by apply()

# Forecast 8 steps ahead from the end of the shortened series
pred2 = model2.predict(start=len(ynew), end=len(ynew) + 7)

print(pd.DataFrame({'pred1': pred1, 'pred2': pred2}))
The results are as follows:
        pred1       pred2
0  472.246996  472.711770
1  494.753955  495.745968
2  498.092585  499.427285
3  489.428531  490.862153
4  477.678527  479.035869
5  469.023243  470.239459
6  465.576002  466.673790
7  466.338141  467.378903
Based on this, if I were to produce a forecast from a trained model with new observations, the change in the number of observations itself would affect the integrity of the forecast.
What is the explanation for this? What is the standard practice for applying a trained model to new observations, given that their number may change?
If I wanted to update the model but could not control for whether or not I had all of the original observations from the very start of my training set, this test would indicate that my forecast might as well be random numbers.
1 Answer
Main issue
The main problem here is that you are not using your new results object (model2) for your second set of predictions. You have:

pred2 = model.predict(start=len(ynew), end=len(ynew) + 7)

but you should have:

pred2 = model2.predict(start=len(ynew), end=len(ynew) + 7)
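To see why the original line misbehaves, here is a quick sketch reusing the variables from the question's code. Since model was fitted on all 33 observations, start=len(ynew) (= 23) falls inside its training sample:

# With `model` (fitted on the full y), start=23 lies inside the training
# sample, so this returns in-sample one-step-ahead predictions for y[23:31]
# rather than a forecast extending beyond the new data.
wrong = model.predict(start=len(ynew), end=len(ynew) + 7)
print(wrong)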
If you fix this, you get very similar predictions:

        pred1       pred2
0  472.246996  472.711770
1  494.753955  495.745968
2  498.092585  499.427285
3  489.428531  490.862153
4  477.678527  479.035869
5  469.023243  470.239459
6  465.576002  466.673790
7  466.338141  467.378903
To understand why they're not identical, there is a second issue (which is not a problem in your code, but just a statistical feature of your data/model).
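One way to convince yourself that apply() itself is not the culprit is a quick check (a sketch, reusing the question's variables): re-apply the fitted parameters to the full series and compare against pred1.

# Same parameters, same full sample: the forecast should reproduce pred1,
# so any gap above comes only from the dropped observations.
model3 = model.apply(y)
pred3 = model3.predict(start=len(y), end=len(y) + 7)
print(pred1 - pred3)  # expect (essentially) zeros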
Secondary issue
Your estimated parameters imply an extremely persistent model: the AR coefficients printed by print(model.params) are associated with a near-unit-root process (largest eigenvalue = 0.99957719).
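If you want to reproduce that eigenvalue, here is a minimal sketch: for an AR(p) process, persistence can be read off the largest modulus among the eigenvalues of the companion matrix. This assumes the parameter ordering printed by print(model.params), i.e. the three AR coefficients come first.

import numpy as np

# Companion matrix of the fitted AR(3) part; its largest eigenvalue modulus
# measures how slowly shocks (and initial conditions) die out.
ar = model.params[:3]  # ar.L1, ar.L2, ar.L3 from the SARIMAX(3,0,1) fit
companion = np.array([
    [ar[0], ar[1], ar[2]],
    [1.0,   0.0,   0.0],
    [0.0,   1.0,   0.0],
])
print(np.abs(np.linalg.eigvals(companion)).max())  # ~0.9996, i.e. near a unit root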
What this means is that it takes a very long time for the effects of a particular data point on the forecast to die out. In your case, this just means that the first 10 periods still have a small effect on the forecasts.
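You can quantify this directly (a sketch, reusing the question's variables): re-apply the fitted parameters while dropping different numbers of initial observations and watch how little the first out-of-sample forecast moves.

for k in (0, 1, 5, 10):
    res_k = model.apply(y[k:])  # same coefficients, k initial observations dropped
    fcast = res_k.predict(start=len(y) - k, end=len(y) - k)
    print(k, fcast[0])  # the first forecast barely moves as k grows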
This isn't a problem; it's just the way this particular estimated model works.
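And because the largest eigenvalue is below one, the two forecast paths do eventually converge; the gap between them decays roughly like 0.99957719**h with the horizon h, so it fades slowly but surely. A sketch, again reusing the objects above:

# Extend the horizon and compare the tails of the two forecast paths:
# the differences shrink as h grows, just very slowly.
h = 100
long1 = model.predict(start=len(y), end=len(y) + h)
long2 = model2.predict(start=len(ynew), end=len(ynew) + h)
print(pd.DataFrame({'pred1': long1, 'pred2': long2}).tail())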