Why does the number of observations change the prediction of a SARIMAX model with fixed coefficients?
After training a SARIMAX model, I had hoped to be able to perform forecasts with it on new observations in the future without having to retrain it. However, I noticed that the number of observations I use in the newly applied forecast changes the predictions.
From my understanding, provided that enough observations are given to allow the autoregressive and moving-average terms to be calculated correctly, the model would not even use the earlier historic observations to inform itself, since the coefficients are not being retrained. In a (3,0,1) example I would have thought it would need at least 3 observations to apply its trained coefficients. However, this does not seem to be the case, and I am questioning whether I have understood the model correctly.
As an example and test, I have applied a trained SARIMAX to the exact same data with the initial few observations removed, to test the effect of the number of rows on the prediction, using the following code:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = [348, 363, 435, 491, 505, 404, 359, 310, 337, 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405, 417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432]
ynew = y[10:]  # the same series with the first 10 observations dropped
print(ynew)

# Fit once on the full series
model = SARIMAX(endog=y, order=(3, 0, 1))
model = model.fit()
print(model.params)

# Forecast 8 steps ahead from the end of the full series
pred1 = model.predict(start=len(y), end=len(y) + 7)

# Apply the fitted coefficients to the shortened series without refitting
model2 = model.apply(ynew)
print(model.params)  # parameters are unchanged by apply()

# Forecast 8 steps ahead from the end of the shortened series
pred2 = model2.predict(start=len(ynew), end=len(ynew) + 7)

print(pd.DataFrame({'pred1': pred1, 'pred2': pred2}))
The results are as follows:
        pred1       pred2
0  472.246996  472.711770
1  494.753955  495.745968
2  498.092585  499.427285
3  489.428531  490.862153
4  477.678527  479.035869
5  469.023243  470.239459
6  465.576002  466.673790
7  466.338141  467.378903
Based on this, if I were to produce a forecast from a trained model with new observations, the change in the number of observations itself would affect the integrity of the forecast.
What is the explanation for this? What is the standard practice for applying a trained model to new observations, given that their number may change?
If I wanted to update the model but could not control for whether or not I had all of the original observations from the very start of my training set, this test would indicate that my forecast might as well be random numbers.
1 Answer
Main issue
The main problem here is that you are not using your new results object (model2) for your second set of predictions. You have:

pred2 = model.predict(start=len(ynew), end=len(ynew) + 7)

but you should have:

pred2 = model2.predict(start=len(ynew), end=len(ynew) + 7)
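To see why the original line misbehaves, here is a quick sketch reusing the variables from the question's code. Since model was fitted on all 33 observations, start=len(ynew) (= 23) falls inside its training sample:

# With `model` (fitted on the full y), start=23 lies inside the training
# sample, so this returns in-sample one-step-ahead predictions for y[23:31]
# rather than a forecast extending beyond the new data.
wrong = model.predict(start=len(ynew), end=len(ynew) + 7)
print(wrong)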
If you fix this, you get very similar predictions:

        pred1       pred2
0  472.246996  472.711770
1  494.753955  495.745968
2  498.092585  499.427285
3  489.428531  490.862153
4  477.678527  479.035869
5  469.023243  470.239459
6  465.576002  466.673790
7  466.338141  467.378903
To understand why they're not identical, there is a second issue (which is not a problem in your code, but just a statistical feature of your data/model).
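One way to convince yourself that apply() itself is not the culprit is a quick check (a sketch, reusing the question's variables): re-apply the fitted parameters to the full series and compare against pred1.

# Same parameters, same full sample: the forecast should reproduce pred1,
# so any gap above comes only from the dropped observations.
model3 = model.apply(y)
pred3 = model3.predict(start=len(y), end=len(y) + 7)
print(pred1 - pred3)  # expect (essentially) zeros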
Secondary issue
Your estimated parameters imply an extremely persistent model: the AR coefficients printed by print(model.params) are associated with a near-unit-root process (largest eigenvalue = 0.99957719).
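If you want to reproduce that eigenvalue, here is a minimal sketch: for an AR(p) process, persistence can be read off the largest modulus among the eigenvalues of the companion matrix. This assumes the parameter ordering printed by print(model.params), i.e. the three AR coefficients come first.

import numpy as np

# Companion matrix of the fitted AR(3) part; its largest eigenvalue modulus
# measures how slowly shocks (and initial conditions) die out.
ar = model.params[:3]  # ar.L1, ar.L2, ar.L3 from the SARIMAX(3,0,1) fit
companion = np.array([
    [ar[0], ar[1], ar[2]],
    [1.0,   0.0,   0.0],
    [0.0,   1.0,   0.0],
])
print(np.abs(np.linalg.eigvals(companion)).max())  # ~0.9996, i.e. near a unit root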
What this means is that it takes a very long time for the effects of a particular data point on the forecast to die out. In your case, this just means that the first 10 periods still have a small effect on the forecasts.
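You can quantify this directly (a sketch, reusing the question's variables): re-apply the fitted parameters while dropping different numbers of initial observations and watch how little the first out-of-sample forecast moves.

for k in (0, 1, 5, 10):
    res_k = model.apply(y[k:])  # same coefficients, k initial observations dropped
    fcast = res_k.predict(start=len(y) - k, end=len(y) - k)
    print(k, fcast[0])  # the first forecast barely moves as k grows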
This isn't a problem; it's just the way this particular estimated model works.
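And because the largest eigenvalue is below one, the two forecast paths do eventually converge; the gap between them decays roughly like 0.99957719**h with the horizon h, so it fades slowly but surely. A sketch, again reusing the objects above:

# Extend the horizon and compare the tails of the two forecast paths:
# the differences shrink as h grows, just very slowly.
h = 100
long1 = model.predict(start=len(y), end=len(y) + h)
long2 = model2.predict(start=len(ynew), end=len(ynew) + h)
print(pd.DataFrame({'pred1': long1, 'pred2': long2}).tail())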