applyInPandas with fbprophet produces unexpected count values [pyspark]
I am using applyInPandas to implement a forecast function over sampled data, using groupBy on ID. The end goal is to calculate MAPE for each ID.
import pandas as pd
from prophet import Prophet  # on older installs: from fbprophet import Prophet

def forecast_balance(history_pd: pd.DataFrame) -> pd.DataFrame:
    # ID of the group being forecast (every row in the group shares it)
    anonym_cis = history_pd.at[0, 'ID']
    # instantiate the model, configure the parameters
    model = Prophet(
        interval_width=0.95,
        growth='linear',
        daily_seasonality=True,
        weekly_seasonality=True,
        yearly_seasonality=False,
        seasonality_mode='multiplicative'
    )
    # fit the model
    model.fit(history_pd)
    # configure predictions: history dates plus 30 future daily dates
    future_pd = model.make_future_dataframe(
        periods=30,
        freq='d',
        include_history=True
    )
    # make predictions
    results_pd = model.predict(future_pd)
    results_pd.loc[:, 'ID'] = anonym_cis
    # . . .
    # return predictions
    return results_pd[['ds', 'ID', 'yhat', 'yhat_upper', 'yhat_lower']]

# run the forecast once per ID
results = (
    fr_sample
    .groupBy('ID')
    .applyInPandas(forecast_balance, schema=result_schema)
)
I am getting the expected prediction results. However, when I count the number of rows for each ID in the input data and in the output data, the counts do not match. I would like to know where these extra 30 (292-262) rows per ID are created in the process.
+----------+-----+
| ID|count|
+----------+-----+
| 482726| 262|
| 482769| 262|
| 483946| 262|
| 484124| 262|
| 484364| 262|
| 485103| 262|
+----------+-----+
+----------+-----+
| ID|count|
+----------+-----+
| 482726| 292|
| 482769| 292|
| 483946| 292|
| 484124| 292|
| 484364| 292|
| 485103| 292|
+----------+-----+
Note: This is how I calculate MAPE at the moment. It is not computed per ID but over all the data, so it yields a single value (e.g. 1.4382).
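from datetime import date

import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error

def gr_mape_val(pd_sample_df, result_df):
    # collect the Spark predictions into pandas
    result_df = result_df.toPandas()
    # keep only the dates before the forecast horizon starts
    actuals_pd = pd_sample_df[pd_sample_df['ds'] < date(2022, 3, 19)]['y']
    predicted_pd = result_df[result_df['ds'] < pd.to_datetime('2022-03-19')]['yhat']
    mape = mean_absolute_percentage_error(actuals_pd, predicted_pd)
    return mape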
To compute it per ID in a groupBy, I need the two count values above to match, but I cannot figure out how.
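A per-ID MAPE could be sketched in plain pandas as follows. This is only an illustration under assumptions: the helper name mape_per_id and the toy frames are hypothetical, the MAPE formula is written out by hand rather than taken from sklearn, and y is assumed to be non-zero. The inner join on ID and ds keeps only dates that have both an actual and a prediction, so the counts line up without any manual filtering.

```python
import pandas as pd

def mape_per_id(actuals: pd.DataFrame, predictions: pd.DataFrame) -> pd.DataFrame:
    """Return one MAPE value per ID for the dates present in both frames."""
    # Inner join on (ID, ds): rows that exist only in the predictions
    # (the forecast-only future dates) are dropped here.
    joined = actuals.merge(predictions, on=["ID", "ds"], how="inner")
    # Absolute percentage error per row; assumes y is never zero.
    ape = ((joined["y"] - joined["yhat"]) / joined["y"]).abs()
    # Average the row-level errors within each ID.
    return ape.groupby(joined["ID"]).mean().rename("mape").reset_index()

# Toy data (hypothetical): two IDs with two historical days each;
# the predictions carry one extra future day per ID.
actuals = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "ds": pd.to_datetime(["2022-03-17", "2022-03-18"] * 2),
    "y": [100.0, 200.0, 50.0, 50.0],
})
predictions = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "ds": pd.to_datetime(["2022-03-17", "2022-03-18", "2022-03-19"] * 2),
    "yhat": [110.0, 180.0, 190.0, 50.0, 55.0, 60.0],
})

print(mape_per_id(actuals, predictions))
```

With the toy data, ID 1 comes out at about 0.10 and ID 2 at about 0.05. On real data the same join could presumably be done in Spark first, with only the per-ID aggregation left to pandas.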
Comments (1)
I just found out what was going on:
Basically, with make_future_dataframe I am creating 30 extra data points per ID (periods=30), which changes the total count of predicted_pd from 262 to 292. This can be solved simply by using df.na.drop().
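The row arithmetic can be reproduced without Prophet or Spark. The sketch below is a pandas-only stand-in, not the actual pipeline: pd.date_range imitates what make_future_dataframe(periods=30, include_history=True) returns, the yhat values are placeholders, and dropna plays the role that df.na.drop() has on a Spark DataFrame. It assumes the predictions have been joined back to the actuals, since only then do the 30 future rows carry a null y that can be dropped.

```python
import pandas as pd

# Stand-in for one ID's history: 262 daily observations (toy values).
history = pd.DataFrame({
    "ds": pd.date_range("2021-07-01", periods=262, freq="D"),
    "y": range(1, 263),
})

# Imitates make_future_dataframe(periods=30, freq='d', include_history=True):
# the history dates followed by 30 dates past the last observation.
future = pd.DataFrame({
    "ds": pd.date_range(history["ds"].min(), periods=len(history) + 30, freq="D")
})

# Placeholder predictions; a fitted model would supply real yhat values.
predicted = future.assign(yhat=0.0)

# Left-join the actuals: the 30 future dates have no y, so y is NaN there.
joined = predicted.merge(history, on="ds", how="left")

# pandas counterpart of Spark's df.na.drop(): remove rows with a missing y.
aligned = joined.dropna(subset=["y"])

print(len(history), len(predicted), len(aligned))  # 262 292 262
```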