fbProphet with applyInPandas leads to unexpected count values [pyspark]

Posted on 2025-01-22 23:51:27

I am using applyInPandas to implement a forecast function over sampled data grouped by ID. The end goal is to calculate MAPE for each ID.

import pandas as pd
from prophet import Prophet  # "from fbprophet import Prophet" on older installs


def forecast_balance(history_pd: pd.DataFrame) -> pd.DataFrame:

    # applyInPandas hands each ID group to this function as a plain pandas DataFrame
    anonym_cis = history_pd.at[0, 'ID']
    
    # instantiate the model, configure the parameters
    model = Prophet(
        interval_width=0.95,
        growth='linear',
        daily_seasonality=True,
        weekly_seasonality=True,
        yearly_seasonality=False,
        seasonality_mode='multiplicative'
    )

    # fit the model
    model.fit(history_pd)

    # configure predictions
    future_pd = model.make_future_dataframe(
        periods=30,
        freq='d',
        include_history=True
    )

    # make predictions
    results_pd = model.predict(future_pd)
    results_pd.loc[:, 'ID'] = anonym_cis

    # . . .


    # return predictions
    return results_pd[['ds', 'ID', 'yhat', 'yhat_upper', 'yhat_lower']]



results = (
    fr_sample
    .groupBy('ID')
    .applyInPandas(forecast_balance, schema=result_schema)
    )
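
result_schema is not shown in the question; it has to describe the five columns returned by forecast_balance. A hypothetical sketch of what it could look like (the column types, in particular ID's, are assumptions):

from pyspark.sql.types import StructType, StructField, TimestampType, DoubleType, LongType

# hypothetical reconstruction of result_schema; adjust ID's type to match the source data
result_schema = StructType([
    StructField('ds', TimestampType(), True),
    StructField('ID', LongType(), True),
    StructField('yhat', DoubleType(), True),
    StructField('yhat_upper', DoubleType(), True),
    StructField('yhat_lower', DoubleType(), True),
])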


I am getting the expected prediction results. However, when I count the number of rows per ID in the input data and in the output data, the counts don't match. I would like to know where/how these extra 30 rows (292 - 262) per ID are getting created in the process.

Row counts per ID in the input data:

+----------+-----+
|        ID|count|
+----------+-----+
|    482726|  262|
|    482769|  262|
|    483946|  262|
|    484124|  262|
|    484364|  262|
|    485103|  262|
+----------+-----+

Row counts per ID in the output data:

+----------+-----+
|        ID|count|
+----------+-----+
|    482726|  292|
|    482769|  292|
|    483946|  292|
|    484124|  292|
|    484364|  292|
|    485103|  292|
+----------+-----+
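
These counts can be reproduced with a per-ID row count on both sides; a minimal sketch, assuming fr_sample is the input DataFrame and results is the applyInPandas output defined above:

# rows per ID before and after forecasting
fr_sample.groupBy('ID').count().orderBy('ID').show()
results.groupBy('ID').count().orderBy('ID').show()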

Note:
This is how I am calculating MAPE as of now, which is not per ID but over all the data, hence it yields a single value (e.g. 1.4382).

from datetime import date
from sklearn.metrics import mean_absolute_percentage_error  # assumed source of the metric; not shown in the original


def gr_mape_val(pd_sample_df, result_df):
    # result_df is the Spark DataFrame returned by applyInPandas; pd_sample_df holds the actuals
    result_df = result_df.toPandas()
    actuals_pd = pd_sample_df[pd_sample_df['ds'] < date(2022, 3, 19)]['y']
    predicted_pd = result_df[result_df['ds'] < pd.to_datetime('2022-03-19')]['yhat']
    mape = mean_absolute_percentage_error(actuals_pd, predicted_pd)
    return mape

To use it per ID in a groupBy fashion, I need the two count values above to match, but I can't figure out how.

Comments (1)

寒尘 2025-01-29 23:51:27

I just found out what was going on there:
Basically, with make_future_dataframe I am creating 30 extra data points per ID, which changes the total row count of the predictions (predicted_pd).

This can be solved simply by using df.na.drop():

pd_sample_df.join(result_df, on=['ID', 'ds'], how='outer').na.drop()
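
Building on that, the per-ID MAPE can then be computed entirely in Spark; a minimal sketch, assuming fr_sample (columns ID, ds, y) holds the actuals and results is the applyInPandas output (columns ID, ds, yhat); an inner join plays the same role here as the outer join followed by na.drop():

from pyspark.sql import functions as F

# assumes ds has the same date/timestamp type on both sides so the join keys match
per_id_mape = (
    fr_sample.select('ID', 'ds', 'y')
    .join(results.select('ID', 'ds', 'yhat'), on=['ID', 'ds'], how='inner')
    .groupBy('ID')
    .agg(F.mean(F.abs((F.col('y') - F.col('yhat')) / F.col('y'))).alias('mape'))
)

per_id_mape.show()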
