fbProphet with applyInPandas gives unexpected count values [pyspark]

Posted on 2025-01-22 23:51:27


I am using applyInPandas to run a forecast function over sampled data, grouped by ID. The end goal is to calculate MAPE for each ID.

    
def forecast_balance(history_pd: pd.DataFrame) -> pd.DataFrame:

    anonym_cis = history_pd.at[0,'ID']
    
    # instantiate the model, configure the parameters
    model = Prophet(
        interval_width=0.95,
        growth='linear',
        daily_seasonality=True,
        weekly_seasonality=True,
        yearly_seasonality=False,
        seasonality_mode='multiplicative'
    )

    # fit the model
    model.fit(history_pd)

    # configure predictions
    future_pd = model.make_future_dataframe(
        periods=30,
        freq='d',
        include_history=True
    )

    # make predictions
    results_pd = model.predict(future_pd)
    results_pd.loc[:, 'ID'] = anonym_cis

    # . . .


    # return predictions
    return results_pd[['ds', 'ID', 'yhat', 'yhat_upper', 'yhat_lower']]



results = (
    fr_sample
    .groupBy('ID')
    .applyInPandas(forecast_balance, schema=result_schema)
    )

I get the expected forecasts. However, when I count the rows per ID in the input data and in the output data, the counts do not match. I would like to know where and how these extra 30 rows (292 - 262) are created for each ID. The first table below shows the input counts, the second the output counts.

+----------+-----+
|        ID|count|
+----------+-----+
|    482726|  262|
|    482769|  262|
|    483946|  262|
|    484124|  262|
|    484364|  262|
|    485103|  262|
+----------+-----+


+----------+-----+
|        ID|count|
+----------+-----+
|    482726|  292|
|    482769|  292|
|    483946|  292|
|    484124|  292|
|    484364|  292|
|    485103|  292|
+----------+-----+

Note:
This is how I currently calculate MAPE. It is computed over all the data rather than per ID, so it yields a single value (e.g. 1.4382).

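from datetime import date
import pandas as pd
# assumed to be scikit-learn's implementation; the post does not show the import
from sklearn.metrics import mean_absolute_percentage_error

def gr_mape_val(pd_sample_df, result_df):
    # result_df is a Spark DataFrame, pd_sample_df is already pandas
    result_df = result_df.toPandas()
    actuals_pd = pd_sample_df[pd_sample_df['ds'] < date(2022, 3, 19)]['y']
    predicted_pd = result_df[result_df['ds'] < pd.to_datetime('2022-03-19')]['yhat']
    mape = mean_absolute_percentage_error(actuals_pd, predicted_pd)
    return mape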

To compute it per ID with groupBy, I need the two counts above to match, but I cannot figure out how.


寒尘 2025-01-29 23:51:27

I just found out what was going on:
make_future_dataframe with periods=30 creates 30 extra data points for each ID, which changes the total row count of predicted_pd.
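The arithmetic can be checked without Prophet. The sketch below is a simplified stand-in, not Prophet's own code: the helper `future_dates` is hypothetical and only mimics the row count that `make_future_dataframe(periods=30, include_history=True)` produces, and the history end date of 2022-03-18 is an assumption based on the cutoff used in the question.

```python
import pandas as pd


def future_dates(history_ds: pd.Series, periods: int,
                 include_history: bool = True) -> pd.DataFrame:
    """Hypothetical stand-in: history dates plus `periods` daily dates after the last one."""
    last = history_ds.max()
    # date_range includes `last` itself, so ask for one extra date and skip it
    extra = pd.Series(pd.date_range(start=last, periods=periods + 1, freq="D")[1:])
    if include_history:
        ds = pd.concat([history_ds, extra], ignore_index=True)
    else:
        ds = extra
    return pd.DataFrame({"ds": ds})


# Assumed history: 262 consecutive days ending 2022-03-18, as for one ID.
history = pd.Series(pd.date_range(end="2022-03-18", periods=262, freq="D"))

with_history = future_dates(history, periods=30)
only_future = future_dates(history, periods=30, include_history=False)

print(len(history), len(with_history), len(only_future))  # 262 292 30
```

So each group returned from applyInPandas carries its 262 history dates plus 30 future dates, and the future dates have no actual y to compare against.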

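This can be solved by joining the actuals and the forecasts on ID and ds and then calling df.na.drop(), which drops the 30 future rows because they have no actual y value: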

pd_sample_df.join(result_df, on=['ID', 'ds'], how='outer').na.drop()
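Once the rows line up, the per-ID MAPE the question asks for can be computed on the joined data. The following is a minimal pandas sketch, assuming both frames have already been collected to pandas (as gr_mape_val does with toPandas). The function `mape_per_id` and the toy data are hypothetical, and the MAPE is computed by hand rather than with the library function, so actual values of zero would produce infinite errors here.

```python
import pandas as pd


def mape_per_id(actuals: pd.DataFrame, forecasts: pd.DataFrame) -> pd.Series:
    """Return one MAPE value per ID, using only dates present in both frames."""
    # An inner merge keeps only (ID, ds) pairs that have both y and yhat,
    # so forecast rows for future dates are left out.
    merged = actuals.merge(forecasts, on=["ID", "ds"], how="inner")
    ape = (merged["y"] - merged["yhat"]).abs() / merged["y"].abs()
    return ape.groupby(merged["ID"]).mean().rename("mape")


# Hypothetical toy data: 3 history days per ID, forecasts cover 2 extra days.
hist_days = pd.date_range("2022-03-16", periods=3, freq="D")
all_days = pd.date_range("2022-03-16", periods=5, freq="D")
actuals = pd.DataFrame({
    "ID": [1] * 3 + [2] * 3,
    "ds": list(hist_days) * 2,
    "y": [100.0, 200.0, 400.0, 10.0, 20.0, 40.0],
})
forecasts = pd.DataFrame({
    "ID": [1] * 5 + [2] * 5,
    "ds": list(all_days) * 2,
    "yhat": [110.0, 180.0, 400.0, 500.0, 600.0, 10.0, 20.0, 50.0, 60.0, 70.0],
})

per_id = mape_per_id(actuals, forecasts)
print(per_id)
```

Merging on both ID and ds also aligns each actual with its own forecast, which comparing two separately filtered columns does not guarantee.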

About the author: 兮子