Counting the number of workdays between start/end date columns

Published 2025-01-26 23:08:35

I have two Dataframes

  • facts:
    • columns: data, start_date and end_date
  • holidays:
    • column: holiday_date

What I want is a way to produce another Dataframe that has columns:
data, start_date, end_date and num_holidays

Where num_holidays is computed as: Number of days between start and end that are not weekends or holidays (as in the holidays table).

The solution is here if we wanted to do this in T-SQL. The crux is this part of the code:

    --Calculate and return the number of workdays using the input parameters.
    --This is the meat of the function.
    --This is really just one formula with a couple of parts that are listed on separate lines for documentation purposes.
    RETURN (
        SELECT
        --Start with total number of days including weekends
        (DATEDIFF(dd,@StartDate, @EndDate)+1)
        --Subtract 2 days for each full weekend
        -(DATEDIFF(wk,@StartDate, @EndDate)*2)
        --If StartDate is a Sunday, Subtract 1
        -(CASE WHEN DATENAME(dw, @StartDate) = 'Sunday'
            THEN 1
            ELSE 0
        END)
        --If EndDate is a Saturday, Subtract 1
        -(CASE WHEN DATENAME(dw, @EndDate) = 'Saturday'
            THEN 1
            ELSE 0
        END)
        --Subtract all holidays
        -(Select Count(*) from [dbo].[tblHolidays]
          where  [HolDate] between @StartDate and @EndDate )
        )
    END

I'm new to PySpark and was wondering what the efficient way to do this is. I can post the UDF I'm writing if it helps, though I'm going slowly because I feel it's the wrong approach:

  • Is there a better way than creating a UDF that reads the holidays table into a DataFrame and joins with it to count the holidays? Can I even join inside a UDF?
  • Is there a way to write a pandas_udf instead? Would it be fast enough?
  • Are there some optimizations I can apply, like caching the holidays table somehow on every worker?

Comments (1)

半暖夏伤 2025-02-02 23:08:35

Something like this may work:

from pyspark.sql import functions as F

df_facts = spark.createDataFrame(
    [('data1', '2022-05-08', '2022-05-14'),
     ('data1', '2022-05-08', '2022-05-21')],
    ['data', 'start_date', 'end_date']
)
df_holidays = spark.createDataFrame([('2022-05-10',)], ['holiday_date'])

# One row per calendar day in the inclusive [start_date, end_date] range
df = df_facts.withColumn('exploded', F.explode(F.sequence(F.to_date('start_date'), F.to_date('end_date'))))
# Drop weekends (dayofweek: 1 = Sunday, 7 = Saturday)
df = df.filter(~F.dayofweek('exploded').isin([1, 7]))
# Drop holidays with an anti join against the (broadcast) holidays table
df = df.join(F.broadcast(df_holidays), df.exploded == df_holidays.holiday_date, 'anti')
# Count the surviving business days per original fact row
df = df.groupBy('data', 'start_date', 'end_date').agg(F.count('exploded').alias('business_days'))

df.show()
# +-----+----------+----------+-------------+
# | data|start_date|  end_date|business_days|
# +-----+----------+----------+-------------+
# |data1|2022-05-08|2022-05-14|            4|
# |data1|2022-05-08|2022-05-21|            9|
# +-----+----------+----------+-------------+
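
For comparison, the date-arithmetic idea from the T-SQL snippet in the question can also be expressed purely with Spark column functions, so no row-per-day explosion is needed. The sketch below is illustrative rather than definitive: it assumes the same df_facts / df_holidays frames as above, and occurrences is just a made-up helper name.

from pyspark.sql import functions as F

def occurrences(dow_num, dow_abbr, start, end):
    # Number of times a given weekday (1 = Sunday ... 7 = Saturday) occurs in [start, end]
    first = (F.when(F.dayofweek(start) == dow_num, start)
              .otherwise(F.next_day(start, dow_abbr)))
    return (F.when(first <= end, F.floor(F.datediff(end, first) / 7) + 1)
             .otherwise(F.lit(0)))

facts = (df_facts
         .withColumn('s', F.to_date('start_date'))
         .withColumn('e', F.to_date('end_date')))

# Only weekday holidays matter; weekend holidays are already covered by the weekend subtraction
weekday_hols = (df_holidays
                .select(F.to_date('holiday_date').alias('h'))
                .filter(~F.dayofweek('h').isin([1, 7])))

result = (facts
          .join(F.broadcast(weekday_hols), F.col('h').between(F.col('s'), F.col('e')), 'left')
          .groupBy('data', 'start_date', 'end_date', 's', 'e')
          .agg(F.count('h').alias('hols'))
          .withColumn('business_days',
                      F.datediff('e', 's') + 1                         # total days, inclusive
                      - occurrences(7, 'Sat', F.col('s'), F.col('e'))  # minus Saturdays
                      - occurrences(1, 'Sun', F.col('s'), F.col('e'))  # minus Sundays
                      - F.col('hols'))                                 # minus weekday holidays
          .drop('s', 'e', 'hols'))

On the toy data this should give the same 4 and 9. Which variant is faster mainly depends on how wide the date ranges are, since explode materializes one row per calendar day.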

Answers:

Is there a better way than creating a UDF...?

This method does not use a UDF, so it should perform better.

Is there a way to write a pandas_udf instead? Would it be fast enough?

A pandas_udf performs better than a regular udf, but a no-UDF approach like the one above should be better still.
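
For what it's worth, a pandas_udf version could look roughly like the sketch below. It is an illustration under assumptions, not a recommendation: it presumes the holiday table is small enough to collect to the driver and that the date columns are ISO-formatted strings as in the example above, and it delegates the weekday/holiday arithmetic to numpy.busday_count.

import numpy as np
import pandas as pd
from pyspark.sql import functions as F

# Collected once on the driver; the closure is shipped to the executors with the UDF
holidays = np.array([r.holiday_date for r in df_holidays.collect()], dtype='datetime64[D]')

@F.pandas_udf('long')
def business_days(start: pd.Series, end: pd.Series) -> pd.Series:
    begin = start.to_numpy().astype('datetime64[D]')
    stop = end.to_numpy().astype('datetime64[D]') + np.timedelta64(1, 'D')  # make the end date inclusive
    # busday_count skips Saturdays, Sundays and the supplied holidays
    return pd.Series(np.busday_count(begin, stop, holidays=holidays))

df_facts.withColumn('business_days', business_days('start_date', 'end_date')).show()

Shipping the collected holidays inside the UDF closure is effectively the "cache the holidays table on every worker" idea from the third question.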

Are there some optimizations I can apply, like caching the holidays table somehow on every worker?

The Spark engine performs optimizations on its own. However, there are some relatively rare cases where you can help it. In the answer I used F.broadcast(df_holidays); the broadcast hint sends the DataFrame to all of the workers. That said, a table this small would most likely be broadcast automatically anyway.
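
If you want to check or influence that automatic broadcast, the relevant knob is spark.sql.autoBroadcastJoinThreshold, and the chosen join strategy shows up in the query plan. A minimal illustration, assuming the same spark session as above:

# Tables below this size (default 10 MB) are typically broadcast automatically in equi-joins;
# the explicit F.broadcast(...) hint just forces the same behaviour.
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', 10 * 1024 * 1024)

# Look for BroadcastHashJoin (or BroadcastNestedLoopJoin) in the output
df.explain()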
