Count the number of working days between start/end date columns
I have two Dataframes:
- facts:
  - columns: data, start_date and end_date
- holidays:
  - column: holiday_date

What I want is a way to produce another Dataframe that has columns data, start_date, end_date and num_holidays, where num_holidays is computed as: the number of days between start and end that are not weekends or holidays (as in the holidays table).
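For concreteness, here is a minimal pair of inputs matching that shape (the values and the names df_facts / df_holidays are made up for illustration):

from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# facts: one row per (data, start_date, end_date)
df_facts = spark.createDataFrame(
    [("a", date(2021, 1, 1), date(2021, 1, 7)),
     ("b", date(2021, 1, 4), date(2021, 1, 11))],
    ["data", "start_date", "end_date"],
)

# holidays: one row per holiday date
df_holidays = spark.createDataFrame(
    [(date(2021, 1, 1),)], ["holiday_date"]
)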
The solution is here if we wanted to do this in SQL Server's T-SQL. The crux is this part of the code:
-- Calculate and return the number of workdays using the input parameters.
-- This is the meat of the function.
-- This is really just one formula with a couple of parts that are listed
-- on separate lines for documentation purposes.
RETURN (
    SELECT
        -- Start with the total number of days, including weekends
        (DATEDIFF(dd, @StartDate, @EndDate) + 1)
        -- Subtract 2 days for each full weekend
        - (DATEDIFF(wk, @StartDate, @EndDate) * 2)
        -- If StartDate is a Sunday, subtract 1
        - (CASE WHEN DATENAME(dw, @StartDate) = 'Sunday'
                THEN 1
                ELSE 0
           END)
        -- If EndDate is a Saturday, subtract 1
        - (CASE WHEN DATENAME(dw, @EndDate) = 'Saturday'
                THEN 1
                ELSE 0
           END)
        -- Subtract all holidays
        - (SELECT COUNT(*) FROM [dbo].[tblHolidays]
           WHERE [HolDate] BETWEEN @StartDate AND @EndDate)
)
END
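For intuition, here is a day-by-day Python equivalent of that formula. This is a sketch (brute force rather than the closed-form arithmetic above), and the example dates and holiday set are made up:

from datetime import date, timedelta

def workdays(start, end, holidays):
    """Count days in [start, end] that are neither Saturday/Sunday
    nor listed in `holidays` (a set of dates)."""
    total = (end - start).days + 1
    all_days = (start + timedelta(d) for d in range(total))
    # weekday(): Monday == 0 ... Sunday == 6; 5 and 6 are the weekend
    return sum(1 for d in all_days
               if d.weekday() < 5 and d not in holidays)

# First week of 2021: Fri Jan 1 is a holiday, Jan 2-3 are the weekend,
# leaving Mon-Thu as the 4 workdays.
print(workdays(date(2021, 1, 1), date(2021, 1, 7), {date(2021, 1, 1)}))  # 4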
I'm new to PySpark and was wondering what's an efficient way to do this. I can post the UDF I'm writing if it helps, though I'm going slow because I feel it's the wrong thing to do:
- Is there a better way than creating a UDF that reads the holidays table into a Dataframe and joins with it to count the holidays? Can I even join inside a UDF?
- Is there a way to write a pandas_udf instead? Would it be fast enough?
- Are there some optimizations I can apply, like caching the holidays table somehow on every worker?
Something like this may work:
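A minimal sketch of that no-udf approach, assuming the df_facts and df_holidays shapes from the question (F.sequence requires Spark 2.4+; the variable names are illustrative):

from pyspark.sql import functions as F

result = (
    df_facts
    # one row per calendar day in [start_date, end_date]
    .withColumn("day", F.explode(F.sequence("start_date", "end_date")))
    # dayofweek(): 1 = Sunday ... 7 = Saturday; keep Monday-Friday only
    .filter(~F.dayofweek("day").isin(1, 7))
    # drop days that appear in the (broadcast) holidays table
    .join(F.broadcast(df_holidays),
          F.col("day") == F.col("holiday_date"),
          "left_anti")
    # count the surviving workdays per original row
    .groupBy("data", "start_date", "end_date")
    .agg(F.count("day").alias("num_holidays"))
)

One caveat: a row whose whole range is weekends and holidays drops out of result entirely; if you need such rows with a 0, left-join df_facts back onto result and fill the nulls.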
Answers:
- This method does not use a udf, so it should perform better.
- A pandas_udf performs better than a regular udf, but the no-udf approach should be better still.
- The Spark engine performs optimizations itself; there are only some relatively rare cases where you may need to help it. In the answer I have used F.broadcast(df_holidays). broadcast sends the dataframe to all of the workers, but I am fairly sure the table would be broadcast to the workers automatically anyway, as it is supposed to be very small.
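As a usage note on that last point: Spark broadcasts a join side automatically when its estimated size is under spark.sql.autoBroadcastJoinThreshold (10 MB by default), so the explicit F.broadcast() hint above mostly documents intent:

# Raise or lower the auto-broadcast cutoff if size estimation gets it wrong;
# set it to -1 to disable automatic broadcasting entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))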