Databricks cluster stage not getting started
spark.sql(
s"""
select
nb.*,
facl_bed.hsid as facl_decn_bed_hsid,
facl_bed.bed_day_decn_seq_nbr as facl_decn_bednbr,
case when facl_bed.hsid is not null then concat(substr(nb.cse_dt, 1, 10), ' 00:00:00.000') else cast(null as string) end as en_dt
from nobed nb
left outer join decn_bed_interim facl_bed on
(nb.hsid=facl_bed.hsid and nb.facl_decn_hsid=facl_bed.hsid)
where nb.facl_decn_hsc_id is not null
union all
select
nb.*,
cast(null as int) as facl_decn_bed_hsid,
cast(null as int) as facl_decn_bednbr,
cast(null as string) as en_dt
from nobed nb
where nb.facl_decn_hsid is null
""")
I was executing the above snippet on a Databricks Spark cluster, and it was taking a very long time to complete.
My tables contain very large volumes of data.
From the DAG I can see that most of the time is spent in the union all.
In particular, Stages 205 and 206 are taking a very long time to complete.
What could be the reason for this, and how can I solve it?
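For reference, the two things the stage timing usually points to here can be checked directly from the notebook: the physical plan of the query, and the distribution of the join key. Below is a minimal diagnostic sketch, assuming the SQL string above is bound to a variable named query and that nobed and decn_bed_interim are accessible under those names:

// Minimal diagnostic sketch (assumptions: the SQL text above is stored in a
// variable named `query`, and nobed / decn_bed_interim exist as temp views or tables).

// 1) Inspect the physical plan to see which operator the slow stages map to
//    (e.g. SortMergeJoin vs. BroadcastHashJoin, and where the shuffles happen).
val df = spark.sql(query)
df.explain(true)

// 2) Check the join-key distribution for skew: a handful of hsid values carrying
//    most of the rows would explain one or two stages dominating the runtime.
spark.sql("select hsid, count(*) as cnt from nobed group by hsid order by cnt desc limit 20").show(false)
spark.sql("select hsid, count(*) as cnt from decn_bed_interim group by hsid order by cnt desc limit 20").show(false)

Depending on what the plan and the key counts show, the slow stages would typically point either to a skewed hsid key in the join or to the fact that nobed is scanned once in each branch of the union all.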