Databricks cluster stage not getting started
spark.sql(
s"""
select
nb.*,
facl_bed.hsid as facl_decn_bed_hsid,
facl_bed.bed_day_decn_seq_nbr as facl_decn_bednbr,
case when facl_bed.hsid is not null then concat(substr(nb.cse_dt, 1, 10), ' 00:00:00.000') else cast(null as string) end as en_dt
from nobed nb
left outer join decn_bed_interim facl_bed on
(nb.hsid=facl_bed.hsid and nb.facl_decn_hsid=facl_bed.hsid)
where nb.facl_decn_hsc_id is not null
union all
select
nb.*,
cast(null as int) as facl_decn_bed_hsid,
cast(null as int) as facl_decn_bednbr,
cast(null as string) as en_dt
from nobed nb
where nb.facl_decn_hsid is null
""")
I was executing the above snippet on a Databricks Spark cluster, and it was taking a very long time to complete.
My tables contain very large volumes of data.
From the DAG I can see that most of the time is spent in the union all.
In particular, Stages 205 and 206 are taking a very long time to complete.
What could be the reason for this, and how can I solve it?
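For reference, the two things the stage timing usually points to here can be checked directly from the notebook: the physical plan of the query, and the distribution of the join key. Below is a minimal diagnostic sketch, assuming the SQL string above is bound to a variable named query and that nobed and decn_bed_interim are accessible under those names:

// Minimal diagnostic sketch (assumptions: the SQL text above is stored in a
// variable named `query`, and nobed / decn_bed_interim exist as temp views or tables).

// 1) Inspect the physical plan to see which operator the slow stages map to
//    (e.g. SortMergeJoin vs. BroadcastHashJoin, and where the shuffles happen).
val df = spark.sql(query)
df.explain(true)

// 2) Check the join-key distribution for skew: a handful of hsid values carrying
//    most of the rows would explain one or two stages dominating the runtime.
spark.sql("select hsid, count(*) as cnt from nobed group by hsid order by cnt desc limit 20").show(false)
spark.sql("select hsid, count(*) as cnt from decn_bed_interim group by hsid order by cnt desc limit 20").show(false)

Depending on what the plan and the key counts show, the slow stages would typically point either to a skewed hsid key in the join or to the fact that nobed is scanned once in each branch of the union all.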