Replacing or optimizing a join in Spark SQL



I have this code:

from pyspark.sql import functions as F

# Flag rows where WOW == 0 and count them per Filename
df = dataframe_input.withColumn('status_flights', F.when(F.col('WOW') == 0, 1).otherwise(0))
df = df.groupBy('Filename').agg(F.sum('status_flights').alias('status_flights'))
# Join the counts back and keep only Filenames with at least one WOW == 0 row
dataframe_input = dataframe_input.drop('status_flights').join(df, ['Filename'], 'left')
dataframe_input = dataframe_input.filter(F.col('status_flights') > 0)

The join here is not optimized. Is there any way to replace it, given that we are joining the dataframe with itself (after a small enrichment)?


Comments (1)

可爱咩 2025-02-16 03:49:38


This has been answered here:

It depends on the data. More specifically, it depends on the cardinality of the Filename column. If the cardinality is small, the data will be small after the aggregation and the aggregated result can be broadcast in the join. In that case, the join will be faster than the window. On the other hand, if the cardinality is big and the data is large after the aggregation, the join will be planned as a SortMergeJoin and using a window will be more efficient.

In the case of the window we have one total shuffle plus one sort. In the case of SortMergeJoin we have the same in the left branch (total shuffle + sort), plus an additional reduced shuffle and sort in the right branch (by reduced I mean that the data is aggregated first). In the right branch of the join we also have an additional scan over the data.

Also, you can check my video from the Spark Summit where I analyze a similar example.
(answer by David Vrba)
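
For reference, here is a minimal sketch of the window-based alternative the answer describes, reusing the column names from the question (this is an illustration, not the answer's exact code):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Compute the per-Filename count of WOW == 0 rows with a window instead of
# an aggregate + self-join, then keep only Filenames where that count is > 0.
w = Window.partitionBy('Filename')
dataframe_input = (
    dataframe_input
    .withColumn('status_flights',
                F.sum(F.when(F.col('WOW') == 0, 1).otherwise(0)).over(w))
    .filter(F.col('status_flights') > 0)
)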

Something that is covered in the video is looking at the Spark plans. You can do this by calling .explain() on the query you are running to see what it is actually doing. It can take some time to learn how to read the output, but it is really valuable if you want to learn to optimize. In general the guidance is: the fewer shuffles you do, the faster your code will run. If you can turn a shuffle into a map-side join, you will run faster (this is highly dependent on your data fitting into memory).
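
As a rough illustration of both points (the DataFrame names follow the question's code; broadcast() is the standard PySpark hint):

from pyspark.sql.functions import broadcast

# Print the physical plan to see whether Spark chose BroadcastHashJoin
# (map-side) or SortMergeJoin for the aggregated side.
dataframe_input.join(df, ['Filename'], 'left').explain()

# If the aggregated result is small enough to fit in memory, hint Spark to
# broadcast it so the big side of the join avoids an extra shuffle.
dataframe_input.join(broadcast(df), ['Filename'], 'left').explain()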

One thing that isn't discussed in the answer above is that if you will be running this report regularly, there may be value in materializing the groupBy to make it run faster. This requires additional work on insert, but it will help you squeeze out all the performance you can get from the table. In general, the more data you can pre-chew into a useful reporting format, the faster your query will run.
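
As a hedged sketch of that idea (the table name below is a placeholder, not something from the original post):

from pyspark.sql import functions as F

# Persist the per-Filename aggregation once so later reports can reuse it
# instead of recomputing the groupBy on every run.
agg = (dataframe_input
       .withColumn('status_flights', F.when(F.col('WOW') == 0, 1).otherwise(0))
       .groupBy('Filename')
       .agg(F.sum('status_flights').alias('status_flights')))
agg.write.mode('overwrite').saveAsTable('status_flights_by_filename')  # placeholder name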
