Replace or optimize a join in Spark SQL
I have this code
from pyspark.sql import functions as F

# Flag rows where WOW == 0, count them per Filename, then keep only the files that have at least one such row
df = dataframe_input.withColumn('status_flights', F.when(F.col('WOW') == 0, 1).otherwise(0))
df = df.groupBy('Filename').agg(F.sum('status_flights').alias('status_flights'))
dataframe_input = dataframe_input.drop('status_flights').join(df, ['Filename'], 'left')
dataframe_input = dataframe_input.filter(F.col('status_flights') > 0)
The join here is not optimized. Is there any way to replace it, given that we are joining the dataframe with itself (after a small enrichment)?
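One way to avoid this self-join entirely is to compute the per-Filename count with a window aggregation instead of a separate groupBy plus join. This is only a sketch, assuming dataframe_input has the Filename and WOW columns used above:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count the WOW == 0 rows per Filename directly on each row,
# so no second dataframe and no join are needed
w = Window.partitionBy('Filename')
dataframe_input = (
    dataframe_input
    .withColumn('status_flights',
                F.sum(F.when(F.col('WOW') == 0, 1).otherwise(0)).over(w))
    .filter(F.col('status_flights') > 0)
)

The window still shuffles by Filename, but only once, whereas the groupBy-then-join version shuffles for the aggregation and again for the join.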
This has been answered here:

Something that is reviewed in the video is looking at the Spark plans. This can be done by calling .explain() on the query you are running to see what it is actually doing. The output can take some time to learn how to read, but it really is valuable if you want to learn to optimize. In general the guidance is: the fewer shuffles you do, the faster your code will run. If you can change any shuffle join into a map-side join, you will run faster (this is highly dependent on your data fitting into memory).

One thing that isn't discussed in the above article is that, if you will be running this report regularly, there may be value in materializing the groupBy you are doing so that it runs faster. This requires additional work on insert, but it will help you squeeze out all the performance you can get from the table. In general, the more data you can pre-chew into a useful reporting format, the faster your query will run.
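To illustrate those two points, here is a sketch only, assuming the same dataframe_input and df names as in the question and that the aggregated df (one row per Filename) is small enough to fit in executor memory, showing how to inspect the plan and hint a broadcast so the join is executed map-side:

from pyspark.sql import functions as F

# Look at the physical plan to see whether the join shuffles both sides (SortMergeJoin)
# or stays map-side (BroadcastHashJoin)
dataframe_input.drop('status_flights').join(df, ['Filename'], 'left').explain()

# Broadcast the small aggregated side so the join becomes a map-side (broadcast hash) join
dataframe_input = dataframe_input.drop('status_flights').join(F.broadcast(df), ['Filename'], 'left')

Spark will also broadcast automatically when the small side is below spark.sql.autoBroadcastJoinThreshold, so the explicit hint mainly matters when its size estimate is off.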