How to do several consecutive intersects?
I have a SQL query like below:
select col4, col5 from TableA where col1 = 'x'
intersect
select col4, col5 from TableA where col1 = 'y'
intersect
select col4, col5 from TableA where col1 = 'z'
How can I convert this SQL to its PySpark equivalent?
I can create 3 DataFrames and then intersect them like below:
df1 ==> select col4, col5 from TableA where col1 = 'x'
df2 ==> select col4, col5 from TableA where col1 = 'y'
df3 ==> select col4, col5 from TableA where col1 = 'z'
df_result = df1.intersect(df2)
df_result = df_result.intersect(df3)
But I feel that's not a good approach to follow if I had more intersect queries.
Also, let's say [x, y, z] is dynamic, meaning it could grow like [x, y, z, a, b, ...]
Any suggestion?
If you want to do several consecutive intersects, Python's functools.reduce is available. Put all your DataFrames in one list and it will intersect them consecutively:
But it would be inefficient in your case.
Since all the data comes from the same DataFrame, I would suggest a rework. Instead of splitting the DataFrame and then rejoining the pieces using intersect, do an aggregation and filtering.
Script (Spark 3.1):
Test: