Left join with a condition and max aggregation using Spark Python / PySpark
What I have: two very large Spark DataFrames. Here are small samples of each.
- Dataframe A:
ID | IG | OpenDate |
---|---|---|
P111 | 100 | 13/04/2022 |
P222 | 101 | 16/04/2022 |
P333 | 102 | 20/04/2022 |
- Dataframe B:
IG | Service | Dt_Service |
---|---|---|
100 | A | 12/04/2022 |
100 | B | 13/04/2022 |
100 | B | 14/04/2022 |
101 | A | 15/04/2022 |
101 | A | 16/04/2022 |
101 | B | 17/04/2022 |
101 | B | 18/04/2022 |
102 | A | 19/04/2022 |
102 | B | 20/04/2022 |
What I want: I want to left join the two columns 'Service' and 'Dt_Service' onto Dataframe A using the key 'IG', keeping only the row with the latest 'Dt_Service' for each 'IG'. In other words, for each row in Dataframe A I need the most recent 'Service' together with its corresponding date. This is the result I expect:
ID | IG | OpenDate | Service | Dt_Service |
---|---|---|---|---|
P111 | 100 | 13/04/2022 | B | 14/04/2022 |
P222 | 101 | 16/04/2022 | B | 18/04/2022 |
P333 | 102 | 20/04/2022 | B | 20/04/2022 |
Tool: Spark 2.2 with PySpark, since I am working on Hadoop.
Thank you for your help.
Comments (1)
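As samkart said, we can use rank/row_number over a window partitioned by 'IG' and ordered by 'Dt_Service' descending to keep only the latest service per 'IG', then left join that result onto Dataframe A to get the desired output.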