I want to get the max of a column based on two other columns, and for the fourth column, the most frequent value
I've got this dataframe:
df1 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 7.3, 8),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 8),
('e', 'f', 6.0, 3),
('e', 'f', 6.0, 8),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 3),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df1.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|7.3| 8|
| c| d|7.3| 2|
| c| d|7.3| 8|
| e| f|6.0| 3|
| e| f|6.0| 8|
| e| f|6.0| 3|
| c| j|4.2| 3|
| c| j|4.3| 9|
+---+---+---+---+
I did this to get the max of c for each pair (a, b):
df2 = df1.groupBy('a', 'b').agg(F.max('c').alias('c_max')).select(
F.col('a'),
F.col('b'),
F.col('c_max').alias('c')
)
df2.show()
+---+---+---+
| a| b| c|
+---+---+---+
| e| f|6.0|
| c| d|7.3|
| c| j|4.3|
+---+---+---+
But now I need to get the values of d, which should be:
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|7.3| 8|
| e| f|6.0| 3|
| c| j|4.3| 9|
+---+---+---+---+
I tried to do an inner join between df1 and df2, but that didn't work:
condition = [df1.a == df2.a, df1.b == df2.b, df1.c == df2.c]
df3 = df1.join(df2,condition,"inner")
df3.show()
+---+---+---+---+---+---+---+
| a| b| c| d| a| b| c|
+---+---+---+---+---+---+---+
| c| d|7.3| 8| c| d|7.3|
| c| d|7.3| 8| c| d|7.3|
| c| d|7.3| 2| c| d|7.3|
| e| f|6.0| 3| e| f|6.0|
| e| f|6.0| 8| e| f|6.0|
| e| f|6.0| 3| e| f|6.0|
| c| j|4.3| 9| c| j|4.3|
+---+---+---+---+---+---+---+
I'm a beginner in PySpark, so I need a little help to figure this out.
2 Answers
You can "zip" d and the count of d, and aggregate as usual to keep the frequency. Now joining your df2 with this new df3 will give your desired output.
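A minimal sketch of that approach, assuming df1 and df2 from the question and that pyspark.sql.functions is imported as F (the df3 name follows the answer's wording; d_count and zipped are illustrative):

from pyspark.sql import functions as F

# Count how often each d occurs within every (a, b, c) group, then
# "zip" the count with d into a struct so that max() keeps the most
# frequent d (struct fields compare left to right, count first).
df3 = (
    df1.groupBy('a', 'b', 'c', 'd')
       .agg(F.count('d').alias('d_count'))
       .groupBy('a', 'b', 'c')
       .agg(F.max(F.struct('d_count', 'd')).alias('zipped'))
       .select('a', 'b', 'c', F.col('zipped.d').alias('d'))
)

# Joining on the column names avoids the duplicated a/b/c columns
# seen in the question's join attempt.
df2.join(df3, ['a', 'b', 'c'], 'inner').show()

On the sample data this should reproduce the desired table; putting the count first in the struct is what makes max() compare by frequency.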
You can first count the frequency and assign an order value by sorting in descending order. Then, get the first value where the order is 1.
This does not deal with tie-breaking: if there is a tie in the top frequency, it will pick one arbitrarily (non-deterministic).
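A sketch of that window-based idea, again assuming df1 and df2 from the question (freq, w, top_d and the order column are illustrative names):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frequency of each d within every (a, b, c) group.
freq = df1.groupBy('a', 'b', 'c', 'd').agg(F.count('d').alias('d_count'))

# Rank the frequencies per group in descending order and keep the top row.
w = Window.partitionBy('a', 'b', 'c').orderBy(F.desc('d_count'))
top_d = (
    freq.withColumn('order', F.row_number().over(w))
        .filter(F.col('order') == 1)
        .select('a', 'b', 'c', 'd')
)

df2.join(top_d, ['a', 'b', 'c'], 'inner').show()

Because the window orders only by d_count, ties in the top frequency are broken arbitrarily, exactly as warned above; adding a second sort key such as F.desc('d') would make the pick deterministic.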