Multiple aggregations over multiple columns
I want to write a UDF over a data frame that compares the values of a particular row against the values from the same group, where the grouping is by multiple keys. Since UDFs operate on a single row, I want to write a query that returns the values from the same group as a new column value.
For example, given this input:
id | categoryAB | categoryXY | value1 | value2 |
---|---|---|---|---|
1 | A | X | 0.2 | True |
2 | A | X | 0.3 | False |
3 | A | X | 0.2 | True |
4 | B | X | 0.4 | True |
5 | B | X | 0.1 | True |
6 | B | Y | 0.5 | False |
I can add
- group1: aggregation of the value1s from the same <categoryAB, categoryXY> group
- group2: aggregation of the value2s from the same <categoryAB, categoryXY> group, i.e. the same grouping.
Expected result:
id | categoryAB | categoryXY | value1 | value2 | group1 | group2 |
---|---|---|---|---|---|---|
1 | A | X | 0.2 | True | [0.2, 0.3, 0.2] | [True, False, True] |
2 | A | X | 0.3 | False | [0.2, 0.3, 0.2] | [True, False, True] |
3 | A | X | 0.2 | True | [0.2, 0.3, 0.2] | [True, False, True] |
4 | B | X | 0.4 | True | [0.4, 0.1] | [True, True] |
5 | B | X | 0.1 | True | [0.4, 0.1] | [True, True] |
6 | B | Y | 0.5 | False | [0.5] | [False] |
To be more clear about grouping, there are 3 groups in this example
- <A,X> with rows 1, 2 and 3
- <B,X> with rows 4 and 5
- <B,Y> with row 6
I need to implement it in Scala with Spark SQL structures and functions but a generic SQL answer could be guiding.
Answers (1)
There might be a more optimized method, but here is how I usually do it:
The idea is to compute the aggregations separately, grouped by `categoryAB` and `categoryXY`, and then join the new dataframe back to the original one (make sure that `df` is cached if it is the result of heavy computations, as otherwise it will be computed twice).