Multiple aggregations over multiple columns

Posted 2025-01-30 15:13:08


I want to write a UDF over a data frame that compares the values of a particular row against the values from the same group, where the grouping is by multiple keys. As UDFs operate on a single row, I want to write a query that returns the values from the same group as a new column value.

For example, over this input:

| id | categoryAB | categoryXY | value1 | value2 |
|----|------------|------------|--------|--------|
| 1  | A          | X          | 0.2    | True   |
| 2  | A          | X          | 0.3    | False  |
| 3  | A          | X          | 0.2    | True   |
| 4  | B          | X          | 0.4    | True   |
| 5  | B          | X          | 0.1    | True   |
| 6  | B          | Y          | 0.5    | False  |

I can add

  • group1: aggregation of value1s from the same <categoryAB, categoryXY> group
  • group2: aggregation of value2s from the same <categoryAB, categoryXY> group, i.e. the same grouping.

Expected result:

| id | categoryAB | categoryXY | value1 | value2 | group1          | group2              |
|----|------------|------------|--------|--------|-----------------|---------------------|
| 1  | A          | X          | 0.2    | True   | [0.2, 0.3, 0.2] | [True, False, True] |
| 2  | A          | X          | 0.3    | False  | [0.2, 0.3, 0.2] | [True, False, True] |
| 3  | A          | X          | 0.2    | True   | [0.4, 0.1]      | [True, True]        |
| 4  | B          | X          | 0.4    | True   | [0.4, 0.1]      | [True, True]        |
| 5  | B          | X          | 0.1    | True   | [0.4, 0.1]      | [True, True]        |
| 6  | B          | Y          | 0.5    | False  | [0.5]           | [False]             |

To be clearer about the grouping, there are 3 groups in this example:

  • <A,X> with rows 1, 2 and 3
  • <B,X> with rows 4 and 5
  • <B,Y> with row 6

I need to implement it in Scala with Spark SQL structures and functions, but a generic SQL answer could also be a guide.


Comments (1)

jJeQQOZ5 2025-02-06 15:13:08


There might be a more optimized method, but here is how I usually do it:

// In spark-shell these are already in scope; in an application, import
// the implicits from your SparkSession instance (e.g. spark.implicits._).
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

val df = Seq(
  (1, "A", "X", 0.2, true),
  (2, "A", "X", 0.3, false),
  (3, "A", "X", 0.2, true),
  (4, "B", "X", 0.4, true),
  (5, "B", "X", 0.1, true),
  (6, "B", "Y", 0.5, false)
).toDF("id", "categoryAB", "categoryXY", "value1", "value2")

df.join(
  df.groupBy("categoryAB", "categoryXY")
    .agg(
      collect_list('value1) as "group1",
      collect_list('value2) as "group2"
    ),
  Seq("categoryAB", "categoryXY")
).show()

The idea is that I compute the aggregation separately, grouped by categoryAB and categoryXY, and then join the resulting dataframe back to the original one (make sure that df is cached if it is the result of heavy computations, as otherwise it will be computed twice).
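For completeness, the same columns can also be added without the self-join, by computing `collect_list` over a window partitioned by the two keys. This is a sketch against the `df` defined above (Spark 2.0+ allows `collect_list` over a window); note that, as with `groupBy`, the order of elements inside the collected arrays is not guaranteed:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// Partition by the two grouping keys. With no ordering and no frame clause,
// each window spans its whole partition, so every row sees its full group.
val byCategory = Window.partitionBy("categoryAB", "categoryXY")

val result = df
  .withColumn("group1", collect_list($"value1").over(byCategory))
  .withColumn("group2", collect_list($"value2").over(byCategory))

result.show(truncate = false)
```

This avoids scanning `df` twice, so no caching is needed; the trade-off is that window functions shuffle the data by the partition keys just as the join-based version does.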
