Multiple aggregations over multiple columns

Posted 2025-01-30 15:13:08


I want to write a UDF over a data frame that compares the values of a particular row against the values from the same group, where the grouping is by multiple keys. As UDFs operate on a single row, I want to write a query that returns the values from the same group as a new column value.

For example, over this input:

| id | categoryAB | categoryXY | value1 | value2 |
|----|------------|------------|--------|--------|
| 1  | A          | X          | 0.2    | True   |
| 2  | A          | X          | 0.3    | False  |
| 3  | A          | X          | 0.2    | True   |
| 4  | B          | X          | 0.4    | True   |
| 5  | B          | X          | 0.1    | True   |
| 6  | B          | Y          | 0.5    | False  |

I can add

  • group1: aggregation of value1s from the same <categoryAB, categoryXY> group
  • group2: aggregation of value2s from the same <categoryAB, categoryXY> group, i.e. the same grouping.

Expected result:

| id | categoryAB | categoryXY | value1 | value2 | group1          | group2              |
|----|------------|------------|--------|--------|-----------------|---------------------|
| 1  | A          | X          | 0.2    | True   | [0.2, 0.3, 0.2] | [True, False, True] |
| 2  | A          | X          | 0.3    | False  | [0.2, 0.3, 0.2] | [True, False, True] |
| 3  | A          | X          | 0.2    | True   | [0.4, 0.1]      | [True, True]        |
| 4  | B          | X          | 0.4    | True   | [0.4, 0.1]      | [True, True]        |
| 5  | B          | X          | 0.1    | True   | [0.4, 0.1]      | [True, True]        |
| 6  | B          | Y          | 0.5    | False  | [0.5]           | [False]             |

To be clearer about the grouping, there are 3 groups in this example:

  • <A,X> with rows 1, 2 and 3
  • <B,X> with rows 4 and 5
  • <B,Y> with row 6

I need to implement it in Scala with Spark SQL structures and functions, but a generic SQL answer could also be a guide.


Comments (1)

jJeQQOZ5 2025-02-06 15:13:08


There might be a more optimized method, but here is how I usually do it:

// In spark-shell these are already in scope; in an application, import
// the implicits from your SparkSession instance (e.g. spark.implicits._).
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

val df = Seq(
  (1, "A", "X", 0.2, true),
  (2, "A", "X", 0.3, false),
  (3, "A", "X", 0.2, true),
  (4, "B", "X", 0.4, true),
  (5, "B", "X", 0.1, true),
  (6, "B", "Y", 0.5, false)
).toDF("id", "categoryAB", "categoryXY", "value1", "value2")

df.join(
  df.groupBy("categoryAB", "categoryXY")
    .agg(
      collect_list('value1) as "group1",
      collect_list('value2) as "group2"
    ),
  Seq("categoryAB", "categoryXY")
).show()

The idea is that I compute the aggregation separately, grouped by categoryAB and categoryXY, and then join the resulting dataframe back to the original one (make sure that df is cached if it is the result of heavy computations, as otherwise it will be computed twice).
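For completeness, the same columns can also be added without the self-join, by computing `collect_list` over a window partitioned by the two keys. This is a sketch against the `df` defined above (Spark 2.0+ allows `collect_list` over a window); note that, as with `groupBy`, the order of elements inside the collected arrays is not guaranteed:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// Partition by the two grouping keys. With no ordering and no frame clause,
// each window spans its whole partition, so every row sees its full group.
val byCategory = Window.partitionBy("categoryAB", "categoryXY")

val result = df
  .withColumn("group1", collect_list($"value1").over(byCategory))
  .withColumn("group2", collect_list($"value2").over(byCategory))

result.show(truncate = false)
```

This avoids scanning `df` twice, so no caching is needed; the trade-off is that window functions shuffle the data by the partition keys just as the join-based version does.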
