Count unique combinations of values in selected columns

Posted 2025-02-11 06:23:31


I have a PySpark data frame that looks like this:

----------------------------
id    A    B    C 
id1   on   on   on
id1   on   off  on 
id1   on   on   on 
id1   on   on   on 
id1   on   on   off
-----------------------------

I am looking for a way to find all unique combinations for selected columns and show their count.
The expected output:

----------------------------
id    A    B    C    count
id1   on   on   on   3
id1   on   off  on   1
id1   on   on   off  1
-----------------------------

I see that there is a way to do a similar operation in Pandas, but I need PySpark.

UPD:
Also, please note that a unique combination of columns A and B is not the same as a combination of A, B, C. I want all possible combinations for every selection of columns. Is there a way to achieve this other than grouping by and counting one combination, then another, and so on? There are more than 10 columns.
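
To illustrate, the kind of thing I want to avoid looks roughly like this sketch (the column selection below is just an assumption based on the sample; the real frame has more than 10 columns):

from itertools import combinations

# assumed selection of columns; the real frame has many more
cols = ["A", "B", "C"]

# brute force: one separate groupBy/count per selection of columns
for r in range(1, len(cols) + 1):
    for subset in combinations(cols, r):
        df.groupBy(*subset).count().show()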

Comments (2)

把回忆走一遍 2025-02-18 06:23:31

cube can do it, but it returns ALL grouping combinations, including the subtotal rows where some columns were not taken into account (those columns come back as NULL), so you will have to filter afterwards.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('id1', 'on', 'on', 'on'),
     ('id1', 'on', 'off', 'on'), 
     ('id1', 'on', 'on', 'on'), 
     ('id1', 'on', 'on', 'on'), 
     ('id1', 'on', 'on', 'off')],
    ['id', 'A', 'B', 'C'])

# cube over every column yields a count for every grouping combination (plus subtotal rows)
df = df.cube(df.columns).count()
# keep only the rows where no grouping column was left NULL by cube
df = df.filter(F.forall(F.array(df.columns), lambda x: x.isNotNull()))

df.show()
# +---+---+---+---+-----+
# | id|  A|  B|  C|count|
# +---+---+---+---+-----+
# |id1| on| on| on|    3|
# |id1| on|off| on|    1|
# |id1| on| on|off|    1|
# +---+---+---+---+-----+

This counts occurrences over just the specified columns:

cols = ['A', 'B']
# note: run this against the original frame, not the cubed result from the previous snippet
df = df.cube(cols).count()
df = df.filter(F.forall(F.array(df.columns), lambda x: x.isNotNull()))

df.show()
# +---+---+-----+
# |  A|  B|count|
# +---+---+-----+
# | on|off|    1|
# | on| on|    4|
# +---+---+-----+
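
If the data itself can contain genuine NULL values, the isNotNull filter above would also throw away real rows. A minimal alternative sketch (not part of the original answer) keeps the full combinations via grouping_id(), which is 0 exactly when every cube column took part in the grouping; it assumes df is the original five-row sample frame, not the cubed result:

from pyspark.sql import functions as F

# assumption: df is the original sample frame defined above
counts = (
    df.cube(df.columns)
      .agg(F.count(F.lit(1)).alias('count'),
           F.grouping_id().alias('gid'))  # gid == 0 -> every column was grouped, i.e. no subtotal row
      .filter('gid = 0')
      .drop('gid')
)
counts.show()

On the sample data this should produce the same three rows as the isNotNull filter above.
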
找回味觉 2025-02-18 06:23:31

Solution:

df = spark.createDataFrame(
    [
        ("id1", "on","on","on"),  # create your data here, be consistent in the types.
        ("id1", "on","off","on"),
        ("id1", "on","on","on"),
        ("id1", "on","on","on"),
        ("id1", "on","on","off"),
    ],
    ["id", "A" , "B" , "C"]  # add your column names here
)

Apart from the cube function, there is also rollup.

cube: takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns.

rollup: similar to cube, but it computes hierarchical subtotals from left to right. GROUP BY ROLLUP(...) behaves like CUBE except that it only fills in the grouping columns hierarchically, from left to right.

from pyspark.sql import functions as F

# rollup over all columns, then drop the subtotal rows (they contain NULLs in the grouping columns)
df = df.rollup(df.columns).count()

df1 = df.na.drop(subset=df.columns)

df1.show()

# +---+---+---+---+-----+
# | id|  A|  B|  C|count|
# +---+---+---+---+-----+
# |id1| on| on|off|    1|
# |id1| on| on| on|    3|
# |id1| on|off| on|    1|
# +---+---+---+---+-----+
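
Since every subtotal row produced by rollup (or cube) is dropped afterwards anyway, a plain groupBy is arguably the most direct way to get only the full combinations. A minimal sketch (not part of the original answer), run against the original sample frame:

# assumption: df is the original five-row sample frame, before the rollup above
df.groupBy('id', 'A', 'B', 'C').count().show()
# this should reproduce the expected output from the question (row order may differ)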