How to delete redundant values based on other values?

Posted 2025-01-30 19:59:16

In the dataframe below, there are several apartments with different jobs:

+---+---------+------+                                                                 
|id |apartment|job   |
+---+---------+------+
|1  |Ap1      |dev   |
|2  |Ap1      |anyl  |
|3  |Ap2      |dev   |
|4  |Ap2      |anyl  |
|5  |Ap2      |anyl  |
|6  |Ap2      |dev   |
|7  |Ap2      |dev   |
|8  |Ap2      |dev   |
|9  |Ap3      |anyl  |
|10 |Ap3      |dev   |
|11 |Ap3      |dev   |
+---+---------+------+

For each apartment, the number of rows with job='dev' should be equal to the number of rows with job='anyl' (as is already the case for Ap1). How can I delete the redundant 'dev' rows in all the apartments?

The expected result:

+---+---------+------+                                                                 
|id |apartment|job   |
+---+---------+------+
|1  |Ap1      |dev   |
|2  |Ap1      |anyl  |
|3  |Ap2      |dev   |
|4  |Ap2      |anyl  |
|5  |Ap2      |anyl  |
|6  |Ap2      |dev   |
|9  |Ap3      |anyl  |
|10 |Ap3      |dev   |
+---+---------+------+

I guess I should use Window functions to deal with that, but I couldn't figure it out.

Comments (2)

烟雨扶苏 2025-02-06 19:59:16

I think you first need to find out how many 'anyl' rows you have for every 'apartment' and then use that to delete all the excess 'dev' rows. So: first an aggregation, then a join, and then the window function row_number, before you can filter out what you don't need.

Setup:

from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(1, 'Ap1', 'dev'),
     (2, 'Ap1', 'anyl'),
     (3, 'Ap2', 'dev'),
     (4, 'Ap2', 'anyl'),
     (5, 'Ap2', 'anyl'),
     (6, 'Ap2', 'dev'),
     (7, 'Ap2', 'dev'),
     (8, 'Ap2', 'dev'),
     (9, 'Ap3', 'anyl'),
     (10, 'Ap3', 'dev'),
     (11, 'Ap3', 'dev')],
    ['id', 'apartment', 'job']
)

Script:

# Count the 'anyl' rows per apartment and attach that count to every row.
df_grp = df.filter(F.col('job') == 'anyl').groupBy('apartment').count()
df = df.join(df_grp, 'apartment', 'left')

# Number the rows within each (apartment, job) group and keep at most `count` of them.
w = W.partitionBy('apartment', 'job').orderBy('id')
df = df.withColumn('_rn', F.row_number().over(w))
df = df.filter('_rn <= count')
df = df.select('id', 'apartment', 'job')

df.show()
# +---+---------+----+
# | id|apartment| job|
# +---+---------+----+
# |  2|      Ap1|anyl|
# |  1|      Ap1| dev|
# |  4|      Ap2|anyl|
# |  5|      Ap2|anyl|
# |  3|      Ap2| dev|
# |  6|      Ap2| dev|
# |  9|      Ap3|anyl|
# | 10|      Ap3| dev|
# +---+---------+----+
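The rows above match the expected result, just grouped by apartment and job rather than ordered by id. As a minimal alternative sketch (not part of the original answer), the per-apartment 'anyl' count can also be attached with a window instead of the aggregate + join, assuming df is the original DataFrame from the Setup block (before the script above reassigns it):

from pyspark.sql import functions as F, Window as W

# Sketch only: attach the per-apartment 'anyl' count with a window instead of a join.
w_cnt = W.partitionBy('apartment')                       # one count per apartment
w_rn = W.partitionBy('apartment', 'job').orderBy('id')   # row numbers within each job

result = (
    df  # the original DataFrame from the Setup block
    .withColumn('anyl_cnt', F.count(F.when(F.col('job') == 'anyl', 1)).over(w_cnt))
    .withColumn('_rn', F.row_number().over(w_rn))
    .filter('_rn <= anyl_cnt')   # keep at most as many 'dev' rows as there are 'anyl' rows
    .select('id', 'apartment', 'job')
    .orderBy('id')
)
result.show()
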
再见回来 2025-02-06 19:59:16

Using a left semi join instead of the groupBy + filter combo suggested by @ZygD might be more efficient:

>>> from pyspark.sql import Window
>>> from pyspark.sql.functions import col, row_number
>>> df1 = df.withColumn('rn', row_number().over(Window.partitionBy('apartment', 'job').orderBy('id')))
>>> dfa = df1.where("job = 'anyl'").alias('dfa')
>>> df2 = df1.join(dfa, (df1.apartment == col('dfa.apartment')) & (df1.rn == col('dfa.rn')), 'leftsemi')
>>> df2.show(truncate=False)
+---+---------+----+---+
|id |apartment|job |rn |
+---+---------+----+---+
|1  |Ap1      |dev |1  |
|2  |Ap1      |anyl|1  |
|3  |Ap2      |dev |1  |
|4  |Ap2      |anyl|1  |
|5  |Ap2      |anyl|2  |
|6  |Ap2      |dev |2  |
|9  |Ap3      |anyl|1  |
|10 |Ap3      |dev |1  |
+---+---------+----+---+
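
A left semi join returns only the columns of the left-hand DataFrame, so the helper column rn is still present. To match the expected output exactly, it can be dropped and the rows re-ordered by id:

>>> df2.drop('rn').orderBy('id').show(truncate=False)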