How to delete redundant values based on other values?
In the dataframe below, there are several apartments with different job values:
+---+---------+------+
|id |apartment|job |
+---+---------+------+
|1 |Ap1 |dev |
|2 |Ap1 |anyl |
|3 |Ap2 |dev |
|4 |Ap2 |anyl |
|5 |Ap2 |anyl |
|6 |Ap2 |dev |
|7 |Ap2 |dev |
|8 |Ap2 |dev |
|9 |Ap3 |anyl |
|10 |Ap3 |dev |
|11 |Ap3 |dev |
+---+---------+------+
For each apartment, the number of rows with job='dev' should be equal to the number of rows with job='anyl' (like for Ap1). How can I delete the redundant rows with 'dev' in all the apartments?
The expected result:
+---+---------+------+
|id |apartment|job |
+---+---------+------+
|1 |Ap1 |dev |
|2 |Ap1 |anyl |
|3 |Ap2 |dev |
|4 |Ap2 |anyl |
|5 |Ap2 |anyl |
|6 |Ap2 |dev |
|9 |Ap3 |anyl |
|10 |Ap3 |dev |
+---+---------+------+
I guess I should use Window functions to deal with that, but I couldn't figure it out.
Comments (2)
I think you first need to find out how many 'anyl' rows you have for every 'apartment' and then use that count to delete all the excess 'dev' rows. So, first an aggregation, then a join, and then the window function row_number, before you can filter out what you don't need.
Setup:
Script:
Using a left semi join instead of the groupBy + filter combo suggested by @ZygD might be more efficient: