Pandas: drop duplicates that occur within a time interval

Posted on 2025-01-24 20:43:43

We have a dataframe containing 'ID' and 'DAY' columns, which show when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates occurred at most 30 days apart. Please see the example below:

Current Dataset:

   ID         DAY
0   1  22.03.2020
1   1  18.04.2020
2   2  10.05.2020
3   2  13.01.2020
4   3  30.03.2020
5   3  31.03.2020
6   3  24.02.2021

Goal:

   ID         DAY
0   1  22.03.2020
1   2  10.05.2020
2   2  13.01.2020
3   3  30.03.2020
4   3  24.02.2021

Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...
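
For anyone who wants to reproduce the example, here is a minimal sketch that rebuilds the dataset above. The variable name `df` and the explicit `format` string are assumptions; the values are copied from the table.

```python
import pandas as pd

# Sample data copied from the "Current Dataset" table in the question
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3, 3],
    "DAY": ["22.03.2020", "18.04.2020", "10.05.2020", "13.01.2020",
            "30.03.2020", "31.03.2020", "24.02.2021"],
})

# The dates are day-first (DD.MM.YYYY), so parse them with an explicit format
df["DAY"] = pd.to_datetime(df["DAY"], format="%d.%m.%Y")

print(df)
```

Parsing with an explicit format avoids any ambiguity for dates such as 10.05.2020, which should be read as 10 May rather than 5 October.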

秋风の叶未落 2025-01-31 20:43:43

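You can compute the difference between successive dates per group and use it to build a mask that removes rows less than 30 days after the previous complaint of the same ID:

import pandas as pd

# Parse the day-first date strings (e.g. 22.03.2020) into datetimes
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

# True where a complaint follows the same ID's previous complaint by less than 30 days
# (use .le instead of .lt to also drop gaps of exactly 30 days)
mask = (df
        .sort_values(by=['ID', 'DAY'])
        .groupby('ID')['DAY']
        .diff().lt(pd.Timedelta('30d'))
        .sort_index()   # restore the original row order
       )

df[~mask]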

NB. The potential drawback of this approach is that each complaint is compared only with the one immediately before it, so if the customer makes a new complaint within the 30 days, this restarts the threshold for the next complaint.

output:

   ID        DAY
0   1 2020-03-22
2   2 2020-05-10
3   2 2020-01-13
4   3 2020-03-30
6   3 2021-02-24
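
To make the drawback noted above concrete, here is a small sketch with a made-up customer (ID 9, a hypothetical example that is not part of the question's data) who complains every 20 days. Because each row is compared only with the row before it, every complaint after the first is dropped, even the last one, which is 60 days after the first.

```python
import pandas as pd

# Hypothetical customer who complains every 20 days
demo = pd.DataFrame({
    "ID": [9, 9, 9, 9],
    "DAY": pd.to_datetime(["2020-01-01", "2020-01-21", "2020-02-10", "2020-03-01"]),
})

# Same mask as above: True where the gap to the previous complaint is under 30 days
mask = (demo
        .sort_values(by=["ID", "DAY"])
        .groupby("ID")["DAY"]
        .diff()
        .lt(pd.Timedelta(days=30))
        .sort_index()
       )

kept = demo[~mask]
print(kept)  # only the 2020-01-01 row survives
```

Whether this chaining behaviour is wanted depends on how "duplicate" is defined for the use case.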

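Thus, another approach might be to resample the data per group into 30-day bins and keep the first complaint in each bin:

(df
 .groupby('ID')
 .resample('30d', on='DAY').first()   # first row of each 30-day bin per ID
 .dropna()                            # discard bins that contain no complaint
 .convert_dtypes()
 .reset_index(drop=True)
)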

output:

   ID        DAY
0   1 2020-03-22
1   2 2020-01-13
2   2 2020-05-10
3   3 2020-03-30
4   3 2021-02-24
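
A third option, not taken from either answer, is to compare each complaint with the last complaint that was kept rather than with the previous row or a fixed bin. The sketch below is only one way to do that; the helper name `keep_outside_window` is made up, and it assumes the index is unique and that there are no missing dates. It loops in Python within each group, so on millions of rows it will be slower than the vectorized `diff` mask.

```python
import pandas as pd

def keep_outside_window(days: pd.Series, window: pd.Timedelta) -> pd.Series:
    """Return a boolean mask that is True for dates more than `window`
    after the most recently kept date (hypothetical helper)."""
    ordered = days.sort_values()
    keep = []
    last_kept = None
    for day in ordered:
        is_new = last_kept is None or day - last_kept > window
        keep.append(is_new)
        if is_new:
            last_kept = day
    # Put the mask back in the caller's row order
    return pd.Series(keep, index=ordered.index).reindex(days.index)

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3, 3],
    "DAY": ["22.03.2020", "18.04.2020", "10.05.2020", "13.01.2020",
            "30.03.2020", "31.03.2020", "24.02.2021"],
})
df["DAY"] = pd.to_datetime(df["DAY"], format="%d.%m.%Y")

window = pd.Timedelta(days=30)
masks = [keep_outside_window(group, window) for _, group in df.groupby("ID")["DAY"]]
mask = pd.concat(masks).reindex(df.index)

result = df[mask]
print(result)  # keeps the rows with index 0, 2, 3, 4 and 6
```

On the question's data this gives the same rows as the goal table. For a customer who complains every 20 days, it would keep every second complaint instead of only the first one.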

鹿港巷口少年归 2025-01-31 20:43:43

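You can group by the ID column and diff the DAY column within each group, keeping only the rows whose gap to the previous row is more than 30 days:

from datetime import timedelta

import pandas as pd

df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

m = timedelta(days=30)

# Within each ID, drop rows that are at most 30 days from the previous row (in either direction)
out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)

print(out)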

   ID        DAY
0   1 2020-03-22
1   2 2020-05-10
2   2 2020-01-13
3   3 2020-03-30
4   3 2021-02-24

To convert back to the original date format, you can use dt.strftime:

out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')

print(out)

   ID         DAY
0   1  22.03.2020
1   2  10.05.2020
2   2  13.01.2020
3   3  30.03.2020
4   3  24.02.2021
