Remove duplicates using a hierarchy condition (SQL / Python Pandas)

Posted 2025-02-06 09:30:53 · 711 characters · 3 views · 0 comments

I need to delete duplicates in a large database, but which rows get deleted must be decided by a hierarchy, using either SQLite or Python Pandas. Is there an efficient way to achieve this? Preferably with a Python Pandas DataFrame, but SQLite is also fine.

ID	Text	Category
1	text	Priority 3
2	text	Priority 1
3	text	Priority 2
4	text 2	Priority 3
5	text 2	Priority 2

This should turn into:

ID	Text	Category
2	text	Priority 1
5	text 2	Priority 2

谈场末日恋爱 2025-02-13 09:30:53

Try this:

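# Sort so that, within each Text, the smallest Category string comes first
df = df.sort_values(by=['Text', 'Category'], ascending=[True, True])
# Keep the first Category per Text (only Text and Category are returned; ID is dropped)
df.groupby('Text')['Category'].first().reset_index()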

Output:

index	Text	Category
0	text	Priority 1
1	text 2	Priority 2

只怪假的太真实 2025-02-13 09:30:53

A very similar approach to @Drakax's, but using drop_duplicates instead of groupby and first:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Text': ['text', 'text', 'text', 'text 2', 'text 2'],
    'Category': ['Priority 3', 'Priority 1', 'Priority 2', 'Priority 3', 'Priority 2'],
})

df.sort_values(['Text','Category']).drop_duplicates('Text')
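
Both sort-based answers rely on the category labels sorting alphabetically in priority order, which holds for 'Priority 1' to 'Priority 3' but would not for a label such as 'Priority 10'. A minimal sketch of the same idea with an explicit ranking follows; the `rank` dictionary and the temporary `_rank` column are assumptions made for illustration, not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Text': ['text', 'text', 'text', 'text 2', 'text 2'],
    'Category': ['Priority 3', 'Priority 1', 'Priority 2', 'Priority 3', 'Priority 2'],
})

# Hypothetical hierarchy: a lower number means a higher priority
rank = {'Priority 1': 1, 'Priority 2': 2, 'Priority 3': 3}

result = (
    df.assign(_rank=df['Category'].map(rank))   # numeric rank per row
      .sort_values(['Text', '_rank'])           # best-ranked row first within each Text
      .drop_duplicates('Text', keep='first')    # keep only that row
      .drop(columns='_rank')                    # remove the helper column
)
print(result)
```

The full rows are kept, so the ID column survives, as in the desired output of the question.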

能否归途做我良人 2025-02-13 09:30:53

Avoid sorting when you can: use a Categorical to define the order of the priorities, then take the index of the minimum per group:

# priorities in order
priorities = ['Priority 1', 'Priority 2', 'Priority 3']
# set up Categorical
df['Category'] = pd.Categorical(df['Category'], priorities, ordered=True)
# min per group 
df.loc[df.groupby('Text')['Category'].idxmin()]

Output:

   ID    Text    Category
1   2    text  Priority 1
4   5  text 2  Priority 2
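
Since the question also allows SQLite, here is a rough sketch of how the deletion might be done in the database itself, driven from Python's built-in sqlite3 module. The table names `items` and `priority_rank` and the `Level` column are hypothetical and chosen only for this example; the hierarchy is stored in a small lookup table, and a row is deleted whenever another row with the same Text has a better level (ties are broken by the lower ID).

```python
import sqlite3

rows = [
    (1, 'text', 'Priority 3'),
    (2, 'text', 'Priority 1'),
    (3, 'text', 'Priority 2'),
    (4, 'text 2', 'Priority 3'),
    (5, 'text 2', 'Priority 2'),
]
# Hypothetical hierarchy table: a lower Level means a higher priority
levels = [('Priority 1', 1), ('Priority 2', 2), ('Priority 3', 3)]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (ID INTEGER PRIMARY KEY, Text TEXT, Category TEXT)')
conn.execute('CREATE TABLE priority_rank (Category TEXT PRIMARY KEY, Level INTEGER)')
conn.executemany('INSERT INTO items VALUES (?, ?, ?)', rows)
conn.executemany('INSERT INTO priority_rank VALUES (?, ?)', levels)

# Delete every row for which a better-ranked row with the same Text exists
conn.execute('''
    DELETE FROM items
    WHERE EXISTS (
        SELECT 1
        FROM items AS other, priority_rank AS r_other, priority_rank AS r_self
        WHERE other.Text = items.Text
          AND r_other.Category = other.Category
          AND r_self.Category = items.Category
          AND (r_other.Level < r_self.Level
               OR (r_other.Level = r_self.Level AND other.ID < items.ID))
    )
''')
conn.commit()

remaining = conn.execute('SELECT ID, Text, Category FROM items ORDER BY ID').fetchall()
print(remaining)
```

On a large table, an index on the Text column would likely help the correlated subquery, though that has not been measured here.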
