Removing duplicates from a list of objects in Python
I've got a list of objects and I've got a db table full of records. My list of objects has a title attribute, and I want to remove any objects with duplicate titles from the list (leaving the original).

Then I want to check if my list of objects has any duplicates of any records in the database and, if so, remove those items from the list before adding them to the database.

I have seen solutions for removing duplicates from a list like this: myList = list(set(myList)), but I'm not sure how to do that with a list of objects.

I need to maintain the order of my list of objects too. I was also thinking maybe I could use difflib to check for differences in the titles.
The set(list_of_objects) call will only remove duplicates if you know what a duplicate is; that is, you'll need to define uniqueness for your objects. In order to do that, you'll need to make the objects hashable. You need to define both the __hash__ and __eq__ methods; here is how: http://docs.python.org/glossary.html#term-hashable

Though, you'll probably only need to define the __eq__ method.

EDIT: How to implement the __eq__ method: you'll need to know, as I mentioned, the uniqueness definition of your object. Suppose we have a Book with attributes author_name and title whose combination is unique (so we can have many books authored by Stephen King, and many books named The Shining, but only one book named The Shining by Stephen King); then the implementation is as follows:
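A minimal sketch of such a Book class (illustrative; only the two identifying attributes are shown):

class Book(object):
    def __init__(self, author_name, title):
        self.author_name = author_name
        self.title = title

    def __eq__(self, other):
        # Two books are the same book iff both identifying fields match.
        return (isinstance(other, Book)
                and self.author_name == other.author_name
                and self.title == other.title)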
Similarly, this is how I sometimes implement the __hash__ method:
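A sketch, continuing the Book class above, hashing the same fields that __eq__ compares:

    def __hash__(self):
        # Combine the identifying fields into one hash value.
        return hash((self.author_name, self.title))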
You can check that if you create a list of 2 books with the same author and title, the book objects will be equal (with the == operator). Also, when set() is used, it will remove one book.

EDIT: This is an old answer of mine, but I only now noticed that it had an error, which is corrected in the paragraph above: objects with the same hash() won't give True when compared with is. Hashability of an object is used, however, if you intend to use objects as elements of a set, or as keys in a dictionary.
Since they're not hashable, you can't use a set directly. The titles should be though.
Here's the first part.
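A sketch of it, keeping the first occurrence of each title while preserving order (it assumes each object has a title attribute, as in the question):

seen_titles = set()
deduped = []
for obj in myList:
    if obj.title not in seen_titles:   # first time this title appears
        seen_titles.add(obj.title)
        deduped.append(obj)            # keep the original, drop later duplicates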
You're going to need to describe what database/ORM etc. you're using for the second part though.
If you can't (or won't) define __eq__ for the objects, you can use a dict comprehension to achieve the same end:
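For example, keyed on the attribute that defines uniqueness (a sketch matching the Item example below):

unique = list({item.attribute: item for item in mylist}.values())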
Note that this will contain the last instance of a given key, e.g. for mylist = [Item(attribute=1, tag='first'), Item(attribute=1, tag='second'), Item(attribute=2, tag='third')] you get [Item(attribute=1, tag='second'), Item(attribute=2, tag='third')]. You can get around this by using mylist[::-1] (if the full list is present).
This seems pretty minimal:
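One minimal version (a sketch; set.add() returns None, so the not test passes exactly once per title and order is preserved):

seen = set()
unique = [obj for obj in myList if obj.title not in seen and not seen.add(obj.title)]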
For non-hashable types you can use a dictionary comprehension to remove duplicate objects based on a field present in all of the objects. This is particularly useful for Pydantic, whose models aren't hashable by default:
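A sketch of that comprehension, deduplicating a list of rows on their title field (dicts preserve insertion order in Python 3.7+):

deduped = list({row.title: row for row in rows}.values())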
Note that this will consider duplicates based solely on row.title, and will keep the last object matched for each row.title. This means that if your rows can have the same title but different values in other attributes, this won't work, e.g.

[{"title": "test", "myval": 1}, {"title": "test", "myval": 2}] ==> [{"title": "test", "myval": 2}]
If you wanted to match against multiple fields in row, you could extend this further:
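For instance, by joining the fields into a single composite key (a sketch using the title and value fields):

deduped = list({f"{row.title}\0{row.value}": row for row in rows}.values())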
The null character \0 is used as a separator between fields. This assumes that the null character isn't used in either row.title or row.value.
Both __hash__ and __eq__ are needed for this. __hash__ is needed to add an object to a set, since Python's sets are implemented as hash tables. By default, immutable objects like numbers, strings, and tuples are hashable.

However, hash collisions (two distinct objects hashing to the same value) are inevitable, due to the pigeonhole principle. So two objects cannot be distinguished using only their hash, and the user must supply their own __eq__ function. Thus, the actual hash function the user provides is not crucial, though it is best to try to avoid hash collisions for performance (see What's a correct and good way to implement __hash__()?).
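To see why both methods matter, here is a small illustration (a sketch; the Title classes are hypothetical):

class Title:
    def __init__(self, text):
        self.text = text

    def __hash__(self):
        return hash(self.text)      # equal texts hash alike

a, b = Title("The Shining"), Title("The Shining")
print(hash(a) == hash(b))   # True: the hashes match
print(a == b)               # False: the default __eq__ compares identity
print(len({a, b}))          # 2: the set keeps both, because __eq__ says they differ

class UniqueTitle(Title):
    def __eq__(self, other):
        return isinstance(other, UniqueTitle) and self.text == other.text
    __hash__ = Title.__hash__   # defining __eq__ sets __hash__ to None, so restore it

c, d = UniqueTitle("The Shining"), UniqueTitle("The Shining")
print(len({c, d}))          # 1: now set() can deduplicate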
I recently ended up using the code below. It is similar to other answers in that it iterates over the list and records what it has seen, then removes any item it has already seen, but instead of building a second list it deletes the items from the original list.
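A sketch of that in-place approach, deduplicating by title and keeping the first occurrence:

seen_titles = set()
i = 0
while i < len(myList):
    if myList[i].title in seen_titles:
        del myList[i]                    # delete the later duplicate in place
    else:
        seen_titles.add(myList[i].title)
        i += 1                           # only advance past items we keep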
It's quite easy, friends:
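Perhaps along these lines (a sketch using dict.setdefault, which keeps the first object stored under each title; dicts preserve insertion order in Python 3.7+):

by_title = {}
for obj in myList:
    by_title.setdefault(obj.title, obj)
myList = list(by_title.values())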
That's it! :)