为什么 Python 字典可以有多个具有相同哈希值的键?
我试图理解 Python hash
函数的底层。我创建了一个自定义类,其中所有实例都返回相同的哈希值。
class C:
def __hash__(self):
return 42
我只是假设在任何时候 dict
中只能存在上述类的一个实例,但实际上 dict
可以包含具有相同哈希值的多个元素。
c, d = C(), C()
x = {c: 'c', d: 'd'}
print(x)
# {<__main__.C object at 0x7f0824087b80>: 'c', <__main__.C object at 0x7f0823ae2d60>: 'd'}
# note that the dict has 2 elements
我进行了更多实验,发现如果我重写 __eq__ 方法以使类的所有实例比较相等,则 dict 只允许一个实例。
class D:
def __hash__(self):
return 42
def __eq__(self, other):
return True
p, q = D(), D()
y = {p: 'p', q: 'q'}
print(y)
# {<__main__.D object at 0x7f0823a9af40>: 'q'}
# note that the dict only has 1 element
所以我很想知道 dict
如何拥有具有相同哈希值的多个元素。
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
这是我能够整理的有关 Python 字典的所有内容(可能比任何人想知道的都多;但答案很全面)。向 Duncan 致敬,他指出 Python 字典使用槽并引导我进入这个兔子洞。
O(1)
查找)。下图是Python哈希表的逻辑表示。下图中,左边的 0, 1, ..., i, ... 是哈希表中槽的索引(仅用于说明目的,不与显然桌子!)。
当一个新的字典初始化时,它以 8 个槽位开始。 (参见dictobject.h:49)
i
开始。 CPython 使用初始 i = hash(key) &掩码。其中mask = PyDictMINSIZE - 1
,但这并不重要)。请注意,检查的初始槽 i 取决于密钥的哈希。
)。但如果那个位置被占用了怎么办!?最有可能的是因为另一个条目具有相同的哈希值(哈希冲突!)==槽中的条目与要插入的当前条目的键的
比较,而不是is
比较)(dictobject.c:337,344-345)。如果两者匹配,则它认为该条目已经存在,放弃并移至下一个要插入的条目。如果哈希值或密钥不匹配,它将开始探测。(参见 dictobject.h 你去! dict 的 Python 实现在插入项时检查两个键的哈希相等性以及键的正常相等性 (
==
)。所以综上所述,如果有两个键,a
和b
并且hash(a)==hash(b)
,但是a!=b
,那么两者可以和谐地存在于Python字典中。但是如果hash(a)==hash(b)
和a==b
,那么它们不能都在同一个字典中。因为我们必须在每次哈希冲突后进行探测,所以太多哈希冲突的一个副作用是查找和插入将变得非常慢(正如 Duncan 在 评论)。
我想我的问题的简短答案是,“因为这就是它在源代码中的实现方式;)”
虽然这很高兴知道(对于极客点?),但我不确定它如何在现实生活中使用。因为除非你试图显式地破坏某些东西,否则为什么两个不相等的对象具有相同的哈希值?
Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive). A shout out to Duncan for pointing out that Python dicts use slots and leading me down this rabbit hole.
O(1)
lookup by index).The figure below is a logical representation of a python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).
When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)
i
that is based on the hash of the key. CPython uses initiali = hash(key) & mask
. Wheremask = PyDictMINSIZE - 1
, but that's not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.<hash|key|value>
). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)==
comparison not theis
comparison) of the entry in the slot against the key of the current entry to be inserted (dictobject.c:337,344-345). If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don't match, it starts probing.There you go! The Python implementation of dict checks for both hash equality of two keys and the normal equality (
==
) of the keys when inserting items. So in summary, if there are two keys,a
andb
andhash(a)==hash(b)
, buta!=b
, then both can exist harmoniously in a Python dict. But ifhash(a)==hash(b)
anda==b
, then they cannot both be in the same dict.Because we have to probe after every hash collision, one side effect of too many hash collisions is that the lookups and insertions will become very slow (as Duncan points out in the comments).
I guess the short answer to my question is, "Because that's how it's implemented in the source code ;)"
While this is good to know (for geek points?), I am not sure how it can be used in real life. Because unless you are trying to explicitly break something, why would two objects that are not equal, have same hash?
有关 Python 哈希工作原理的详细说明,请参阅我对 为什么早期返回比其他方法慢? 的
回答 基本上它使用哈希来选择桌子上的一个槽位。如果槽中有一个值并且哈希匹配,它将比较这些项目以查看它们是否相等。
如果哈希匹配但项目不相等,则它会尝试另一个槽。有一个公式可以选择这个(我在参考答案中描述),它逐渐提取哈希值中未使用的部分;但是一旦它用完它们,它最终将遍历哈希表中的所有槽。这保证了我们最终要么找到匹配的项目,要么找到空的插槽。当搜索找到空槽时,它会插入该值或放弃(取决于我们是添加还是获取值)。
需要注意的重要一点是,没有列表或桶:只有一个具有特定数量的槽的哈希表,每个哈希用于生成一系列候选槽。
For a detailed description of how Python's hashing works see my answer to Why is early return slower than else?
Basically it uses the hash to pick a slot in the table. If there is a value in the slot and the hash matches, it compares the items to see if they are equal.
If the hash matches but the items aren't equal, then it tries another slot. There's a formula to pick this (which I describe in the referenced answer), and it gradually pulls in unused parts of the hash value; but once it has used them all up, it will eventually work its way through all slots in the hash table. That guarantees eventually we either find a matching item or an empty slot. When the search finds an empty slot, it inserts the value or gives up (depending whether we are adding or getting a value).
The important thing to note is that there are no lists or buckets: there is just a hash table with a particular number of slots, and each hash is used to generate a sequence of candidate slots.
编辑:下面的答案是处理哈希冲突的可能方法之一,但Python不是这样做的。下面引用的 Python wiki 也是不正确的。下面@Duncan给出的最佳来源是实现本身: https:// /github.com/python/cpython/blob/master/Objects/dictobject.c 我为混淆表示歉意。
它在哈希中存储元素列表(或存储桶),然后迭代该列表,直到找到该列表中的实际键。一张图片说一千多个字:
在这里你看到
约翰Smith
和Sandra Dee
都哈希为152
。 Bucket152
包含它们两者。查找Sandra Dee
时,它首先在存储桶152
中查找列表,然后循环遍历该列表,直到找到Sandra Dee
并返回521-6955。
以下内容是错误的,仅用于上下文:在 Python wiki你可以找到(伪?)Python 如何执行查找的代码。
实际上有几种可能的解决方案可以解决这个问题,请查看维基百科文章以获得很好的概述:http:// /en.wikipedia.org/wiki/Hash_table#Collision_resolution
Edit: the answer below is one of possible ways to deal with hash collisions, it is however not how Python does it. Python's wiki referenced below is also incorrect. The best source given by @Duncan below is the implementation itself: https://github.com/python/cpython/blob/master/Objects/dictobject.c I apologize for the mix-up.
It stores a list (or bucket) of elements at the hash then iterates through that list until it finds the actual key in that list. A picture says more than a thousand words:
Here you see
John Smith
andSandra Dee
both hash to152
. Bucket152
contains both of them. When looking upSandra Dee
it first finds the list in bucket152
, then loops through that list untilSandra Dee
is found and returns521-6955
.The following is wrong it's only here for context: On Python's wiki you can find (pseudo?) code how Python performs the lookup.
There's actually several possible solutions to this problem, check out the wikipedia article for a nice overview: http://en.wikipedia.org/wiki/Hash_table#Collision_resolution
哈希表通常必须允许哈希冲突!你会很不幸,两件事最终会散列成同一件事。下面,项目列表中有一组具有相同哈希键的对象。通常,该列表中只有一件事,但在这种情况下,它会继续将它们堆叠到同一个列表中。它知道它们不同的唯一方法是通过等于运算符。
发生这种情况时,您的性能会随着时间的推移而下降,这就是为什么您希望哈希函数尽可能“随机”。
Hash tables, in general have to allow for hash collisions! You will get unlucky and two things will eventually hash to the same thing. Underneath, there is a set of objects in a list of items that has that same hash key. Usually, there is only one thing in that list, but in this case, it'll keep stacking them into the same one. The only way it knows they are different is through the equals operator.
When this happens, your performance will degrade over time, which is why you want your hash function to be as "random as possible".
在线程中,当我们将用户定义类的实例作为键放入字典中时,我没有看到 python 到底对它做了什么。让我们阅读一些文档:它声明只有可哈希对象才能用作键。 Hashable 是所有不可变的内置类和所有用户定义的类。
因此,如果您的类中不断有 __hash__ ,但不提供任何 __cmp__ 或 __eq__ 方法,那么您的所有实例对于字典来说都是不相等的。
另一方面,如果您提供任何 __cmp__ 或 __eq__ 方法,但不提供 __hash__,则您的实例在字典方面仍然不相等。
输出
In the thread I did not see what exactly python does with instances of a user-defined classes when we put it into a dictionary as a keys. Let's read some documentation: it declares that only hashable objects can be used as a keys. Hashable are all immutable built-in classes and all user-defined classes.
So if you have a constantly __hash__ in your class, but not providing any __cmp__ or __eq__ method, then all your instances are unequal for the dictionary.
In the other hand, if you providing any __cmp__ or __eq__ method, but not providing __hash__, your instances are still unequal in terms of dictionary.
Output