
Why can a Python dict have multiple keys with the same hash?

Posted 2024-12-29 00:36:24

I am trying to understand the Python hash function under the hood. I created a custom class where all instances return the same hash value.

class C:
    def __hash__(self):
        return 42

I just assumed that only one instance of the above class can be in a dict at any time, but in fact a dict can have multiple elements with the same hash.

c, d = C(), C()
x = {c: 'c', d: 'd'}
print(x)
# {<__main__.C object at 0x7f0824087b80>: 'c', <__main__.C object at 0x7f0823ae2d60>: 'd'}
# note that the dict has 2 elements

I experimented a little more and found that if I override the __eq__ method such that all the instances of the class compare equal, then the dict only allows one instance.

class D:
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return True

p, q = D(), D()
y = {p: 'p', q: 'q'}
print(y)
# {<__main__.D object at 0x7f0823a9af40>: 'q'}
# note that the dict only has 1 element

So I am curious to know how a dict can have multiple elements with the same hash.

意中人 2025-01-05 00:36:24

Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive). A shout out to Duncan for pointing out that Python dicts use slots and leading me down this rabbit hole.

Python dictionaries are implemented as hash tables.
Hash tables must allow for hash collisions, i.e. even if two keys have the same hash value, the implementation of the table must have a strategy to insert and retrieve the key and value pairs unambiguously.
Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).
A Python hash table is just a contiguous block of memory (sort of like an array, so you can do O(1) lookup by index).
Each slot in the table can store one and only one entry. This is important.
Each entry in the table is actually a combination of three values: <hash, key, value>. This is implemented as a C struct (see dictobject.h:51-56).

The figure below is a logical representation of a python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).

# Logical model of Python Hash table
-+-----------------+
0| <hash|key|value>|
-+-----------------+
1|      ...        |
-+-----------------+
.|      ...        |
-+-----------------+
i|      ...        |
-+-----------------+
.|      ...        |
-+-----------------+
n|      ...        |
-+-----------------+

When a new dict is initialized it starts with 8 slots (see dictobject.h:49).
When adding entries to the table, we start with some slot i that is based on the hash of the key. CPython uses the initial i = hash(key) & mask, where mask = PyDict_MINSIZE - 1 (but that's not really important). Just note that the initial slot i that is checked depends on the hash of the key.
If that slot is empty, the entry is added to the slot (by entry, I mean <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!).
If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean the == comparison, not the is comparison) of the entry in the slot against the hash and key of the current entry to be inserted (dictobject.c:337,344-345). If both match, then it thinks the entry already exists: it adds no new entry (the value stored for the existing key is replaced) and moves on to the next entry to be inserted. If either the hash or the key doesn't match, it starts probing.
Probing just means it searches slot by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo-random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the probing algorithm). What is important is that the slots are probed until the first empty slot is found.
The same thing happens for lookups: the search starts with the initial slot i (where i depends on the hash of the key). If the hash and the key don't both match the entry in the slot, it starts probing until it finds a slot with a match. If it reaches an empty slot without finding a match, it reports a failure.
BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups (see dictobject.h:64-65).
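
The insertion and lookup steps above can be sketched as a toy table in pure Python. This is only an illustration under stated assumptions, not CPython's code: the class name ToyDict is made up, the table never resizes, and it uses plain linear probing (i+1, i+2, ...) instead of CPython's pseudo-random probing to keep the sketch short.

```python
class ToyDict:
    """Toy open-addressing table (hypothetical; NOT CPython's implementation)."""

    def __init__(self, size=8):
        # Each slot is either None (empty) or a (hash, key, value) tuple.
        self.slots = [None] * size

    def _probe(self, key):
        """Yield candidate slot indices, starting at hash(key) & mask."""
        mask = len(self.slots) - 1  # size is assumed to be a power of two
        i = hash(key) & mask
        for _ in range(len(self.slots)):
            yield i
            i = (i + 1) & mask  # linear probing, chosen here for simplicity

    def __setitem__(self, key, value):
        h = hash(key)
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:
                # First empty slot on the probe path: store the new entry.
                self.slots[i] = (h, key, value)
                return
            if entry[0] == h and (entry[1] is key or entry[1] == key):
                # Same hash AND equal key: keep the stored key, replace the value.
                self.slots[i] = (h, entry[1], value)
                return
        raise RuntimeError("table full (this toy never resizes)")

    def __getitem__(self, key):
        h = hash(key)
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:
                break  # an empty slot ends the search: the key is absent
            if entry[0] == h and (entry[1] is key or entry[1] == key):
                return entry[2]
        raise KeyError(key)


class C:
    def __hash__(self):
        return 42


c, d = C(), C()
t = ToyDict()
t[c] = "c"
t[d] = "d"  # same hash, unequal key: lands in the next free slot
```

Both keys start at slot 42 & 7, which is 2; the second one finds that slot occupied by an unequal key and moves on to the next slot.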

There you go! The Python implementation of dict checks for both hash equality of two keys and the normal equality (==) of the keys when inserting items. So in summary, if there are two keys, a and b and hash(a)==hash(b), but a!=b, then both can exist harmoniously in a Python dict. But if hash(a)==hash(b) and a==b, then they cannot both be in the same dict.
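
A small sketch of that rule, using the two classes from the question plus one built-in case. The variable names are made up for illustration; the built-in case relies on the fact that 1, 1.0 and True compare equal and hash equal.

```python
class C:
    """Constant hash, default (identity-based) equality."""
    def __hash__(self):
        return 42


class D:
    """Constant hash, and every instance compares equal to anything."""
    def __hash__(self):
        return 42

    def __eq__(self, other):
        return True


c1, c2 = C(), C()
same_hash_unequal = {c1: "first", c2: "second"}   # both keys are kept

d1, d2 = D(), D()
same_hash_equal = {d1: "first", d2: "second"}     # collapses to one entry

# 1 == 1.0 == True and they share a hash, so they are one key to a dict.
builtin_equal = {1: "int", 1.0: "float", True: "bool"}

print(len(same_hash_unequal), len(same_hash_equal), builtin_equal)
```

Note which parts survive when keys are equal: the dict keeps the first key object it saw and the last value assigned to it.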

Because we have to probe after every hash collision, one side effect of too many hash collisions is that the lookups and insertions will become very slow (as Duncan points out in the comments).
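
One way to see this cost without relying on timings is to count how often the dict has to call __eq__. The sketch below is an illustration only: the Counted class and the eq_calls_for helper are made-up names, and the exact counts depend on the interpreter.

```python
class Counted:
    """Key with a caller-chosen hash that counts calls to __eq__."""
    eq_calls = 0

    def __init__(self, h):
        self._h = h

    def __hash__(self):
        return self._h

    def __eq__(self, other):
        Counted.eq_calls += 1
        return self is other


def eq_calls_for(hashes):
    """Insert one fresh key per hash into a dict; return the __eq__ call count."""
    Counted.eq_calls = 0
    d = {}
    for h in hashes:
        d[Counted(h)] = None
    return Counted.eq_calls


n = 200
colliding = eq_calls_for([42] * n)   # every key has the same hash
distinct = eq_calls_for(range(n))    # every key has a different hash

print("colliding:", colliding, "distinct:", distinct)
```

With distinct hashes the dict never needs __eq__, because an equality check only happens after the stored hash matches. With a constant hash, every insertion has to get past the colliding keys already in the table, so I would expect the count to grow roughly quadratically with n.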

I guess the short answer to my question is, "Because that's how it's implemented in the source code ;)"

While this is good to know (for geek points?), I am not sure how it can be used in real life. Because unless you are trying to explicitly break something, why would two objects that are not equal have the same hash?

哆兒滾 2025-01-05 00:36:24

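For a detailed description of how Python's hashing works, see my answer to "Why is early return slower than else?"

Basically, it uses the hash to pick a slot in the table. If there is a value in the slot and the hash matches, it compares the items to see if they are equal.

If the hash matches but the items aren't equal, then it tries another slot. There's a formula to pick this (which I describe in the referenced answer), and it gradually pulls in unused parts of the hash value; but once it has used them all up, it will eventually work its way through all slots in the hash table. That guarantees that eventually we either find a matching item or an empty slot. When the search finds an empty slot, it inserts the value or gives up (depending on whether we are adding or getting a value).

The important thing to note is that there are no lists or buckets: there is just a hash table with a particular number of slots, and each hash is used to generate a sequence of candidate slots.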
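
The sequence of candidate slots can be sketched as follows. This is modelled on the recurrence described in the comments of CPython's dictobject.c (next = 5*i + 1 + perturb, with perturb shifted right by PERTURB_SHIFT = 5 at each step); the function name probe_sequence is made up, the hash is assumed to be non-negative, and the exact order of the operations differs between CPython versions, so treat it as an approximation rather than the real code.

```python
PERTURB_SHIFT = 5  # shift constant used in CPython's dictobject.c


def probe_sequence(hash_value, table_size, count):
    """Return the first `count` candidate slots for a non-negative hash.

    table_size is assumed to be a power of two.
    """
    mask = table_size - 1
    perturb = hash_value
    i = hash_value & mask          # initial slot: low bits of the hash
    slots = []
    for _ in range(count):
        slots.append(i)
        perturb >>= PERTURB_SHIFT  # pull in higher, not-yet-used hash bits
        i = (i * 5 + perturb + 1) & mask
    return slots


seq = probe_sequence(42, 8, 10)
print(seq)  # [2, 4, 5, 2, 3, 0, 1, 6, 7, 4]
```

Once perturb has been shifted down to zero, the recurrence i = 5*i + 1 (mod table size) cycles through every slot, which is what guarantees that the search eventually reaches a matching entry or an empty slot.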

半透明的墙 2025-01-05 00:36:24

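EDIT: the answer below describes one possible way to deal with hash collisions, but it is not how Python does it. The Python wiki referenced below is also incorrect. The best source, given by @Duncan below, is the implementation itself: https://github.com/python/cpython/blob/master/Objects/dictobject.c I apologize for the mix-up.

It stores a list (or bucket) of elements at the hash, then iterates through that list until it finds the actual key in that list. A picture says more than a thousand words:

[Figure: a hash table in which bucket 152 holds a list containing both John Smith and Sandra Dee]

Here you see John Smith and Sandra Dee both hash to 152. Bucket 152 contains both of them. When looking up Sandra Dee, it first finds the list in bucket 152, then loops through that list until Sandra Dee is found and returns 521-6955.

The following is wrong and is here only for context: on the Python wiki you can find (pseudo?) code showing how Python performs the lookup.

There are actually several possible solutions to this problem; check out the Wikipedia article for a nice overview: http://en.wikipedia.org/wiki/Hash_table#Collision_resolution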
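
For contrast, here is a minimal sketch of the bucket-based strategy (separate chaining) that this answer describes. As the edit note says, this is not how Python's dict works. The class name ChainedTable is made up, the table is given a single bucket only to force a collision, and the phone number for John Smith is a placeholder that does not come from the original answer.

```python
class ChainedTable:
    """Separate chaining: each bucket is a list of (key, value) pairs.

    Hypothetical illustration only; Python's dict uses open addressing.
    """

    def __init__(self, n_buckets=256):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash only selects the bucket; equality finds the key inside it.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for idx, (k, _) in enumerate(bucket):
            if k == key:
                bucket[idx] = (key, value)  # existing key: replace the value
                return
        bucket.append((key, value))         # new key: append to the chain

    def get(self, key):
        for k, v in self._bucket(key):      # walk the chain in this bucket
            if k == key:
                return v
        raise KeyError(key)


# One bucket forces every key to collide, like bucket 152 in the answer.
table = ChainedTable(n_buckets=1)
table.put("John Smith", "000-0000")  # placeholder value, not from the source
table.put("Sandra Dee", "521-6955")
print(table.get("Sandra Dee"))  # 521-6955
```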

别闹i 2025-01-05 00:36:24

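Hash tables, in general, have to allow for hash collisions! You will get unlucky, and two things will eventually hash to the same thing. Underneath, there is a set of objects in a list of items that have that same hash key. Usually there is only one thing in that list, but in this case it will keep stacking them into the same one. The only way it knows they are different is through the equals operator.

When this happens, your performance will degrade over time, which is why you want your hash function to be as "random" as possible.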

热风软妹 2025-01-05 00:36:24

In this thread I did not see what exactly Python does with instances of user-defined classes when we put them into a dictionary as keys. Let's read some documentation: it declares that only hashable objects can be used as keys. All immutable built-in objects and all instances of user-defined classes are hashable.

User-defined classes have __cmp__() and
__hash__() methods by default; with them, all objects
compare unequal (except with themselves) and
x.__hash__() returns a result derived from id(x).

So if your class has a constant __hash__ but does not provide any __cmp__ or __eq__ method, then all your instances are unequal as far as the dictionary is concerned.
On the other hand, if you provide a __cmp__ or __eq__ method but no __hash__, your instances are still unequal in terms of the dictionary.

class A(object):
    def __hash__(self):
        return 42


class B(object):
    def __eq__(self, other):
        return True


class C(A, B):
    pass


dict_a = {A(): 1, A(): 2, A(): 3}
dict_b = {B(): 1, B(): 2, B(): 3}
dict_c = {C(): 1, C(): 2, C(): 3}

print(dict_a)
print(dict_b)
print(dict_c)

Output

{<__main__.A object at 0x7f9672f04850>: 1, <__main__.A object at 0x7f9672f04910>: 3, <__main__.A object at 0x7f9672f048d0>: 2}
{<__main__.B object at 0x7f9672f04990>: 2, <__main__.B object at 0x7f9672f04950>: 1, <__main__.B object at 0x7f9672f049d0>: 3}
{<__main__.C object at 0x7f9672f04a10>: 3}
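
The quoted documentation and the output above match Python 2 (note the reference to __cmp__). As a hedged sketch of how the same classes behave on Python 3: a class that defines __eq__ without defining __hash__ gets __hash__ set to None, so its instances cannot be used as dict keys at all. The variable b_error is made up for illustration.

```python
class A:
    def __hash__(self):
        return 42


class B:
    # Defining __eq__ without __hash__ makes Python 3 set B.__hash__ = None.
    def __eq__(self, other):
        return True


class C(A, B):
    # __hash__ is found on A first in the MRO; __eq__ comes from B.
    pass


try:
    {B(): 1}
    b_error = None
except TypeError as exc:
    b_error = exc  # unhashable type: 'B'

dict_a = {A(): 1, A(): 2, A(): 3}  # same hash, unequal keys: three entries
dict_c = {C(): 1, C(): 2, C(): 3}  # same hash, equal keys: one entry

print(B.__hash__, b_error)
print(len(dict_a), len(dict_c))
```

So on Python 3 the dict_b line from the example raises TypeError instead of producing three entries, while dict_a and dict_c behave as shown above.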
