How does Python determine the order of elements in a set?

Posted 2025-01-15 11:39:05, 538 words, 2 views, 0 comments

I understand that sets in Python are unordered, but I'm curious about the 'order' they're displayed in, as it seems to be consistent. They seem to be out-of-order in the same way every time:

>>> set_1 = set([5, 2, 7, 2, 1, 88])
>>> set_2 = set([5, 2, 7, 2, 1, 88])
>>> set_1
set([88, 1, 2, 5, 7])
>>> set_2
set([88, 1, 2, 5, 7])

...and another example:

>>> set_3 = set('abracadabra')
>>> set_4 = set('abracadabra')
>>> set_3
set(['a', 'r', 'b', 'c', 'd'])
>>> set_4
set(['a', 'r', 'b', 'c', 'd'])

I'm just curious why this would be. Any help?

萌面超妹 2025-01-22 11:39:05

You should watch this video (it is CPython¹-specific and about dictionaries, but I assume it applies to sets as well).

Basically, Python hashes each element, takes the last N bits of the hash (where N is determined by the size of the set), and uses those bits as an array index to place the object in memory. The objects are then yielded in the order they sit in memory. Of course, the picture gets a little more complicated when you need to resolve collisions between hashes, but that's the gist of it.

Also note that the order in which they are printed can depend on the order in which you put them in (due to collisions). So, if you reorder the list you pass to set_2, you might get a different order out if there are key collisions.

For example:

list1 = [8, 16, 24]
set(list1)        # set([8, 16, 24])
list2 = [24, 16, 8]
set(list2)        # set([24, 16, 8])

Note that the order being preserved in these sets is a "coincidence" that has to do with collision resolution (which I don't know anything about). The point is that the last 3 bits of hash(8), hash(16) and hash(24) are the same. Because they are the same, collision resolution takes over and puts the later elements in "backup" memory locations instead of their first (best) choice, so whether 8 or 16 occupies a given location is determined by which one arrived at the party first and took the "best seat".

If we repeat the example with 1, 2 and 3, you will get a consistent order no matter what order they have in the input list:

list1 = [1, 2, 3]
set(list1)      # set([1, 2, 3])
list2 = [3, 2, 1]
set(list2)      # set([1, 2, 3])

This is because the last 3 bits of hash(1), hash(2) and hash(3) are all different, so no collisions occur.
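
The "best seat" idea can be tried out with a toy table. This is only a sketch under stated assumptions: `toy_set_order` is a hypothetical helper, the table has a fixed size of 8, and collisions are resolved by simply stepping to the next slot (linear probing), which is a simplification and not necessarily what CPython does.

```python
def toy_set_order(values, table_size=8):
    """Return the iteration order of a toy open-addressing hash table.

    Assumptions (not CPython's real algorithm): table_size is a power of
    two, there are fewer distinct values than slots, and collisions are
    resolved by linear probing.
    """
    slots = [None] * table_size
    mask = table_size - 1  # keeps only the low bits of the hash
    for value in values:
        index = hash(value) & mask
        # Step forward until we find the value itself or a free slot.
        while slots[index] is not None and slots[index] != value:
            index = (index + 1) & mask
        slots[index] = value
    # Iteration simply walks the array and skips the empty slots.
    return [v for v in slots if v is not None]


print(toy_set_order([8, 16, 24]))  # [8, 16, 24]
print(toy_set_order([24, 16, 8]))  # [24, 16, 8]
print(toy_set_order([1, 2, 3]))    # [1, 2, 3]
print(toy_set_order([3, 2, 1]))    # [1, 2, 3]
```

In this toy model 8, 16 and 24 all want slot 0, so whichever is inserted first gets it, while 1, 2 and 3 each have their own slot and always come out in the same order.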

¹Note: The implementation described here applies to CPython dict and set. I think the general description is valid for all modern versions of CPython up to 3.6. However, starting with CPython 3.6, an additional implementation detail preserves insertion order when iterating over a dict. It appears that sets still do not have this property. The data structure is described in this blog post by the PyPy folks (who started using it before the CPython folks). The original idea (at least for the Python ecosystem) is archived on the python-dev mailing list.
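
If what you actually want is "unique elements in first-seen order", the insertion-ordered dict mentioned in the footnote can stand in for a set. A minimal sketch, assuming Python 3.7 or later, where dict insertion order is guaranteed by the language:

```python
values = [5, 2, 7, 2, 1, 88]

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the order of first appearance.
unique_in_order = list(dict.fromkeys(values))

print(unique_in_order)  # [5, 2, 7, 1, 88]
```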

卷耳 2025-01-22 11:39:05

The reason for this behavior is that Python uses hash tables to implement dictionaries: https://en.wikipedia.org/wiki/Hash_table#Open_addressing

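The position of a key is determined by its hash, which in CPython is derived from the memory address for objects that do not define their own __hash__. Note that Python reuses memory for some objects but not for others: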

>>> a = 'Hello world'
>>> id(a)
140058096568768
>>> a = 'Hello world'
>>> id(a)
140058096568480

You can see that the object a has a different address each time it is initialized.

But for small integers the address doesn't change:

>>> a = 1
>>> id(a)
40060856
>>> a = 1
>>> id(a)
40060856

Even if we create a second variable with a different name, the address is the same:

>>> b = 1
>>> id(b)
40060856

This approach reduces the memory consumed by the Python interpreter.
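
One caveat worth checking for yourself: for built-in strings and integers, the hash depends on the value rather than on the address, so two distinct objects with equal values still land in the same place in a hash table. A small sketch (the strings are built at runtime with `join` so that they are likely to be separate objects):

```python
# Build two equal strings at runtime so they are separate objects.
a = "".join(["Hello", " world"])
b = "".join(["Hello", " world"])

print(id(a) == id(b))      # typically False: two different objects
print(a == b)              # True: same value
print(hash(a) == hash(b))  # True: equal values must hash equally
print(hash(1) == 1)        # True in CPython: small ints hash to themselves
```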

樱桃奶球 2025-01-22 11:39:05

One key thing that's hinted at in mgilson's great answer, but isn't mentioned explicitly in any of the existing answers:

Small integers hash to themselves:

>>> [hash(x) for x in (1, 2, 3, 88)]
[1, 2, 3, 88]

Strings hash to values that are unpredictable. In fact, from 3.3 on, by default, they're built off a seed that's randomized at startup. So you'll get different results in each new Python interpreter session, but for example:

>>> [hash(x) for x in 'abcz']
[6014072853767888837,
 8680706751544317651,
 -7529624133683586553,
 -1982255696180680242]

So, consider the simplest possible hash table implementation: just an array of N elements, where inserting a value means putting it in hash(value) % N (assuming no collisions). And you can make a rough guess at how large N will be—it's going to be a little bigger than the number of elements in it. When creating a set from a sequence of 6 elements, N could easily be, say, 8.

What happens when you store those 5 numbers with N=8? Well, hash(1) % 8, hash(2) % 8, etc. are just the numbers themselves, but hash(88) % 8 is 0. So, the hash table's array ends up holding 88, 1, 2, NULL, NULL, 5, NULL, 7. So it should be easy to figure out why iterating the set might give you 88, 1, 2, 5, 7.
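
That layout can be reproduced with a few lines. This is a sketch of the simplest-possible table described above, not of CPython's real set; `simple_slots` is a hypothetical helper, N is assumed to be 8, and collisions are ignored (a later value would just overwrite an earlier one).

```python
def simple_slots(values, n=8):
    """Place each value at index hash(value) % n, assuming no collisions."""
    slots = [None] * n
    for value in values:
        slots[hash(value) % n] = value
    return slots


slots = simple_slots([5, 2, 7, 2, 1, 88])
print(slots)  # [88, 1, 2, None, None, 5, None, 7]

# Iterating the "set" walks the array and skips the empty slots.
print([v for v in slots if v is not None])  # [88, 1, 2, 5, 7]
```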

Of course, Python doesn't guarantee that you'll get this order every time. A small change to the way it guesses the right value for N could mean 88 ends up somewhere different (or ends up colliding with one of the other values). In fact, running CPython 3.7 on my Mac, I get 1, 2, 5, 7, 88.

Meanwhile, when you build a hash table from a sequence of size 11 and then insert values with randomized hashes into it, what happens? Even assuming the simplest implementation, and assuming there are no collisions, you still have no idea what order you're going to get. It will be consistent within a single run of the Python interpreter, but different the next time you start it up. (Unless you set PYTHONHASHSEED to 0, or to some other int value.) Which is exactly what you see.
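
The effect of PYTHONHASHSEED can be checked by launching fresh interpreters with a fixed seed. A rough sketch, assuming Python 3.7 or later; `run_with_seed` and `SNIPPET` are names made up for this example:

```python
import os
import subprocess
import sys

# Code executed in each child interpreter: print the set's iteration order.
SNIPPET = "print(''.join(set('abracadabra')))"


def run_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with PYTHONHASHSEED set to seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


first = run_with_seed(0)
second = run_with_seed(0)
print(first, second, first == second)  # the two orders match for a fixed seed
```

Without a fixed seed, repeated runs will often print the five letters in different orders.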

Of course it's worth looking at the way sets are actually implemented rather than guessing. But what you'd guess based on the assumption of the simplest hash table implementation is (barring collisions and barring expansion of the hash table) exactly what happens.
