Why is Python's dict implemented as a hash table, but std::map based on a tree?
Why does one language use a tree and the other a hash table for a seemingly similar data structure?

C++'s map vs Python's dict

A related question is about the performance of hash tables. Please comment on my understanding of hash tables below.

A tree is guaranteed to be O(log n), whereas a hash table has no such guarantee unless the inputs are known in advance, because of possible collisions. I tend to think a hash table's performance gets close to O(n) as the problem size grows, because I haven't heard of a hash function that dynamically adjusts its table size as the problem grows. Hence, a hash table is only useful for a certain range of problem sizes, and that's why most databases use trees rather than hash tables.
4 Answers
The new C++ standard has the std::unordered_map type, which is a hash table. IIRC they wanted to get it into the previous standard as well, but there was not enough time during the discussions, so it was left out. However, most popular compilers have provided it in one way or another for years. In other words, don't worry about it too much. Use the proper data structure for the task at hand.

As for your understanding of hash tables, it's inaccurate: all serious hash table implementations dynamically adjust themselves to growing input by allocating a larger array and re-hashing all the keys. Although this operation is expensive, if it is designed properly (so that it happens rarely enough), the performance is still amortized O(1).
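A minimal sketch of what that looks like, assuming separate chaining and an illustrative 0.75 load-factor threshold (real libraries choose their own thresholds and growth factors): the table doubles its array and re-hashes every key whenever it gets too full, so any single insert may be slow, but the average cost per insert stays O(1).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy string -> int hash table using separate chaining. When the load
// factor crosses max_load, it allocates a larger array and re-hashes every
// key; doubling the capacity keeps that work amortized O(1) per insert.
class ToyHashMap {
    using Bucket = std::vector<std::pair<std::string, int>>;
    std::vector<Bucket> buckets_ = std::vector<Bucket>(8); // start small
    std::size_t size_ = 0;
    static constexpr double max_load = 0.75; // illustrative threshold

    static std::size_t index(const std::string& key, std::size_t n) {
        return std::hash<std::string>{}(key) % n;
    }

    void rehash(std::size_t new_count) {
        std::vector<Bucket> bigger(new_count);
        for (auto& bucket : buckets_)
            for (auto& kv : bucket)
                bigger[index(kv.first, new_count)].push_back(std::move(kv));
        buckets_ = std::move(bigger);
    }

public:
    void insert(const std::string& key, int value) {
        if (size_ + 1 > max_load * buckets_.size())
            rehash(buckets_.size() * 2); // expensive, but rare
        Bucket& bucket = buckets_[index(key, buckets_.size())];
        for (auto& kv : bucket)
            if (kv.first == key) { kv.second = value; return; }
        bucket.emplace_back(key, value);
        ++size_;
    }

    const int* find(const std::string& key) const {
        const Bucket& bucket = buckets_[index(key, buckets_.size())];
        for (const auto& kv : bucket)
            if (kv.first == key) return &kv.second;
        return nullptr; // absent
    }
};

int main() {
    ToyHashMap m;
    for (int i = 0; i < 10000; ++i)
        m.insert("key" + std::to_string(i), i); // triggers many rehashes
    assert(m.find("key9999") && *m.find("key9999") == 9999);
}
```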
Your understanding of hash tables (and who uses them) is flawed.
The problem is, hash table is a rather vague term. Under the hood there are many implementations... but first let's talk about the use of BSTs (Binary Search Trees).

C++ is designed by committee. There are many possible implementations of hash tables, leading to widely different characteristics, while the most popular implementations of BSTs (Red-Black Trees and AVL Trees) have nearly identical characteristics. Therefore, they did not reject hash tables outright; they just could not settle on the characteristics to choose and the details to expose to the user. See James Kanze's comment: the proposal arrived too late, and James asks an interesting question as to why Stepanov did not propose it first. I still suspect the number of choices to be the culprit.
First of all, let's settle on a database system. I'll pick Oracle because it's both widely documented and fairly typical of SQL databases. Oracle offers two types of indexes: Bitmap and Search Trees.

Note: they do not use BINARY Search Trees, but instead B+Trees, which are much more IO- and cache-friendly.
There is a fundamental difference between a Hash Table and a Search Tree: the latter is sorted. Many database operations imply sorting, for example:

- ORDER BY clauses
- range queries (WHERE x BETWEEN a AND b, or comparisons with < and >)
- GROUP BY and DISTINCT, which are often implemented via sorting
- merge joins, which require both inputs in sorted order

In all those cases, a Hash Table is useless.
Furthermore, databases need to juggle huge datasets (in general), which means they need to organize their data to minimize IO (disk reads/writes). Here, the sorted nature of a Search Tree means that (in the index) elements likely to be accessed together (because they share much) will also be grouped together, instead of being scattered to the four corners of the disk.
Finally, internally Oracle may use Hash Tables in its execution plan. When you perform an operation that requires the intersection of two sets of rows, the optimization engine may decide that storing the (temporary) sets in Hash Tables is the fastest way to go.
Now, regarding performance.
Indeed, the performance of Search Trees is generally well-known and easy to understand: O(log N) is nice and tidy.

On the other hand, as I said, there are many different Hash Table implementations possible, as well as strategies to handle both growth and shrinkage... definitely more complicated.

As a simple example, here are two structures a Hash Table may use:

- open addressing: all entries live directly in one array, and collisions are resolved by probing for another free slot
- separate chaining: the array holds buckets, and colliding entries go into the same bucket

Those two strategies have extremely different characteristics, and the latter's characteristics also depend on the bucket implementation (the easy implementation is to use a simple linked list).
But even if you pick an implementation, its performance depends on the hash function's distribution, which varies with the input sequence itself!
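To make that last point concrete, here is a deliberately pathological sketch: giving std::unordered_map a hash function that sends every key to the same bucket turns every operation into a linear scan, which is exactly the O(n) degradation the question worried about. TerribleHash is of course a made-up worst case, not something any real library ships.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// A pathological hash: every key lands in bucket 0, so the table
// degenerates into one long chain and each operation is a linear scan.
struct TerribleHash {
    std::size_t operator()(const std::string&) const { return 0; }
};

int main() {
    std::unordered_map<std::string, int, TerribleHash> m;
    for (int i = 0; i < 10000; ++i)
        m["key" + std::to_string(i)] = i;   // each insert re-scans the chain
    return m.at("key9999") == 9999 ? 0 : 1; // O(n) lookup despite the label
}
```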
My personal advice? To pick between unordered_map and map in C++, I simply ask myself whether I need sorted elements or not. If I need them sorted, I use a map; otherwise, I use an unordered_map. Most of the time the performance is just as good anyway, so it's really about semantics.
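A small sketch of that decision rule (the container contents below are made up for illustration): iterating a std::map visits keys in sorted order, while std::unordered_map makes no ordering promise at all.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

int main() {
    std::map<std::string, int> sorted{{"pear", 3}, {"apple", 1}, {"mango", 2}};
    std::unordered_map<std::string, int> hashed(sorted.begin(), sorted.end());

    // std::map iterates in key order: apple, mango, pear.
    for (const auto& [key, value] : sorted)
        std::cout << key << '=' << value << ' ';
    std::cout << '\n';

    // std::unordered_map iterates in an unspecified, hash-dependent order.
    for (const auto& [key, value] : hashed)
        std::cout << key << '=' << value << ' ';
    std::cout << '\n';
}
```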
It's a more or less arbitrary choice of the language designers. In the case of C++, I suspect (but don't know for sure) that the motivation was the desire to define strict upper limits on complexity: designing a good hash function isn't trivial, and a hash table with a poor hash function performs very poorly. Another issue that might have been considered is the fact that there is an established operator for ordering (<); there is nothing similar for hashing.

In the case of Python (and many other languages), a lot of the time the keys will be a built-in type, like str (std::string was not available when the STL was being defined), so you can ensure an adequate hash function. And when everything is an object and inherits from a common base class, you can easily define a standard interface for hash by defining a (virtual) function in the universal base class.

Finally, the C++ solution depends on a single function/operator; a hash table requires two (the hash function and equality), which must be compatible, which is more error-prone. A common error in Java is to define equals but not hashCode; I suspect there are Python classes which make the same mistake (defining __cmp__ or __eq__ but not __hash__). Of course, seeing the number of times people mess up the < operator in C++, I'm not sure that it's that safe either :-).
Python hash tables are never more than 2/3 full. They resize as they grow (starting at size 8, quadrupling in size until 50000 entries, and doubling thereafter). This gives them amortized O(1) insertion, deletion, and lookup. Excess collisions are possible but rare.
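A tiny sketch of that growth rule, using the thresholds quoted above (they are CPython implementation details that may differ between versions; the function below is illustrative, not CPython's actual code):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the growth policy described above: resize once the table is
// more than 2/3 full, quadrupling while small and doubling afterwards.
// The 2/3, 4x, 2x, and 50000 figures are quoted from the answer.
std::size_t next_capacity(std::size_t used, std::size_t capacity) {
    if (3 * used <= 2 * capacity) return capacity; // still under 2/3 full
    return capacity * (used < 50000 ? 4 : 2);
}

int main() {
    assert(next_capacity(5, 8) == 8);              // 5/8 <= 2/3: no resize yet
    assert(next_capacity(6, 8) == 32);             // past 2/3, small: quadruple
    assert(next_capacity(60000, 65536) == 131072); // large table: double
}
```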