Which is faster: a hash lookup or a binary search?
When given a static set of objects (static in the sense that once loaded it seldom if ever changes) into which repeated concurrent lookups are needed with optimal performance, which is better, a HashMap
or an array with a binary search using some custom comparator?
Is the answer a function of object or struct type? Hash and/or Equal function performance? Hash uniqueness? List size? Hashset
size/set size?
The size of the set that I'm looking at can be anywhere from 500k to 10m - in case that information is useful.
While I'm looking for a C# answer, I think the true mathematical answer lies not in the language, so I'm not including that tag. However, if there are C# specific things to be aware of, that information is desired.
17 Answers
It depends on how you handle duplicates in your hash table (if at all). If you do want to allow duplicate hash keys (no hash function is perfect), the primary-key lookup remains O(1), but the subsequent search for the "right" value may be costly. The answer is then that, theoretically, hashes are faster most of the time. YMMV depending on which data you put in there...
Here it's described how hashes are built, and because the universe of keys is reasonably big and hash functions are built to be "very injective" so that collisions rarely happen, the access time for a hash table is not actually O(1); it's based on probabilities.
But it is reasonable to say that the access time of a hash is almost always less than O(log_2(n)).
This is more a comment on Bill's answer, because his answer has so many upvotes even though it's wrong. So I had to post this.
I see lots of discussion about what the worst-case complexity of a lookup in a hashtable is, and what is considered amortized analysis / what is not.
Please check the link below:
Hash table runtime complexity (insert, search and delete)
The worst-case complexity is O(n), not O(1), as opposed to what Bill says. Thus his O(1) complexity is not amortized, since that analysis can only be used for worst cases (his own Wikipedia link says so as well):
https://en.wikipedia.org/wiki/Hash_table
https://en.wikipedia.org/wiki/Amortized_analysis
Of course, a hash is fastest for such a big dataset.
One way to speed it up even more, since the data seldom changes, is to programmatically generate ad-hoc code that does the first layer of the search as a giant switch statement (if your compiler can handle it), and then branch off to search the resulting bucket.
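A rough sketch of what such generated code might look like, assuming string keys; the bucket selector (low 4 bits of the first character) and the sample data are made up for this illustration:

```csharp
using System;
using System.Linq;

// Hand-written stand-in for what generated first-layer dispatch code might look
// like. The bucket selector and the sample data are invented for the sketch;
// a code generator would emit one case per bucket.
static class SwitchDispatchLookup
{
    static readonly string[][] Buckets = Enumerable.Range(0, 16)
        .Select(b => new[] { "alpha", "bravo", "charlie", "delta" }
            .Where(s => (s[0] & 0x0F) == b)
            .OrderBy(s => s, StringComparer.Ordinal)
            .ToArray())
        .ToArray();

    public static bool Contains(string key)
    {
        switch (key[0] & 0x0F)   // first layer: the JIT can turn this into a jump table
        {
            case 1:  return Array.BinarySearch(Buckets[1], key, StringComparer.Ordinal) >= 0;  // "alpha"
            case 2:  return Array.BinarySearch(Buckets[2], key, StringComparer.Ordinal) >= 0;  // "bravo"
            // ...a code generator would emit the remaining cases...
            default: return Array.BinarySearch(Buckets[key[0] & 0x0F], key, StringComparer.Ordinal) >= 0;
        }
    }
}
```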
It depends. Let's assume the number of elements 'n' is very large. If you are good at writing a hash function with fewer collisions, then hashing is best.
Note that
the hash function is executed only once per search, and it directs you to the corresponding bucket. So it is not a big overhead if n is high.
Problem with hash tables:
The problem with hash tables is that if the hash function is not good (more collisions happen), then the search isn't O(1). It tends towards O(n), because searching within a bucket is a linear search. That can be worse than a binary tree.
Problem with binary trees:
In a binary tree, if the tree isn't balanced, it also tends towards O(n). For example, if you insert 1, 2, 3, 4, 5 into a binary tree, it will more likely end up as a list (see the sketch below).
So,
if you can come up with a good hashing methodology, use a hashtable;
if not, you are better off using a binary tree.
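To illustrate the degenerate-tree point above, a minimal sketch (the BST here is deliberately naive and non-balancing, written just for this example):

```csharp
using System;

// A deliberately naive, non-balancing BST, written only to show the degenerate case.
sealed class NaiveBstNode
{
    public int Key;
    public NaiveBstNode? Left, Right;

    public static NaiveBstNode Insert(NaiveBstNode? node, int key)
    {
        if (node is null) return new NaiveBstNode { Key = key };
        if (key < node.Key) node.Left = Insert(node.Left, key);
        else node.Right = Insert(node.Right, key);
        return node;
    }

    public static int Depth(NaiveBstNode? node) =>
        node is null ? 0 : 1 + Math.Max(Depth(node.Left), Depth(node.Right));
}

static class DegenerateTreeDemo
{
    public static void Run()
    {
        NaiveBstNode? root = null;
        foreach (var k in new[] { 1, 2, 3, 4, 5 })     // sorted insertion order
            root = NaiveBstNode.Insert(root, k);

        // Every node hangs off the right child, so the depth equals the element
        // count and lookups degrade to a linear scan, exactly like a list.
        Console.WriteLine(NaiveBstNode.Depth(root));   // prints 5
    }
}
```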
For very small collections the difference is going to be negligible. At the low end of your range (500k items) you will start to see a difference if you're doing lots of lookups. A binary search is going to be O(log n), whereas a hash lookup will be O(1), amortized. That's not the same as truly constant, but you would still have to have a pretty terrible hash function to get worse performance than a binary search.
(When I say "terrible hash", I mean something like:
Yeah, it's blazing fast itself, but causes your hash map to become a linked list.)
ialiashkevich wrote some C# code using an array and a Dictionary to compare the two methods, but it used Long values for keys. I wanted to test something that would actually execute a hash function during the lookup, so I modified that code. I changed it to use String values, and I refactored the populate and lookup sections into their own methods so it's easier to see in a profiler. I also left in the code that used Long values, just as a point of comparison. Finally, I got rid of the custom binary search function and used the one in the Array class.
Here's that code:
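(What follows is a minimal sketch along those lines rather than the exact listing from the answer: String keys, separate populate and lookup methods, Stopwatch timing, and Array.BinarySearch.)

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class HashVsBinarySearchBenchmark
{
    static void Main()
    {
        const int size = 10_000_000;

        // Build the same string keys for both structures.
        var keys = new string[size];
        for (int i = 0; i < size; i++)
            keys[i] = "key_" + i.ToString("D8");

        var array = PopulateArray(keys);
        var dict  = PopulateDictionary(keys);

        Console.WriteLine($"Array.BinarySearch: {TimeArrayLookups(array, keys)} ms");
        Console.WriteLine($"Dictionary lookup:  {TimeDictionaryLookups(dict, keys)} ms");
    }

    static string[] PopulateArray(string[] keys)
    {
        var array = (string[])keys.Clone();
        Array.Sort(array, StringComparer.Ordinal);      // binary search needs sorted data
        return array;
    }

    static Dictionary<string, bool> PopulateDictionary(string[] keys)
    {
        var dict = new Dictionary<string, bool>(keys.Length);
        foreach (var key in keys) dict[key] = true;
        return dict;
    }

    static long TimeArrayLookups(string[] array, string[] keys)
    {
        var sw = Stopwatch.StartNew();
        foreach (var key in keys)
            Array.BinarySearch(array, key, StringComparer.Ordinal);
        return sw.ElapsedMilliseconds;
    }

    static long TimeDictionaryLookups(Dictionary<string, bool> dict, string[] keys)
    {
        var sw = Stopwatch.StartNew();
        foreach (var key in keys)
            dict.ContainsKey(key);
        return sw.ElapsedMilliseconds;
    }
}
```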
Here are the results with several different sizes of collections. (Times are in milliseconds.)
And for comparison, here's the profiler output for the last run of the program (10 million records and lookups). I highlighted the relevant functions. They pretty closely agree with the Stopwatch timing metrics above.
You can see that the Dictionary lookups are much faster than binary search, and (as expected) the difference is more pronounced the larger the collection. So, if you have a reasonable hashing function (fairly quick with few collisions), a hash lookup should beat binary search for collections in this range.
The answers by Bobby, Bill and Corbin are wrong. O(1) is not faster than O(log n) for a fixed/bounded n:
log(n) is constant, so it depends on the constant time.
And for a slow hash function, ever heard of md5?
The default string hashing algorithm probably touches all characters, and can be easily 100 times slower than the average compare for long string keys. Been there, done that.
You might be able to (partially) use a radix. If you can split the set into 256 blocks of approximately the same size, you're looking at a 2k to 40k binary search. That is likely to provide much better performance.
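A minimal sketch of the radix idea, assuming string keys bucketed on their first character (the bucketing scheme and names are illustrative, not from the answer):

```csharp
using System;
using System.Linq;

// Partition an ordinally sorted string array into 256 contiguous ranges keyed on
// the first character (values >= 255 collapse into the last bucket), then binary
// search only within the matching range.
sealed class RadixBucketedSearch
{
    private readonly string[] _sorted;
    private readonly int[] _start = new int[257];   // range of bucket b is [_start[b], _start[b+1])

    public RadixBucketedSearch(string[] keys)
    {
        _sorted = keys.OrderBy(k => k, StringComparer.Ordinal).ToArray();

        foreach (var k in _sorted) _start[Bucket(k) + 1]++;        // count per-bucket sizes
        for (int b = 1; b <= 256; b++) _start[b] += _start[b - 1]; // turn counts into offsets
    }

    public bool Contains(string key)
    {
        int b = Bucket(key);
        return Array.BinarySearch(_sorted, _start[b], _start[b + 1] - _start[b],
                                  key, StringComparer.Ordinal) >= 0;
    }

    private static int Bucket(string s) => s.Length == 0 ? 0 : Math.Min((int)s[0], 255);
}
```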
[Edit]
Too many people voting down what they do not understand.
String compares for binary searching sorted sets have a very interesting property: they get slower the closer they get to the target. First they will break on the first character, in the end only on the last. Assuming a constant time for them is incorrect.
The only reasonable answer to this question is: It depends. It depends on the size of your data, the shape of your data, your hash implementation, your binary search implementation, and where your data lives (even though it's not mentioned in the question). A couple other answers say as much, so I could just delete this. However, it might be nice to share what I've learned from feedback to my original answer.
Given the comments, you might assume that people who use hash tables are deranged. Are hash tables reckless and dangerous? Are these people insane?
Turns out they're not. Just as binary trees are good at certain things (in-order data traversal, storage efficiency), hash tables have their moment to shine as well. In particular, they can be very good at reducing the number of reads required to fetch your data. A hash algorithm can generate a location and jump straight to it in memory or on disk while binary search reads data during each comparison to decide what to read next. Each read has the potential for a cache miss which is an order of magnitude (or more) slower than a CPU instruction.
That's not to say hash tables are better than binary search. They're not. It's also not to suggest that all hash and binary search implementations are the same. They're not. If I have a point, it's this: both approaches exist for a reason. It's up to you to decide which is best for your needs.
Original answer:
Ok, I'll try to be short.
C# short answer:
Test the two different approaches.
.NET gives you the tools to change your approach with a line of code.
Otherwise use System.Collections.Generic.Dictionary and be sure to initialize it with a large number as the initial capacity, or you'll spend the rest of your life inserting items because of the work the GC has to do collecting old bucket arrays.
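For example (the value type and capacity here are placeholders for your own):

```csharp
using System.Collections.Generic;

// Pre-sizing avoids repeated bucket-array growth (and the GC work of collecting
// the old arrays) while millions of items are inserted.
var lookup = new Dictionary<string, int>(capacity: 10_000_000);
```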
Longer answer:
A hashtable has ALMOST constant lookup times, and getting to an item in a hash table in the real world does not just require computing a hash.
To get to an item, your hashtable will do something like this:
1. Compute the hash of the key.
2. Figure out which bucket that hash belongs to.
3. Traverse the chain of items (basically a list of items that share the same bucket; most hashtables use this method of handling bucket/hash collisions) that starts at that bucket, and compare each key with the one of the item you are trying to add/delete/update/check if contained.
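A toy sketch of those steps, using a chained table (key/value types and sizing are arbitrary here):

```csharp
using System.Collections.Generic;

// Toy chained hash table showing the three steps above; not production code.
sealed class ChainedTable
{
    private readonly List<KeyValuePair<string, int>>[] _buckets;

    public ChainedTable(int bucketCount) =>
        _buckets = new List<KeyValuePair<string, int>>[bucketCount];

    public bool TryGetValue(string key, out int value)
    {
        int hash   = key.GetHashCode() & 0x7FFFFFFF;   // 1. compute the hash
        int bucket = hash % _buckets.Length;            // 2. find the bucket index

        var chain = _buckets[bucket];
        if (chain != null)
            foreach (var pair in chain)                  // 3. walk the collision chain
                if (pair.Key == key) { value = pair.Value; return true; }

        value = default;
        return false;
    }

    public void Add(string key, int value)
    {
        int bucket = (key.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;
        (_buckets[bucket] ??= new List<KeyValuePair<string, int>>())
            .Add(new KeyValuePair<string, int>(key, value));
    }
}
```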
Lookup times depend on how "good" (how sparse the output is) and how fast your hash function is, on the number of buckets you are using, and on how fast the key comparer is; it's not always the best solution.
A better and deeper explanation: http://en.wikipedia.org/wiki/Hash_table
Hashes are typically faster, although binary searches have better worst-case characteristics. A hash access is typically a calculation to get a hash value to determine which "bucket" a record will be in, and so the performance will generally depend on how evenly the records are distributed, and the method used to search the bucket. A bad hash function (leaving a few buckets with a whole lot of records) with a linear search through the buckets will result in a slow search. (On the third hand, if you're reading a disk rather than memory, the hash buckets are likely to be contiguous while the binary tree pretty much guarantees non-local access.)
If you want generally fast, use the hash. If you really want guaranteed bounded performance, you might go with the binary tree.
If your set of objects is truly static and unchanging, you can use a perfect hash to get O(1) performance guaranteed. I've seen gperf mentioned a few times, though I've never had occasion to use it myself.
Surprised nobody mentioned Cuckoo hashing, which provides guaranteed O(1) and, unlike perfect hashing, is capable of using all of the memory it allocates, whereas perfect hashing can end up with guaranteed O(1) but wasting the greater portion of its allocation. The caveat? Insertion time can be very slow, especially as the number of elements increases, since all of the optimization is performed during the insertion phase.
I believe some version of this is used in router hardware for ip lookups.
See link text
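For reference, a minimal illustrative cuckoo table; the two hash functions below are toy ones derived from GetHashCode, and a real implementation would use independent hash functions and rehash on an insertion cycle rather than throwing:

```csharp
using System;

// Simplified cuckoo hash table: each key can live in one of two slots, so a
// lookup probes at most two positions. Illustrative only.
sealed class CuckooSet
{
    private readonly string?[] _t1;
    private readonly string?[] _t2;

    public CuckooSet(int capacityPerTable)
    {
        _t1 = new string?[capacityPerTable];
        _t2 = new string?[capacityPerTable];
    }

    private int H1(string key) => (key.GetHashCode() & 0x7FFFFFFF) % _t1.Length;
    private int H2(string key) => ((key.GetHashCode() * 31 + 17) & 0x7FFFFFFF) % _t2.Length;

    // Guaranteed O(1): at most two slots are ever inspected.
    public bool Contains(string key) => key.Equals(_t1[H1(key)]) || key.Equals(_t2[H2(key)]);

    public void Insert(string key)
    {
        if (Contains(key)) return;
        string current = key;
        bool useFirst = true;
        for (int kicks = 0; kicks < 64; kicks++)       // bounded displacement chain
        {
            var table = useFirst ? _t1 : _t2;
            int slot  = useFirst ? H1(current) : H2(current);

            string? evicted = table[slot];
            table[slot] = current;                      // place the item, evicting any occupant
            if (evicted is null) return;                // the slot was free, we're done

            current  = evicted;                         // the evicted item moves to its other table
            useFirst = !useFirst;
        }
        throw new InvalidOperationException("Displacement cycle; a real table would rehash.");
    }
}
```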
Dictionary/Hashtable uses more memory and takes more time to populate compared to an array.
But searching is faster with the Dictionary than with a binary search within the array.
Here are the numbers for 10 million Int64 items to populate and search,
plus sample code you can run yourself.
Dictionary Memory: 462,836
Array Memory: 88,376
Populate Dictionary: 402
Populate Array: 23
Search Dictionary: 176
Search Array: 680
I strongly suspect that in a problem set of size ~1M, hashing would be faster.
Just for the numbers:
a binary search would require ~ 20 compares (2^20 == 1M)
a hash lookup would require 1 hash calculation on the search key, and possibly a handful of compares afterwards to resolve possible collisions
Edit: the numbers:
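(A minimal sketch of that kind of micro-test; the iteration count is chosen arbitrarily here and this is not the original code:)

```csharp
using System;
using System.Diagnostics;

class HashVsCompareMicroTest
{
    static void Main()
    {
        const int iterations = 1_000_000;
        string c = "abcde", d = "rwerij";

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            c.GetHashCode();                     // cost of hashing the search key
        Console.WriteLine($"hashcode: {sw.Elapsed.TotalSeconds} seconds");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            string.CompareOrdinal(c, d);         // cost of one compare in a binary search
        Console.WriteLine($"compare:  {sw.Elapsed.TotalSeconds} seconds");
    }
}
```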
times for c = "abcde", d = "rwerij": hashcode: 0.0012 seconds; compare: 2.4 seconds.
disclaimer: Actually benchmarking a hash lookup versus a binary lookup might be better than this not-entirely-relevant test. I'm not even sure whether GetHashCode gets memoized under the hood.
I'd say it depends mainly on the performance of the hash and compare methods. For example, when using string keys that are very long but random, a compare will always yield a very quick result, but a default hash function will process the entire string.
But in most cases the hash map should be faster.
I wonder why no one mentioned perfect hashing.
It's only relevant if your dataset is fixed for a long time, but what it does is analyze the data and construct a perfect hash function that ensures no collisions.
Pretty neat, if your data set is constant and the time to calculate the function is small compared to the application run time.
This question is more complicated than the scope of pure algorithmic performance. If we remove the factor that the binary search algorithm is more cache-friendly, the hash lookup is faster in the general sense. The best way to figure it out is to build a program and disable the compiler optimization options; we would find that the hash lookup is faster, given that its algorithmic time efficiency is O(1) in the general sense.
But when you enable compiler optimization and try the same test with a smaller count of samples, say less than 10,000, the binary search outperforms the hash lookup by taking advantage of its cache-friendly data structure.