What's up with O(1)?
I have been noticing some very strange usage of O(1) in discussions of algorithms involving hashing and types of search, often in the context of using a dictionary type provided by the language system, or using dictionary or hash-array types that are used with array-index notation.

Basically, O(1) means bounded by a constant time and (typically) fixed space. Some pretty fundamental operations are O(1), although using intermediate languages and special VMs tends to distort one's thinking here (e.g., how does one amortize the garbage collector and other dynamic processes over what would otherwise be O(1) activities?).

But ignoring amortization of latencies, garbage collection, and so on, I still don't understand how the leap is made to the assumption that certain techniques that involve some kind of searching can be O(1), except under very special conditions.

Although I have noticed this before, an example just showed up in the Pandincus question, "'Proper' collection to use to obtain items in O(1) time in C# .NET?".
As I remarked there, the only collection I know of that provides O(1) access as a guaranteed bound is a fixed-bound array with an integer index value. The presumption is that the array is implemented by some mapping to random access memory that uses O(1) operations to locate the cell having that index.
For collections that involve some sort of searching to determine the location of a matching cell for a different kind of index (or for a sparse array with integer index), life is not so easy. In particular, if there are collisions and congestion is possible, access is not exactly O(1). And if the collection is flexible, one must recognize and amortize the cost of expanding the underlying structure (such as a tree or a hash table) in order to relieve the congestion (e.g., high collision incidence or tree imbalance).
I would never have thought to speak of these flexible and dynamic structures as O(1). Yet I see them offered up as O(1) solutions without any identification of the conditions that must be maintained to actually have O(1) access be assured (as well as have that constant be negligibly small).
THE QUESTION: All of this preparation is really for a question. What is the casualness around O(1) and why is it accepted so blindly? Is it recognized that even O(1) can be undesirably large, even though near-constant? Or is O(1) simply the appropriation of a computational-complexity notion to informal use? I'm puzzled.
UPDATE: The Answers and comments point out where I was casual about defining O(1) myself, and I have repaired that. I am still looking for good answers, and some of the comment threads are rather more interesting than their answers, in a few cases.
13 Answers
The problem is that people are really sloppy with terminology. There are 3 important but distinct classes here:

O(1) worst-case

This is simple - all operations take no more than a constant amount of time in the worst case, and therefore in all cases. Accessing an element of an array is O(1) worst-case.

O(1) amortized worst-case
Amortized means that not every operation is O(1) in the worst case, but for any sequence of N operations, the total cost of the sequence is no worse than O(N) in the worst case. This means that even though we can't bound the cost of any single operation by a constant, there will always be enough "quick" operations to make up for the "slow" operations such that the running time of the sequence of operations is linear in the number of operations.

For example, the standard dynamic array which doubles its capacity when it fills up requires O(1) amortized time to insert an element at the end, even though some insertions require O(N) time - there are always enough O(1) insertions that inserting N items always takes O(N) time total.
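A minimal sketch of that doubling dynamic array (a toy Python class of my own, not any particular library's implementation): individual appends occasionally cost O(N) when the backing storage is copied, but the total copy work over N appends stays below 2N, which is exactly what makes the per-append cost O(1) amortized.

```python
class DynamicArray:
    """Toy dynamic array that doubles its capacity when full."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.data = [None] * self.capacity
        self.copies = 0                      # bookkeeping: total elements ever copied

    def append(self, item):
        if self.size == self.capacity:       # full: the occasional O(N) step
            self.capacity *= 2
            new_data = [None] * self.capacity
            for i in range(self.size):
                new_data[i] = self.data[i]
                self.copies += 1
            self.data = new_data
        self.data[self.size] = item          # the common O(1) step
        self.size += 1


arr = DynamicArray()
N = 100_000
for i in range(N):
    arr.append(i)
# Total copying stays below 2N, so the amortized cost per append is O(1).
print(arr.copies, "elements copied for", N, "appends")
```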
O(1) average-case

This one is the trickiest. There are two possible definitions of average-case: one for randomized algorithms with fixed inputs, and one for deterministic algorithms with randomized inputs.
For randomized algorithms with fixed inputs, we can calculate the average-case running time for any given input by analyzing the algorithm and determining the probability distribution of all possible running times and taking the average over that distribution (depending on the algorithm, this may or may not be possible due to the Halting Problem).
In the other case, we need a probability distribution over the inputs. For example, if we were to measure a sorting algorithm, one such probability distribution would be the distribution that has all N! possible permutations of the input equally likely. Then, the average-case running time is the average running time over all possible inputs, weighted by the probability of each input.
Since the subject of this question is hash tables, which are deterministic, I'm going to focus on the second definition of average-case. Now, we can't always determine the probability distribution of the inputs because, well, we could be hashing just about anything, and those items could be coming from a user typing them in or from a file system. Therefore, when talking about hash tables, most people just assume that the inputs are well-behaved and the hash function is well behaved such that the hash value of any input is essentially randomly distributed uniformly over the range of possible hash values.
Take a moment and let that last point sink in - the O(1) average-case performance for hash tables comes from assuming all hash values are uniformly distributed. If this assumption is violated (which it usually isn't, but it certainly can and does happen), the running time is no longer O(1) on average.

See also Denial of Service by Algorithmic Complexity. In this paper, the authors discuss how they exploited some weaknesses in the default hash functions used by two versions of Perl to generate large numbers of strings with hash collisions. Armed with this list of strings, they generated a denial-of-service attack on some webservers by feeding them these strings, which resulted in worst-case O(N) behavior in the hash tables used by the webservers.
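To make the last two paragraphs concrete, here is a minimal sketch (the BadKey type is hypothetical, with a deliberately constant hash): once every key hashes to the same value, the uniform-distribution assumption is gone and dict lookups degrade toward O(N).

```python
import timeit

class BadKey:
    """Hypothetical key type that violates the uniform-hashing assumption."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42                              # every key collides
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.n == other.n

for size in (100, 500, 2_000):
    d = {BadKey(i): i for i in range(size)}    # building this is already quadratic
    probe = BadKey(size - 1)
    t = timeit.timeit(lambda: d[probe], number=200)
    print(f"{size:5d} colliding keys: {t:.4f}s for 200 lookups")  # grows with size
```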
My understanding is that O(1) is not necessarily constant; rather, it is not dependent on the variables under consideration. Thus a hash lookup can be said to be O(1) with respect to the number of elements in the hash, but not with respect to the length of the data being hashed or ratio of elements to buckets in the hash.
The other element of confusion is that big O notation describes limiting behavior. Thus, a function f(N) for small values of N may indeed show great variation, but you would still be correct to say it is O(1) if the limit as N approaches infinity is constant with respect to N.
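A rough, machine-dependent sketch of that distinction using Python's built-in dict (tuple keys are used because, unlike str, tuples don't cache their hash in CPython, so each lookup pays the hashing cost): lookup time is roughly flat in the number of entries, but it grows with the size of the key being hashed.

```python
import timeit

# Lookup time vs. number of entries: roughly flat.
small = {i: i for i in range(1_000)}
large = {i: i for i in range(1_000_000)}
print(timeit.timeit(lambda: small[500], number=200_000))
print(timeit.timeit(lambda: large[500_000], number=200_000))   # about the same as the line above

# Lookup time vs. size of the key being hashed: grows with the key.
short_key = tuple(range(10))
long_key = tuple(range(50_000))
d = {short_key: "a", long_key: "b"}
print(timeit.timeit(lambda: d[short_key], number=1_000))
print(timeit.timeit(lambda: d[long_key], number=1_000))        # noticeably slower: hashing is O(len(key))
```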
Just to clarify, these are two separate statements. You can have O(1) in time but O(n) in space or whatever.

O(1) can be impractically HUGE and it's still O(1). It is often neglected that if you know you'll have a very small data set, the constant is more important than the complexity, and for reasonably small data sets it's a balance of the two. An O(n!) algorithm can out-perform an O(1) one if the constants and sizes of the data sets are of the appropriate scale.
O() notation is a measure of the complexity - not the time an algorithm will take, or a pure measure of how "good" a given algorithm is for a given purpose.
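A bit of toy arithmetic for that point (the 1,000,000-step constant is made up purely for illustration): an O(n!) routine with a tiny per-step cost beats a hypothetical "O(1)" routine with a huge fixed cost for small enough n.

```python
import math

O1_COST = 1_000_000                    # hypothetical fixed cost of the "O(1)" routine
for n in range(1, 12):
    brute_cost = math.factorial(n)     # hypothetical cost of the O(n!) routine
    winner = "O(n!)" if brute_cost < O1_COST else "O(1)"
    print(f"n={n:2d}  n! = {brute_cost:>12,d}  cheaper: {winner}")
# The O(n!) routine wins up through n=9 (362,880 < 1,000,000) and loses from n=10 on.
```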
I can see what you're saying, but I think there are a couple of basic assumptions underlying the claim that look-ups in a hash table have a complexity of O(1): that computing the hash itself is cheap, and that keys are spread evenly enough that collisions stay rare.

The worst-case complexity of a hash table look-up is O(n), but that's extremely unlikely given those two assumptions.
A hashtable is a data structure that supports O(1) search and insertion.

A hashtable usually has a key and value pair, where the key is used as the parameter to a function (a hash function, http://en.wikipedia.org/wiki/Hash_function) which will determine the location of the value in its internal data structure, usually an array.

As insertion and search only depend upon the result of the hash function and not on the size of the hashtable nor the number of elements stored, a hashtable has O(1) insertion and search.

There is one caveat, however. That is, as the hashtable becomes more and more full, there will be hash collisions, where the hash function returns an element of the array which is already occupied. This will necessitate a collision resolution in order to find another empty element.

When a hash collision occurs, a search or insertion cannot be performed in O(1) time. However, good collision resolution algorithms can reduce the number of tries to find another suitable empty spot, or increasing the hashtable size can reduce the number of collisions in the first place.

So, in theory, only a hashtable backed by an array with an infinite number of elements and a perfect hash function would be able to achieve O(1) performance, as that is the only way to avoid hash collisions that drive up the number of required operations. Therefore, any finite-sized array will at one time or another perform worse than O(1) due to hash collisions.

Let's take a look at an example. Let's use a hashtable to store the following (key, value) pairs:

(Name, Bob)
(Occupation, Student)
(Location, Earth)

We will implement the hashtable back-end with an array of 100 elements.

The key will be used to determine the element of the array in which to store the (key, value) pair. In order to determine the element, the hash_function will be used:

hash_function("Name") returns 18
hash_function("Occupation") returns 32
hash_function("Location") returns 74

From the above results, we'll assign the (key, value) pairs into those elements of the array.

The insertion only requires the use of the hash function, and does not depend on the size of the hashtable nor its elements, so it can be performed in O(1) time.

Similarly, searching for an element uses the hash function. If we want to look up the key "Name", we'll perform hash_function("Name") to find out in which element of the array the desired value resides.

Also, searching does not depend on the size of the hashtable nor the number of elements stored, therefore it is an O(1) operation.

All is well. Let's try to add an additional entry of ("Pet", "Dog"). However, there is a problem, as hash_function("Pet") returns 18, which is the same hash as the "Name" key.

Therefore, we'll need to resolve this hash collision. Let's suppose that the hash collision resolving function we used found that the new empty element is 29.

Since there was a hash collision in this insertion, our performance was not quite O(1).

This problem will also crop up when we try to search for the "Pet" key, as trying to find the element containing the "Pet" key by performing hash_function("Pet") will always return 18 initially. Once we look up element 18, we'll find the key "Name" rather than "Pet". When we find this inconsistency, we'll need to resolve the collision in order to retrieve the correct element which contains the actual "Pet" key. Resolving a hash collision is an additional operation which makes the hashtable not perform at O(1) time.
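Here is a minimal sketch of that walkthrough in Python. The hard-coded hash values mirror the hypothetical ones from the example (18, 32, 74, and the colliding 18 for "Pet"), and linear probing stands in for whatever collision-resolution function the answer assumes (so the collided entry lands in slot 19 rather than 29); the point is only that the extra probes are what break the pure O(1) bound.

```python
# Toy open-addressing hash table mirroring the walkthrough above.
# FAKE_HASHES holds made-up hash values chosen to force the "Name"/"Pet"
# collision; a real table would derive them from a general-purpose hash.
FAKE_HASHES = {"Name": 18, "Occupation": 32, "Location": 74, "Pet": 18}

class ToyHashTable:
    def __init__(self, capacity=100):
        self.slots = [None] * capacity          # each slot holds (key, value) or None

    def _hash(self, key):
        return FAKE_HASHES.get(key, hash(key)) % len(self.slots)

    def insert(self, key, value):
        i = self._hash(key)                     # O(1) when the home slot is free
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)       # collision: probe for another slot
        self.slots[i] = (key, value)

    def lookup(self, key):
        i = self._hash(key)                     # O(1) when there was no collision
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)       # extra probes are what break "pure" O(1)
        raise KeyError(key)

table = ToyHashTable()
table.insert("Name", "Bob")
table.insert("Occupation", "Student")
table.insert("Location", "Earth")
table.insert("Pet", "Dog")                      # collides with "Name" at slot 18
print(table.lookup("Pet"))                      # needs an extra probe -> "Dog"
```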
I can't speak to the other discussions you've seen, but there is at least one hashing algorithm that is guaranteed to be O(1).
Cuckoo hashing maintains an invariant so that there is no chaining in the hash table. Insertion is amortized O(1), retrieval is always O(1). I've never seen an implementation of it, it's something that was newly discovered when I was in college. For relatively static data sets, it should be a very good O(1), since it calculates two hash functions, performs two lookups, and immediately knows the answer.
Mind you, this is assuming the hash calculation is O(1) as well. You could argue that for length-K strings, any hash is minimally O(K). In reality, you can bound K pretty easily, say K < 1000. O(K) ~= O(1) for K < 1000.
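For the curious, here is a compact sketch of the cuckoo idea (a toy of my own, not a production implementation): two tables and two hash functions, lookups probe at most two slots so retrieval is O(1) in the worst case, and insertions displace existing entries, falling back to a rehash when a cycle is suspected, which is where the amortized O(1) bound comes from.

```python
import random

class CuckooHash:
    """Toy cuckoo hash table: lookups touch at most two slots."""

    def __init__(self, capacity=17):
        self.cap = capacity
        self.tables = [[None] * capacity, [None] * capacity]
        self.seeds = [random.randrange(1 << 30), random.randrange(1 << 30)]

    def _slot(self, which, key):
        return hash((self.seeds[which], key)) % self.cap

    def lookup(self, key):
        for which in (0, 1):                      # at most two probes, always
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def insert(self, key, value):
        # Note: this toy does not update existing keys in place.
        entry = (key, value)
        for _ in range(32):                       # bounded displacement chain
            for which in (0, 1):
                i = self._slot(which, entry[0])
                if self.tables[which][i] is None:
                    self.tables[which][i] = entry
                    return
                # Evict the occupant and try to re-place it in the other table.
                self.tables[which][i], entry = entry, self.tables[which][i]
        self._rehash()                            # suspected cycle: rebuild (amortized cost)
        self.insert(*entry)

    def _rehash(self):
        old = [e for t in self.tables for e in t if e is not None]
        self.cap = self.cap * 2 + 1
        self.tables = [[None] * self.cap, [None] * self.cap]
        self.seeds = [random.randrange(1 << 30), random.randrange(1 << 30)]
        for k, v in old:
            self.insert(k, v)

c = CuckooHash()
for i in range(12):
    c.insert(f"key{i}", i)
print(c.lookup("key7"))    # found in at most two probes
```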
There may be a conceptual error as to how you're understanding Big-Oh notation. What it means is that, given an algorithm and an input data set, the upper bound for the algorithm's run time depends on the value of the O-function when the size of the data set tends to infinity.
When one says that an algorithm takes O(n) time, it means that the runtime for an algorithm's worst case depends linearly on the size of the input set.
When an algorithm takes O(1) time, the only thing it means is that, given a function T(f) which calculates the runtime of a function f(n), there exists a natural positive number k such that T(f) < k for any input n. Essentially, it means that the upper bound for the run time of an algorithm is not dependent on its size, and has a fixed, finite limit.
Now, that does not mean in any way that the limit is small, just that it's independent of the size of the input set. So if I artificially define a bound k for the size of a data set, then its complexity will be O(k) == O(1).
For example, searching for an instance of a value on a linked list is an O(n) operation. But if I say that a list has at most 8 elements, then O(n) becomes O(8) becomes O(1).
In this case, if we used a trie data structure as a dictionary (a tree of characters, where the leaf node contains the value for the string used as the key), and if the key is bounded, then its lookup time can be considered O(1) (if I define a character field as having at most k characters in length, which can be a reasonable assumption for many cases).
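A minimal trie sketch along those lines (a toy Python class of my own, purely to illustrate the bounded-key argument): lookup walks at most k character nodes, so its cost depends on the key length k rather than on how many entries are stored.

```python
_MISSING = object()          # sentinel so that None can be stored as a value

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.value = _MISSING

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:                       # one step per character: O(k)
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def lookup(self, key):
        node = self.root
        for ch in key:                       # again O(k), independent of how many entries exist
            node = node.children.get(ch)
            if node is None:
                raise KeyError(key)
        if node.value is _MISSING:
            raise KeyError(key)
        return node.value

t = Trie()
t.insert("cat", 1)
t.insert("car", 2)
print(t.lookup("car"))   # 2, reached in exactly len("car") steps
```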
For a hash table, as long as you assume that the hashing function is good (randomly distributed) and sufficiently sparse so as to minimize collisions, and rehashing is performed when the data structure is sufficiently dense, you can indeed consider it an O(1) access-time structure.
In conclusion, O(1) time may be overrated for a lot of things. For large data structures the complexity of an adequate hash function may not be trivial, and sufficient corner cases exist where the amount of collisions lead it to behave like an O(n) data structure, and rehashing may become prohibitively expensive. In which case, an O(log(n)) structure like an AVL or a B-tree may be a superior alternative.
In general, I think people use them comparatively without regard to exactness. For example, hash-based data structures are O(1) (average) look-up if designed well and you have a good hash. If everything hashes to a single bucket, then it's O(n). Generally, though, one uses a good algorithm and the keys are reasonably distributed, so it's convenient to talk about it as O(1) without all the qualifications. Likewise with lists, trees, etc. We have in mind certain implementations and it's simply more convenient to talk about them, when discussing generalities, without the qualifications. If, on the other hand, we're discussing specific implementations, then it probably pays to be more precise.
HashTable look-ups are O(1) with respect to the number of items in the table, because no matter how many items you add to the list, the cost of hashing a single item is pretty much the same, and creating the hash will tell you the address of the item.
To answer why this is relevant: the OP asked about why O(1) seemed to be thrown around so casually when in his mind it obviously could not apply in many circumstances. This answer explains that O(1) time really is possible in those circumstances.
Hash table implementations are in practice not "exactly" O(1) in use; if you test one you'll find they average around 1.5 lookups to find a given key across a large dataset (due to the fact that collisions DO occur, and upon colliding, a different location must be assigned).

Also, in practice, HashMaps are backed by arrays with an initial size that is "grown" to double size when it reaches 70% fullness on average, which gives a relatively good addressing space. After 70% fullness, collision rates grow faster.
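A quick simulation sketch supporting those two observations (this is a toy linear-probing table, not a measurement of any real HashMap, and the exact numbers vary by run): the average number of probes per successful lookup sits near 1.5 at moderate load and climbs quickly past roughly 70% fullness.

```python
import random

# Count the average number of probes needed to find a key in a toy
# open-addressing (linear-probing) table at different load factors.
def avg_probes(load_factor, capacity=50_000):
    slots = [None] * capacity
    keys = random.sample(range(10 * capacity), int(capacity * load_factor))
    for k in keys:                       # insert with linear probing
        i = k % capacity
        while slots[i] is not None:
            i = (i + 1) % capacity
        slots[i] = k
    total = 0
    for k in keys:                       # probes per successful lookup
        i, probes = k % capacity, 1
        while slots[i] != k:
            i = (i + 1) % capacity
            probes += 1
        total += probes
    return total / len(keys)

for lf in (0.3, 0.5, 0.7, 0.9):
    print(f"load {lf:.0%}: {avg_probes(lf):.2f} probes on average")
```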
Big O theory states that if you have an O(1) algorithm, or even an O(2) algorithm, the critical factor is the degree of the relation between input-set size and the steps to insert/fetch one of its items. O(2) is still constant time, so we just approximate it as O(1), because it means more or less the same thing.
In reality, there is only 1 way to have a "perfect hashtable" with O(1), and that requires:
(Exception case: if you can compute in advance all the permutations of permitted keys for the system, and your target backing store address space is defined to be the size where it can hold all keys that are permitted, then you can have a perfect hash, but it's a "domain limited" perfection.)
Given a fixed memory allocation, it is not plausible in the least to have this, because it would assume that you have some magical way to pack an infinite amount of data into a fixed amount of space with no loss of data, and that's logistically impossible.
So retrospectively, getting O(1.5) which is still constant time, in a finite amount of memory with even a relatively Naïve hash key generator, I consider pretty damn awesome.
Suffixory note: Note I use O(1.5) and O(2) here. These don't actually exist in big-O. These are merely what people who don't know big-O assume is the rationale.

If something takes 1.5 steps to find a key, or 2 steps to find that key, or 1 step to find that key, but the number of steps never exceeds 2 and whether it takes 1 step or 2 is completely random, then it is still Big-O of O(1). This is because no matter how many items you add to the dataset size, it still maintains the < 2 steps. If for all tables > 500 keys it takes 2 steps, then you can assume those 2 steps are in fact one step with 2 parts, ... which is still O(1).

If you can't make this assumption, then you're not thinking in Big-O terms at all, because then you must use the number which represents the number of finite computational steps required to do everything, and "one step" is meaningless to you. Just get into your head that there is NO direct correlation between Big-O and the number of execution cycles involved.
O(1) means, exactly, that the algorithm's time complexity is bounded by a fixed value. This doesn't mean it's constant, only that it is bounded regardless of input values. Strictly speaking, many allegedly O(1) time algorithms are not actually O(1) and just go so slowly that they are bounded for all practical input values.
Yes, garbage collection does affect the asymptotic complexity of algorithms running in the garbage collected arena. It is not without cost, but it is very hard to analyze without empirical methods, because the interaction costs are not compositional.
The time spent garbage collecting depends on the algorithm being used. Typically modern garbage collectors toggle modes as memory fills up to keep these costs under control. For instance, a common approach is to use a Cheney-style copying collector when memory pressure is low, because it pays a cost proportional to the size of the live set in exchange for using more space, and to switch to a mark-and-sweep collector when memory pressure becomes greater, even though it pays a cost proportional to the live set for marking and to the whole heap or dead set for sweeping. By the time you add card-marking and other optimizations, etc., the worst-case costs for a practical garbage collector may actually be a fair bit worse, picking up an extra logarithmic factor for some usage patterns.
So, if you allocate a big hash table, even if you access it using O(1) searches for all time during its lifetime, if you do so in a garbage collected environment, occasionally the garbage collector will traverse the entire array, because it is size O(n) and you will pay that cost periodically during collection.
The reason we usually leave it off of the complexity analysis of algorithms is that garbage collection interacts with your algorithm in non-trivial ways. How bad of a cost it is depends a lot on what else you are doing in the same process, so the analysis is not compositional.
Moreover, above and beyond the copy vs. compact vs. mark and sweep issue, the implementation details can drastically affect the resulting complexities:
Finally, when discussing an algorithm, we are discussing a straw man. The asymptotics will never fully incorporate all of the variables of your environment. Rarely do you ever implement every detail of a data structure as designed. You borrow a feature here and there, you drop a hash table in because you need fast unordered key access, you use a union-find over disjoint sets with path compression and union by rank to merge memory-regions over there because you can't afford to pay a cost proportional to the size of the regions when you merge them, or what have you. These structures are thought of as primitives, and the asymptotics help you when planning overall performance characteristics for the structure 'in-the-large', but knowledge of what the constants are matters too.
You can implement that hash table with perfectly O(1) asymptotic characteristics, just don't use garbage collection; map it into memory from a file and manage it yourself. You probably won't like the constants involved though.
I think when many people throw around the term "O(1)" they implicitly have in mind a "small" constant, whatever "small" means in their context.
You have to take all this big-O analysis with context and common sense. It can be an extremely useful tool or it can be ridiculous, depending on how you use it.