输入的平均间隔数(0..N)
这个问题是在检查“查找该集合中应该覆盖 [0..N] 的 K 个缺失数字”问题时出现的。
问题的作者要求 CS 答案而不是基于方程的答案,他的建议是对输入进行排序,然后对其进行迭代以列出 K 个缺失的数字。
虽然这对我来说似乎很好,但也似乎很浪费。让我们举个例子:
- N = 200
- K = 2 (我们将考虑 K << N)
- 缺失元素: 53, 75
“排序”集合可以表示为: [0, 52] U [54, 74] U [76, 200]
,这比枚举集合的所有值更紧凑(并且允许在 O(K) 操作中检索丢失的数字,如果集已排序)。
然而,这是最终结果,但在构造过程中,间隔列表可能会更大,因为我们一次提供一个元素......
因此,让我们引入另一个变量:让I
是迄今为止我们提供给结构的集合元素的数量。那么,我们在最坏的情况下可能有: min((NK)/2, I)
间隔(我认为...)
从中我们推断出构造过程中达到的间隔数是遇到的最大值对于 [0..N] 中的 I,最坏的情况是 (NK)/2
因此 O(N)
。
然而,我有一种直觉,如果输入是随机的,而不是专门设计的,我们可能会得到一个低得多的界限......因此总是如此棘手的问题:
平均间隔是多少?< /强>
The question sprang up when examining the "Find the K missing numbers in this set supposed to cover [0..N]" question.
The author of the question asked for CS answers instead of equation-based answers, and his proposal was to sort the input and then iterate over it to list the K missing numbers.
While this seems fine to me, it also seems wasteful. Let's take an example:
- N = 200
- K = 2 (we will consider K << N)
- missing elements: 53, 75
The "sorted" set can be represented as: [0, 52] U [54, 74] U [76, 200]
, which is way more compact than enumerating all values of the set (and allows to retrieve the missing numbers in O(K) operations, to be compared with O(N) if the set is sorted).
However this is the final result, but during the construction the list of intervals might be much larger, as we feed the elements one at a time....
Let us, therefore, introduce another variable: let I
be the number of elements of the set that we fed to the structure so far. Then, we may at worst have: min((N-K)/2, I)
intervals (I think...)
From which we deduce that the number of intervals reached during the construction is the maximum encountered for I in [0..N], the worst case being (N-K)/2
thus O(N)
.
I have however a gut feeling that if the input is random, instead of being specially crafted, we might get a much lower bound... and thus the always so tricky question:
How much intervals... in average ?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您的方法与建议的排序方法似乎是一种经典的权衡,即哪些操作便宜,哪些操作昂贵。
我发现你的符号有点混乱,所以请允许我使用我自己的:
让
S
成为集合。令n
为集合中的项目数:n = |S|
。令max
为集合中的最大数字:max = max(S)
。令k
为不在集合中的元素数量:k = |{0,...,max} \ S|
。对于排序解决方案,我们可以使用哈希非常便宜地将所有
n
元素插入到S
中。这将需要O(n)
。然后,为了找到k
个缺失元素,我们在O(nlogn)
中对集合进行排序,然后在O(n)
中确定缺失元素。也就是说,添加
n
个元素然后查找缺失的k
元素的总成本需要预期的O(n) + O(nlogn) + O(n) = O(nlogn)。
您建议采用一种不同的方法,将集合表示为
S
的密集子集列表。您将如何实现这样的数据结构?我建议使用排序树(而不是列表),以便插入变得高效。因为插入新元素e
需要做什么?我认为您必须:e
的潜在候选子集e
,则无需执行任何操作。e+1
并且另一个子集包含e-1
,则将子集合并在一起并将e
添加到结果中e+1
,但e-1
不包含在S
中,将e
添加到该子集e-1
,但S
中不包含e+1
,请将e
添加到该子集e
的新子集并将其插入树中。我们可以预期找到上述操作所需的子集需要
O(logn)
。如果我们将子集表示为整数对(我们只需递减下限或递增上限),则 4. 或 5. 的操作将花费常数时间。 3. 和 6. 可能需要更改树结构,但我们预计最多需要O(logn)
,因此整个“插入”不会超过O(logn)
。现在有了这样的数据结构,我们可以通过按顺序遍历树并收集不在任何子集中的数字来轻松确定
k
个缺失的数字。成本与树中节点的数量呈线性关系,即<= n/2
,因此总成本为O(n)
。但是,如果我们再次考虑完整的序列操作,我们会得到 for
n
insertsO(nlogn)
+O(n)
来查找 k缺少数字,因此总成本又是O(nlogn)
。这并不比第一个算法的预期成本更好。
第三种解决方案是使用布尔数组来表示集合,并使用单个整数 max 来表示集合中的最大元素。
如果将元素
e
添加到 Set 中,则设置array[e] = true
。您可以使用表扩展来实现集合的可变大小,因此插入一个数组中的元素是常量。要检索丢失的元素,您只需收集那些
f
其中array[f] == false
的元素。这将需要O(max)
。因此,插入 n 个元素并查找 k 个缺失元素的总成本为:
O(n) + O(max)
。然而,max = n + k
,因此我们得到的总成本为O(n + k)
。第四种方法是第三种方法和使用散列的方法之间的交叉,它也使用散列,但不需要排序。
将集合
S
存储在哈希集中,并将S
中的最大元素存储在变量max
中。要查找k
个缺失值,首先生成一个包含所有数字{0,...,max}
的结果集 R。然后迭代S
并从R
中删除S
中的每个元素。其成本也是
O(n + k)
。Your approach vs. the proposed one with sorting seems to be a classical trade-off of which operations is cheap and which one is expensive.
I find your notation a bit confusing, so please allow me to use my own:
Let
S
be the set. Letn
be the number of items in the set:n = |S|
. Letmax
be the biggest number in the set:max = max(S)
. Letk
be the number of elements not in the set:k = |{0,...,max} \ S|
.For the sorting solution, we could very cheaply insert all
n
elements intoS
using hashing. That would take expectedO(n)
. Then for finding thek
missing elements, we sort the set inO(nlogn)
, and then determine the missing elements inO(n)
.That is, the overall cost for adding
n
elements and then finding the missingk
elements takes expectedO(n) + O(nlogn) + O(n) = O(nlogn)
.You suggest a different approach in which we represent the set as a list of dense subsets of
S
. How would you implement such a data structure? I suggest a sorted tree (instead of a list) so that an insert becomes efficient. Because what do you have to do for an insert of a new elemente
? I think you have to:e
could be addede
, nothing has to be done.e+1
and another subset containse-1
, merge the subsets together and adde
to the resulte+1
, bute-1
is not contained inS
, adde
to that subsete-1
, bute+1
is not contained inS
, adde
to that subsete
and insert it into the tree.We can expect that finding the subsets needed for the above operations takes
O(logn)
. The operations of 4. or 5. take constant time if we represent the subsets as pairs of integers (we just have to decrement the lower or increment the upper boundary). 3. and 6. potentially require changing the tree structure, but we expect that to take at mostO(logn)
, so the whole "insert" will not take more thanO(logn)
.Now with such a datastructure in place, we can easily determine the
k
missing numbers by traversing the tree in order and collecting the numbers not in any of the subsets. The costs are linear in the number of nodes in the tree, which is<= n/2
, so the total costs areO(n)
for that.However, if we consider again the complete sequence operations, we get for
n
insertsO(nlogn)
+O(n)
for finding the k missing numbers, so the overall costs are againO(nlogn)
.This is not better than the expected costs of the first algorithm.
A third solution is to use a boolean
array
to represent the set and a single integermax
for the biggest element in the set.If an element
e
is added to the Set, you setarray[e] = true
. You can implement the variable size of the set using table expansion, so the costs for inserting an element into the array is constant.To retrieve the missing elements, you just collect those elements
f
wherearray[f] == false
. This will takeO(max)
.The overall costs for inserting n elements and finding the k missing ones is thus:
O(n) + O(max)
. However,max = n + k
, and so we get as the overall costsO(n + k)
.A fourth method which is a cross-over between the third one and the one using hashing is the following one, which also uses hashing, but doesn't require sorting.
Store your set
S
in a hash set, and also store the maximum element inS
in a variablemax
. To find thek
missing ones, first generate a result set R containing all numbers{0,...,max}
. Then iterate overS
and delete every element inS
fromR
.The costs for that are also
O(n + k)
.