输入的平均间隔数(0..N)

发布于 2024-10-06 21:14:00 字数 738 浏览 0 评论 0原文

这个问题是在检查“查找该集合中应该覆盖 [0..N] 的 K 个缺失数字”问题时出现的。

问题的作者要求 CS 答案而不是基于方程的答案,他的建议是对输入进行排序,然后对其进行迭代以列出 K 个缺失的数字。

虽然这对我来说似乎很好,但也似乎很浪费。让我们举个例子:

  • N = 200
  • K = 2 (我们将考虑 K << N)
  • 缺失元素: 53, 75

“排序”集合可以表示为: [0, 52] U [54, 74] U [76, 200],这比枚举集合的所有值更紧凑(并且允许在 O(K) 操作中检索丢失的数字,如果集已排序)。

然而,这是最终结果,但在构造过程中,间隔列表可能会更大,因为我们一次提供一个元素......

因此,让我们引入另一个变量:让I 是迄今为止我们提供给结构的集合元素的数量。那么,我们在最坏的情况下可能有: min((NK)/2, I) 间隔(我认为...)

从中我们推断出构造过程中达到的间隔数是遇到的最大值对于 [0..N] 中的 I,最坏的情况是 (NK)/2 因此 O(N)

然而,我有一种直觉,如果输入是随机的,而不是专门设计的,我们可能会得到一个低得多的界限......因此总是如此棘手的问题:

平均间隔是多少?< /强>

The question sprang up when examining the "Find the K missing numbers in this set supposed to cover [0..N]" question.

The author of the question asked for CS answers instead of equation-based answers, and his proposal was to sort the input and then iterate over it to list the K missing numbers.

While this seems fine to me, it also seems wasteful. Let's take an example:

  • N = 200
  • K = 2 (we will consider K << N)
  • missing elements: 53, 75

The "sorted" set can be represented as: [0, 52] U [54, 74] U [76, 200], which is way more compact than enumerating all values of the set (and allows to retrieve the missing numbers in O(K) operations, to be compared with O(N) if the set is sorted).

However this is the final result, but during the construction the list of intervals might be much larger, as we feed the elements one at a time....

Let us, therefore, introduce another variable: let I be the number of elements of the set that we fed to the structure so far. Then, we may at worst have: min((N-K)/2, I) intervals (I think...)

From which we deduce that the number of intervals reached during the construction is the maximum encountered for I in [0..N], the worst case being (N-K)/2 thus O(N).

I have however a gut feeling that if the input is random, instead of being specially crafted, we might get a much lower bound... and thus the always so tricky question:

How much intervals... in average ?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

極樂鬼 2024-10-13 21:14:00

您的方法与建议的排序方法似乎是一种经典的权衡,即哪些操作便宜,哪些操作昂贵。

我发现你的符号有点混乱,所以请允许我使用我自己的:

S 成为集合。令n 为集合中的项目数:n = |S|。令 max 为集合中的最大数字:max = max(S)。令k 为不在集合中的元素数量:k = |{0,...,max} \ S|

对于排序解决方案,我们可以使用哈希非常便宜地将所有 n 元素插入到 S 中。这将需要O(n)。然后,为了找到k个缺失元素,我们在O(nlogn)中对集合进行排序,然后在O(n)中确定缺失元素。

也就是说,添加 n 个元素然后查找缺失的 k 元素的总成本需要预期的 O(n) + O(nlogn) + O(n) = O(nlogn)。


您建议采用一种不同的方法,将集合表示为 S 的密集子集列表。您将如何实现这样的数据结构?我建议使用排序树(而不是列表),以便插入变得高效。因为插入新元素 e 需要做什么?我认为您必须:

  1. 在树中找到可以添加 e 的潜在候选子集
  2. 如果子集已包含 e,则无需执行任何操作。
  3. 如果一个子集包含 e+1 并且另一个子集包含 e-1,则将子集合并在一起并将 e 添加到结果中
  4. 如果子集已存在包含 e+1,但 e-1 不包含在 S 中,将 e 添加到该子集
  5. 如果子集已包含 e-1,但 S 中不包含 e+1,请将 e 添加到该子集
  6. 否则,创建一个仅包含元素 e 的新子集并将其插入树中。

我们可以预期找到上述操作所需的子集需要O(logn)。如果我们将子集表示为整数对(我们只需递减下限或递增上限),则 4. 或 5. 的操作将花费常数时间。 3. 和 6. 可能需要更改树结构,但我们预计最多需要 O(logn),因此整个“插入”不会超过 O(logn)

现在有了这样的数据结构,我们可以通过按顺序遍历树并收集不在任何子集中的数字来轻松确定 k 个缺失的数字。成本与树中节点的数量呈线性关系,即<= n/2,因此总成本为O(n)

但是,如果我们再次考虑完整的序列操作,我们会得到 for n inserts O(nlogn) + O(n) 来查找 k缺少数字,因此总成本又是O(nlogn)

这并不比第一个算法的预期成本更好。


第三种解决方案是使用布尔数组来表示集合,并使用单个整数 max 来表示集合中的最大元素。

如果将元素 e 添加到 Set 中,则设置 array[e] = true。您可以使用表扩展来实现集合的可变大小,因此插入一个数组中的元素是常量。

要检索丢失的元素,您只需收集那些 f 其中 array[f] == false 的元素。这将需要O(max)

因此,插入 n 个元素并查找 k 个缺失元素的总成本为:O(n) + O(max)。然而,max = n + k,因此我们得到的总成本为O(n + k)


第四种方法是第三种方法和使用散列的方法之间的交叉,它也使用散列,但不需要排序。

将集合 S 存储在哈希集中,并将 S 中的最大元素存储在变量 max 中。要查找 k 个缺失值,首先生成一个包含所有数字 {0,...,max} 的结果集 R。然后迭代 S 并从 R 中删除 S 中的每个元素。

其成本也是O(n + k)

Your approach vs. the proposed one with sorting seems to be a classical trade-off of which operations is cheap and which one is expensive.

I find your notation a bit confusing, so please allow me to use my own:

Let S be the set. Let n be the number of items in the set: n = |S|. Let max be the biggest number in the set: max = max(S). Let k be the number of elements not in the set: k = |{0,...,max} \ S|.

For the sorting solution, we could very cheaply insert all n elements into S using hashing. That would take expected O(n). Then for finding the k missing elements, we sort the set in O(nlogn), and then determine the missing elements in O(n).

That is, the overall cost for adding n elements and then finding the missing k elements takes expected O(n) + O(nlogn) + O(n) = O(nlogn).


You suggest a different approach in which we represent the set as a list of dense subsets of S. How would you implement such a data structure? I suggest a sorted tree (instead of a list) so that an insert becomes efficient. Because what do you have to do for an insert of a new element e? I think you have to:

  1. Find the potential candidate subset(s) in the tree where e could be added
  2. If a subset already contains e, nothing has to be done.
  3. If a subset contains e+1 and another subset contains e-1, merge the subsets together and add e to the result
  4. If a subset already contains e+1, but e-1 is not contained in S, add e to that subset
  5. If a subset already contains e-1, but e+1 is not contained in S, add e to that subset
  6. Otherwise, create a new subset holding only the element e and insert it into the tree.

We can expect that finding the subsets needed for the above operations takes O(logn). The operations of 4. or 5. take constant time if we represent the subsets as pairs of integers (we just have to decrement the lower or increment the upper boundary). 3. and 6. potentially require changing the tree structure, but we expect that to take at most O(logn), so the whole "insert" will not take more than O(logn).

Now with such a datastructure in place, we can easily determine the k missing numbers by traversing the tree in order and collecting the numbers not in any of the subsets. The costs are linear in the number of nodes in the tree, which is <= n/2, so the total costs are O(n) for that.

However, if we consider again the complete sequence operations, we get for n inserts O(nlogn) + O(n) for finding the k missing numbers, so the overall costs are again O(nlogn).

This is not better than the expected costs of the first algorithm.


A third solution is to use a boolean array to represent the set and a single integer max for the biggest element in the set.

If an element e is added to the Set, you set array[e] = true. You can implement the variable size of the set using table expansion, so the costs for inserting an element into the array is constant.

To retrieve the missing elements, you just collect those elements f where array[f] == false. This will take O(max).

The overall costs for inserting n elements and finding the k missing ones is thus: O(n) + O(max). However, max = n + k, and so we get as the overall costs O(n + k).


A fourth method which is a cross-over between the third one and the one using hashing is the following one, which also uses hashing, but doesn't require sorting.

Store your set S in a hash set, and also store the maximum element in S in a variable max. To find the k missing ones, first generate a result set R containing all numbers {0,...,max}. Then iterate over S and delete every element in S from R.

The costs for that are also O(n + k).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文