维护一组最小子集
以下是我想要对一个假设的集合数据结构执行的操作,该数据结构将集合作为其元素:
- 将集合插入到数据结构中,但是:(1) 如果新集合是任何现有集合的超集,不要添加它 (2) 如果新集是任何现有集的子集,请将其删除。
- 枚举当前集合中的所有集合
所有相关集合都是已知有限集合的子集,例如 {0..10^4}。
有没有办法有效地做到这一点?
Here are the operations that I'd like to perform on a hypothetical collection data structure that holds sets as its elements:
- Insert a set into the data structure, but: (1) if the new set is a superset of any of the existing sets, don't add it (2) if the new set is a subset of any existing sets, remove them.
- Enumerate all the sets currently in the collection
All the sets in question are subsets of a known finite set, say {0..10^4}.
Is there a way to do this efficiently?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
以下是关于此问题的最新论文:http://research.google.com/pubs/pub36974。简而言之
,在最坏的情况下,你不可能比二次时间做得更好;但在实践中,有一些技巧可以加快速度。
Here is a recent paper on this problem: http://research.google.com/pubs/pub36974.html
Briefly, you cannot do much better than quadratic time in the worst case; but there are some tricks to speed it up in practice.
我们称其为 N = 10^4。这是相当小的,并且这将被证明是有用的。假设您有 S 套。
“逻辑上”这意味着你有一个 N*S 矩阵。
您已经拥有了一套。该一级结构中有S组。
10^4 足够小,您可以维护一个辅助数据结构,该结构为每个 N 值存储其所在的集合列表。此结构是排序的就像一级结构的转置一样。这可以是长度为 N 的向量,允许恒定时间查找来查找特定值所在的集合列表。
现在,当您添加新集合时,可以使用此二级结构来查找每个集合中的其他集合例如,我们添加一个新的集合,其值为 2,5,10
二级结构告诉我们它们位于哪些集合中:
我们可以对这三个列表进行合并和排序以获得
ABBBDD
它不仅告诉我们它与哪些集合重叠,但重叠的大小。与 B 共享三个节点,这意味着我们的新集合是 B 的子集或等于 B。我们与 A 共享 1 个节点,与 D 共享 2 个节点。如果 A 的总大小为 1,那么我们现在知道 A 是新集合的子集。Let's call this N = 10^4. This is reasonably small, and this will prove useful. Let's say you have S sets.
'Logically' this means you have an N*S matrix.
You will already have a set of sets. There are S sets in this primary structure.
10^4 is sufficiently small that you could maintain a secondary data structure which stores, for each the N values, the list of sets that it is in. This structure is sort of like the transpose of the primary structure. This could be a vector of length N, allowing constant time lookup to find the list of sets that a particular value is in.
Now, when you add a new set, it will be possible to use this secondary structure to find which other sets each of its values are in. For example, we add a new set with values 2,5, 10
The secondary structure tells us which sets they are in:
We can merge and sort these three lists to get
ABBBDD
which tells us not only which sets it overlaps with, but the size of the overlaps. Three nodes are shared with B, which means that our new set is a subset of, or equal to, B. We share 1 node with A, and two nodes with D. If it turns out that the total size of A is 1, then we now know that A is a subset of our new set.枚举集合中的集合很容易,O(n)。然而,检查一个新候选是否是所有现有集合的子集将会有些昂贵。有一些众所周知的算法可以测试一组是否是另一组的子集,非常简单,
类似于 O(n^2 lg n)。这算不算“高效”?
Enumerating the sets in the collection is easy, O(n). However, checking a new candidate for whether it's a subset of all the existing sets is going to be somewhat expensive. There are well known algorithms for testing if one set is subset of another, so simple
That's going to be something like O(n^2 lg n). Does that count as "efficient"?
为所有存储的集合维护一个布隆过滤器。为要插入的集合生成布隆过滤器。如果您将要插入的集合的过滤器(称为 X)与另一个集合的布隆过滤器按位与,并获取值 X,那么要插入的集合可能是一个子集(可能是误报,您需要检查此时的慢速方式)。否则肯定不行,你可以尝试另一个。
构建布隆过滤器时有许多可调整的参数,使您可以在空间效率和误报概率之间进行权衡。
http://en.wikipedia.org/wiki/Bloom_filter
Maintain a bloom filter for all of the stored sets. Generate a bloom filter for the set to be inserted. If you bitwise AND the filter for the set to be inserted (call this X) with another set's bloom filter, and get the value X back, then your set to be inserted MIGHT be a subset (possibly a false positive, you need to check the slow way at this point). Otherwise it is definitely not and you can try with another.
There are many adjustable parameters when building a bloom filter that allow you to make tradeoffs between space efficiency and probability of false positives.
http://en.wikipedia.org/wiki/Bloom_filter
为了空间效率,您可以使用位集来表示已知有限集的每个子集。还有一些表示稀疏位集的方法(例如,请参见 this Java 示例),以进一步节省空间。
整体结构可以是一组位组。在Java中,
BitSet
没有子集测试方法,但我认为扩展BitSet
以包含有效的子集测试方法不会太难。 (这将避免测试要添加的候选是否等于其与任何现有子集的交集的令人讨厌的任务。)For space efficiency, you can use a bit set to represent each subset of the known finite set. There are also methods for representing sparse bit sets (see, e.g., this Java sample), to gain further space savings.
The overall structure could be a set of bit sets. In Java,
BitSet
does not have a subset test method, but I don't think it would be too hard to extendBitSet
to include an efficient subset test method. (This would avoid the obnoxious task of testing whether the candidate to be added is equal to its intersection with any of the existing subsets.)使用某种树结构。
例如。将已排序的现有集合存储在 Trie 中。如果通向该节点的路径是现有集合,则在每个节点维护一个标志
1 检查给定集合是否是已存在集合的超集:
2 删除给定集合的所有超集
3 对于枚举集只需遍历树并打印路径 is_set flag is True
Use some kind of tree structure.
Eg. Store the sorted existing sets in a Trie. At each node maintain a Flag if the path leading to that node is an existing set
1 To check if the given set is a superset of an already existing set:
2 Remove all the supersets of a given set
3 For enumerating sets just traverse the tree and print path is is_set flag is True