Maintaining a collection of minimal subsets

Posted 2024-12-28 21:51:31

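Here are the operations I would like to perform on a hypothetical collection data structure that holds sets as its elements:

Insert a set into the data structure, but: (1) if the new set is a superset of any existing set, do not add it; (2) if the new set is a subset of any existing sets, remove those existing sets.
Enumerate all the sets currently in the collection.

All the sets in question are subsets of a known finite set, say {0..10^4}.

Is there a way to do this efficiently?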

油饼 2025-01-04 21:51:31

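Here is a recent paper on this problem: http://research.google.com/pubs/pub36974. In short, in the worst case you cannot do better than quadratic time, but in practice there are some tricks that speed things up.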

羁绊已千年 2025-01-04 21:51:31

All the sets in question are subsets of a known finite set, say {0..10^4}.

Let's call this N = 10^4. This is reasonably small, and this will prove useful. Let's say you have S sets.

'Logically' this means you have an N*S matrix.

You will already have a set of sets. There are S sets in this primary structure.

10^4 is sufficiently small that you could maintain a secondary data structure which stores, for each of the N values, the list of sets that it is in. This structure is sort of like the transpose of the primary structure. It could be a vector of length N, allowing constant-time lookup of the list of sets that a particular value is in.

Now, when you add a new set, you can use this secondary structure to find which other sets each of its values is in. For example, we add a new set with values 2, 5, 10:

new_set = {2,5,10}

The secondary structure tells us which sets they are in:

 2 : {A,B,D}
 5 : {B,D}
10 : {B}

We can merge and sort these three lists to get ABBBDD, which tells us not only which sets the new set overlaps with, but also the size of each overlap. Three values are shared with B, and since the new set has exactly three values, it is a subset of, or equal to, B. We share one value with A and two values with D. If it turns out that the total size of A is 1, then we know that A is a subset of our new set.
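The overlap-counting idea above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `MinimalSetFamily` and its methods are hypothetical, sets are assumed to be non-empty, and the insertion rule is the one from the question (reject a new set that contains a stored set, evict stored sets that contain the new set).

```python
from collections import Counter


class MinimalSetFamily:
    """Hypothetical sketch: primary structure plus an inverted index."""

    def __init__(self):
        self.sets = {}    # set id -> frozenset (primary structure)
        self.index = {}   # value -> ids of the sets containing it (the "transpose")
        self.next_id = 0

    def insert(self, values):
        new = frozenset(values)
        if not new:
            raise ValueError("this sketch assumes non-empty sets")
        # Merge the per-value id lists; the count is the overlap size.
        overlap = Counter(sid for v in new for sid in self.index.get(v, ()))
        # Overlap equals a stored set's size: that set is a subset of `new`.
        if any(n == len(self.sets[sid]) for sid, n in overlap.items()):
            return False
        # Overlap equals len(new): the stored set is a superset, so evict it.
        for sid, n in overlap.items():
            if n == len(new):
                for v in self.sets.pop(sid):
                    self.index[v].discard(sid)
        sid, self.next_id = self.next_id, self.next_id + 1
        self.sets[sid] = new
        for v in new:
            self.index.setdefault(v, set()).add(sid)
        return True

    def all_sets(self):
        return list(self.sets.values())
```

The work per insertion is proportional to the total length of the id lists touched, so it depends on how many stored sets share values with the new one rather than on S directly.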

夜唯美灬不弃 2025-01-04 21:51:31

Enumerating the sets in the collection is easy, O(n). However, checking whether a new candidate is a subset of any of the existing sets is going to be somewhat expensive. There are well-known algorithms for testing whether one set is a subset of another, so a simple approach is:

for each candidate set C
    for each existing set s in S
        test if C is a subset of s
        if it is, break
    if never found, add C to S

That's going to be something like O(n^2 lg n). Does that count as "efficient"?
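For comparison, a brute-force version can be written with Python's built-in set comparison (`<=` is the subset test on `frozenset`). This is a minimal sketch, not the answerer's code: the function name `insert_minimal` is hypothetical, and it applies the insertion rule from the question rather than the loop above.

```python
def insert_minimal(family, candidate):
    """Insert `candidate` into the list `family`, keeping only minimal sets."""
    c = frozenset(candidate)
    # One linear pass: reject c if some stored set is contained in it.
    if any(s <= c for s in family):
        return False
    # Otherwise drop every stored set that contains c, then add c.
    family[:] = [s for s in family if not c <= s]
    family.append(c)
    return True
```

Each insertion scans the whole family twice, so building a family of n sets this way takes a quadratic number of subset tests.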

挖鼻大婶 2025-01-04 21:51:31

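Maintain a Bloom filter for each of the stored sets. Generate the Bloom filter for the set to be inserted. If you bitwise AND the filter of the set to be inserted (call it X) with the Bloom filter of another set and get the value X back, then the set to be inserted may be a subset of that set (it may be a false positive, so at this point you need to check the slow way). Otherwise it is definitely not a subset, and you can try another set.

There are a number of tunable parameters when building a Bloom filter that let you trade off space efficiency against the probability of false positives.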

http://en.wikipedia.org/wiki/Bloom_filter
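The filter-then-verify step might look like the sketch below. It is an assumption-laden illustration: elements are taken to be non-negative integers, the signature width `M` and the multiplicative hash constant are arbitrary choices, only a single hash function is used, and all function names are hypothetical.

```python
M = 64  # signature width in bits (assumed; tune for your data)


def signature(s, m=M):
    """Bloom-style signature: each element sets one bit (single hash function)."""
    sig = 0
    for x in s:
        sig |= 1 << ((x * 2654435761) % m)
    return sig


def maybe_subset(sig_a, sig_b):
    """False means A is definitely not a subset of B; True means 'check exactly'."""
    return (sig_a & sig_b) == sig_a


def is_subset(a, sig_a, b, sig_b):
    # Cheap filter first, exact (slow) test only when the filter passes.
    return maybe_subset(sig_a, sig_b) and a <= b
```

Signatures would be computed once per stored set and cached, so most non-subset pairs are rejected with a single AND and compare.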

请恋爱 2025-01-04 21:51:31

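For space efficiency, you could use a bit set to represent each subset of the known finite set. There are also ways of representing sparse bit sets (for example, see this Java example) to save even more space.

The overall structure could be a set of bit sets. In Java, BitSet has no subset-test method, but I think it would not be too hard to extend BitSet to include an efficient one. (This would avoid the unpleasant task of testing whether a candidate to be added is equal to its intersection with any of the existing subsets.)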
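To make the bit-set subset test concrete, here is a small sketch that uses Python's arbitrary-precision integers as bit sets instead of Java's BitSet. The helper names are hypothetical, and elements are assumed to be non-negative integers.

```python
def to_bits(values):
    """Encode a set of non-negative integers as one Python int."""
    bits = 0
    for v in values:
        bits |= 1 << v
    return bits


def is_subset(a, b):
    """True if every bit set in a is also set in b."""
    return (a & ~b) == 0


def from_bits(bits):
    """Decode an int back into the sorted list of its elements."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]
```

With the universe {0..10^4}, each set fits in roughly 1.25 KB this way, and a subset test is one pass of word-wise AND operations.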

最美的太阳 2025-01-04 21:51:31

Use some kind of tree structure.

E.g. store the existing sets, each with its elements sorted, in a trie. At each node, maintain a flag that is set if the path leading to that node is an existing set.

1 To check if the given set is a superset of an already existing set:

def issuperset(node, s, setc, N):
    # s: sorted list of the given set's elements; N = len(s); setc: next index to try
    if node.is_set:
        return True
    for j in range(setc, N):
        if s[j] in node.child:
            if issuperset(node.child[s[j]], s, j + 1, N):
                return True
    return False

2 Remove all the supersets of a given set

def remsuperset(node, s, setc, N):
    # All N elements of s have been matched, so every set here or below is a superset.
    if setc == N:
        remove_all_sets_on_or_below_this_node(node)
        return
    for ch in list(node.child):
        if ch < s[setc]:
            remsuperset(node.child[ch], s, setc, N)
        elif ch == s[setc]:
            remsuperset(node.child[ch], s, setc + 1, N)

3 To enumerate the sets, just traverse the tree and print the path wherever the is_set flag is True
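Putting the three steps together, a self-contained version of the trie might look like this. It is a sketch under assumptions: `Node`, `insert` and `enumerate_sets` are hypothetical names, `issuperset` and `remsuperset` follow the answer's outline but prune emptied nodes themselves instead of calling a separate removal helper, and the insertion rule is the one from the question.

```python
class Node:
    def __init__(self):
        self.child = {}      # element -> Node
        self.is_set = False  # True if the path to this node is a stored set


def issuperset(node, s, start=0):
    """True if the sorted list s is a superset of some stored set."""
    if node.is_set:
        return True
    for j in range(start, len(s)):
        child = node.child.get(s[j])
        if child is not None and issuperset(child, s, j + 1):
            return True
    return False


def remsuperset(node, s, pos=0):
    """Delete every stored superset of s; return True if node became empty."""
    if pos == len(s):
        node.child.clear()
        node.is_set = False
        return True
    for ch in list(node.child):
        if ch > s[pos]:
            continue  # paths are sorted, so s[pos] cannot appear below ch
        nxt = pos + 1 if ch == s[pos] else pos
        if remsuperset(node.child[ch], s, nxt):
            del node.child[ch]
    return not node.child and not node.is_set


def insert(root, values):
    s = sorted(set(values))
    if issuperset(root, s):
        return False
    remsuperset(root, s)
    node = root
    for x in s:
        node = node.child.setdefault(x, Node())
    node.is_set = True
    return True


def enumerate_sets(node, path=()):
    if node.is_set:
        yield set(path)
    for x, child in node.child.items():
        yield from enumerate_sets(child, path + (x,))
```

A short usage example: after inserting {2, 5, 10, 11}, {2, 5, 7} and then {2, 5, 10}, the first set is evicted because it contains the third, and a later attempt to insert {2, 5, 7, 9} is rejected.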
