用于查找严格子集（从给定列表中）的快速数据结构

发布于 2024-11-18 02:20:39 字数 405 浏览 2 评论 0原文

我有一大组集合，例如 {{2,4,5} , {4,5}, ...}。 给定这些子集之一，我想迭代所有其他子集，这些子集是该子集的严格子集。也就是说，如果我对集合A感兴趣，例如{2,4,5}，我想找到所有集合B，其中相对补B / A = {}， 空集。一些可能性可能是 {2,4}、{2,5} 但不是 {2,3}

我当然可以线性搜索每次都检查，但我正在为较大的集合和子集寻找有效的数据结构（如果重要的话）。子集的数量通常为数十万，但如果它有所不同，我会对它可能达到数亿的情况感兴趣。子集的大小通常为 10 秒。

我正在用 C++ 编程，

谢谢

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

浮云落日 2024-11-25 02:20:39

从数学上讲，您应该为您的集合构建哈斯图，这将是带有顶点的部分有序集合你的集合和箭头由遏制给出。本质上，您想要创建一个带有箭头的有向非循环图 A --> ; B 如果 A 严格包含 B 并且不存在 C 使得 A 严格包含 C 和 C 严格包含 B。

这实际上将是一个排名偏序集，这意味着您可以根据集合的基数跟踪有向图的“级别”。这有点像创建一个哈希表来跳转到正确的集合。

从 A 开始，只需沿着图进行 BFS 即可找到 A 的所有真子集。

如何实现：（伪代码）

for (C in sets) {
    for (B in HasseDiagram at rank rank(C)+1) {
      if (C contains B)
        addArrow(C,B)
    }
    for (A in HasseDiagram at rank rank(C)+1) {
      if (C contains A)
        addArrow(A,C)
    }
    addToDiagram(C)
}

为了使该子例程和所有子例程更快，您可以将每个集合编码为二进制，其中数字 i 为 1i 是C 语言，则为0，否则为0。这使得测试遏制和确定等级变得微不足道。

如果您拥有所有可能的子集，则上述方法有效。由于您可能会遗漏一些内容，因此您必须检查更多内容。对于伪代码，您需要将 rank(C)-1 更改为最大整数 l l l <等级(C)使得HasseDiagram的某些元素具有等级l，对于rank(C)+1也类似。然后，当您将集合C添加到图中时：

如果A覆盖C，那么您只需要检查较低的排名集 B 也被 A 覆盖。
如果C覆盖B，那么您只需要检查也被B覆盖的较高排名的集合A >.

“X 覆盖 Y” 我的意思是有一个箭头 X -> Y，不仅仅是一条路径。

此外，当您使用上述检查之一在 A 和 B 之间插入 C 时，您将需要删除箭头 A - -> B 当您添加 A --> 时C 和 C --> B。

Mathematically, you should construct the Hasse diagram for your sets, which will be the partially ordered set with vertices your sets and arrows given by containment. Essentially, you want to create a directed, acyclic graph with an arrow A --> B if A strictly contains B and there is no C such that A strictly contains C and C strictly contains B.

This is actually going to be a ranked poset, meaning that you can keep track of "levels" of the digraph based on the cardinality of the sets. This is sort of like creating a hash table to jump to the right set.

From A, just do a BFS down the graph to find all proper subsets of A.

How to implement this: (in pseudocode)

for (C in sets) {
    for (B in HasseDiagram at rank rank(C)+1) {
      if (C contains B)
        addArrow(C,B)
    }
    for (A in HasseDiagram at rank rank(C)+1) {
      if (C contains A)
        addArrow(A,C)
    }
    addToDiagram(C)
}

To make this and all the subroutines fast, you can encode each set an a binary where digit i is 1 if i is in C and 0 otherwise. This makes testing containment and determining rank trivial.

The above method works if you have all possible subsets. Since you may be missing some, you'll have to check more things. For the pseudocode, you'll need to change rank(C)-1 to the largest integer l < rank(C) such that some element of the HasseDiagram has rank l, and similarly for rank(C)+1. Then, when you're adding the set C to the diagram:

If A covers C, then you only need to check lower ranked sets B that are also covered by A.
If C covers B, then you only need to check higher ranked sets A that also cover by B.

By "X covers Y" I mean there is an arrow X -> Y, not just a path.

Furthermore, when you insert C between A and B using one of the above checks, you will need to remove the arrow A --> B when you add A --> C and C --> B.

回复收藏 0 原文

慢慢从新开始 2024-11-25 02:20:39

我建议将所有集合存储在树中。树的每个节点都代表包含指定初始整数列表的所有集合。我会让节点包含以下信息：

树中此时或下面的最小集合中的附加元素的数量。（0 表示该节点位于树中。）
表示树中该节点以下所有子集的交集的位集。
指向将较大整数映射到包含该整数作为下一个元素的子树的数组的指针。作为一种特殊情况，如果树中只有一个低于该子集的子集，则该指针可能为空。（不需要填写树中未填充的部分。）

给定这棵树和一个子集，您可以通过递归和回溯来搜索该集合的所有子集。在搜索中，您从子集的第一个元素开始，查找包含该元素的所有子集，然后搜索不包含该元素的所有子集。

构建这棵树最多需要时间和空间O(n * m * k)，其中n是子集的数量m是平均数量每个子集的元素数量，k 是集合中元素范围的大小。使用比您的 k 元素的子集可能的宇宙小得多的随机集合，您将无法构建树的大部分内容，并且需要 O(n * m)< /code> 为你的树。

理论上，遍历这棵树的时间可能是O(n)。但是实际上，您会很早就修剪树的分支，并且不会遍历大多数其他子集。粗略计算表明，如果您有 k 个元素宇宙中的 n 个随机集合，并且 n << 2^k 那么树的搜索是O(n^0.5k)。（对于每个整数，一半的时间它在你的集合中，你正在搜索它的子集，你将搜索分成 2 个，一半的时间它不在你的集合中，你消除了一半的空间。j 个整数 2^j/2 搜索大小为 2^-j 的集合因此，当您完成搜索时。要与单个其他子集进行比较，需要进行 O(n^0.5) 次搜索。位图的最终比较是 O(k)。 )

注意：通过这种粗略的计算，我确信每个<的平均性能为o(n^0.5+epsilon)代码>epsilon> 0，但是收敛速度很慢。更准确地说，我怀疑性能的算术平均值为 n^{0.5 + O(sqrt(log(n)))})。但是那个 sqrt(log(n)) 片段需要很长时间才能收敛。

请注意，使用树中此时或下方的最小集合中的附加元素数量，可以让您的搜索轻松过滤掉所有太大而无法成为子集的集合。根据您的数据集，这可能会也可能不会带来有用的加速。

I would suggest storing all of the sets in a tree. Each node of the tree would represent all sets that contained a specified initial list of integers. I would have the nodes contain the following pieces of information:

The number of additional elements in the smallest set at this point or below in the tree. (0 means that this node is in the tree.)
A bitset representing the intersection of all subsets below this one in the tree.
A pointer to an array mapping larger integers to subtrees that contain that as the next element. As a special case, if there is only one subset below this one in the tree, this pointer could be null. (There is no need to fill out unpopulated parts of the tree.)

Given this tree, and a subset you can do a search with recursion and backtracking for all subsets of the set. In your search you start with the first element of the subset, look for all subsets that contain that element, then you search for all subsets that don't contain that element.

Building this tree takes time and space at most O(n * m * k) where n is the number of subsets m is the average number of elements per subset, and k is the size of the universe of elements that can be in the sets. With random sets of sets that are much smaller than the possible universe of subsets of your k elements you won't construct most of the tree, and it will take O(n * m) for your tree.

In theory traversing this tree could be time O(n). But in practice you'll trim branches of the tree fairly early, and won't traverse most of the other subsets. A back of the envelope calculation suggests that if you have n random sets out of a k element universe with n << 2^k then a search of the tree is O(n^0.5k). (At each integer, half the time it is in your set you're searching for subsets of and you split your search into 2, and half the time it isn't in your set and you eliminate half of your space. After j integers you've got 2^j/2 searches going of sets of sets of size 2^-jn. Thus by the time you get the searches down to single other subsets to compare with, there are O(n^0.5) searches going. The final comparison of bitmaps is O(k).)

Note: I'm convinced by this back of the envelope calculation that the average performance is o(n^0.5+epsilon) for every epsilon > 0, but the convergence is very slow. More precisely I suspect that the arithmetic average of the performance is n^{0.5 + O(sqrt(log(n)))}). But that sqrt(log(n)) piece takes a long time to converge.

Note that using the number of additional elements in the smallest set at this point or below in the tree lets your search trivially filter out all sets that are too large to be subsets. Depending on your dataset, this may or may not lead to useful speedups.

回复收藏 0 原文

被翻牌 2024-11-25 02:20:39

PengOne建议的方法是可行的，但效率不是很高。要了解它失败的原因，请考虑以下病态示例：

假设您有一个宇宙 U，其中有 n 个不同的元素，并让您正在搜索的所有集合的集合由 U 的所有子集组成，其中恰好有 k 个元素。那么这里确实没有一对集合是严格相互包含的；因此，在最坏的情况下，您将不得不搜索所有 n 选择 k 个可能的集合！换句话说，在最坏的情况下，使用他提出的数据结构并不比简单的线性搜索更好。

显然，您可以做得更好，并且要使用的正确数据结构是 trie： http:// en.wikipedia.org/wiki/Trie

要使 trie 适应集合而不仅仅是字符串，只需修复通用集合元素的排序，然后将每个子集编码为二进制有限长度的字符串，其中第 i 个字符是 0 或 1，取决于集合是否包含第 i 个元素。下面是 python 中的实现

import math

class SetTree:
    def __init__(self, index, key, left, right):
        self.index = index
        self.key = key
        self.left = left
        self.right = right

cached_trees = { }
cached_index = 2

def get_index(T):
    if isinstance(T, SetTree):
        return T.index
    if T:
        return 1
    return 0        

def make_set_tree(key, left, right):
    global cached_trees, cached_index
    code = (key, get_index(left), get_index(right))
    if not code in cached_trees:
        cached_trees[code] = SetTree(cached_index, key, left, right)
        cached_index += 1
    return cached_trees[code]

def compute_freqs(X):
    freqs, total = {}, 0
    for S in X:
        for a in S:
            if a in freqs:
                freqs[a] += 1
            else:
                freqs[a] = 1
            total += 1
    U = [ (-f, a) for a,f in freqs.items() ]
    U.sort()
    return U

#Constructs the tree recursively
def build_tree_rec(X, U):
    if len(X) == 0:
        return False
    if len(U) == 0:
        return True

    key = U[0][1]

    left_elems = [ S for S in X if key in S]

    if len(left_elems) > 0:
        return make_set_tree(key,
            build_tree_rec(left_elems, U[1:]),
            build_tree_rec([ S for S in X if not key in S ], U[1:]))

    return build_tree_rec(X, U[1:])

#Build a search tree recursively
def build_tree(X):
    U = compute_freqs(X)
    return build_tree_rec(X, U)


#Query a set tree to find all subsets contained in a given set
def query_tree(T, S):
    if not isinstance(T, SetTree):
        return [ [] ] if T else []
    if T.key in S:
        return [ U + [ T.key ] for U in query_tree(T.left, S) ] + query_tree(T.right, S)
    return query_tree(T.right, S)

#Debugging function: Converts a tree to a tuple for printing
def tree_to_tuple(T):
    if isinstance(T, SetTree):
        return (T.key, tree_to_tuple(T.left), tree_to_tuple(T.right))
    return T

现在这里是一个示例用法：

In [15]: search_tree = set_search.build_tree(set_family)

In [16]: set_search.tree_to_tuple(search_tree)
Out[16]: 
(2,
 (4, (5, True, True), (5, True, (3, True, False))),
 (4, (5, True, False), (1, True, False)))

In [17]: set_search.query_tree(search_tree, set([2,3,4,5]))
Out[17]: [[5, 4, 2], [4, 2], [5, 2], [3, 2], [5, 4]]

In [18]: set_search.query_tree(search_tree, set([1,2,3,4,5]))
Out[18]: [[5, 4, 2], [4, 2], [5, 2], [3, 2], [5, 4], [1]]

In [19]: set_search.query_tree(search_tree, set([2,4,5]))
Out[19]: [[5, 4, 2], [4, 2], [5, 2], [5, 4]]

In [20]: set_search.query_tree(search_tree, set([2,5]))
Out[20]: [[5, 2]]

In [21]: set_search.query_tree(search_tree, set([1]))
Out[21]: [[1]]

In [22]: set_search.query_tree(search_tree, set([15]))
Out[22]: []

请注意，query_tree 执行的工作量与代表 query_tree 返回的所有结果集的子树的大小成正比。因此，我们的目标是计算其中一个子尝试的大小（平均），然后作为次要目标来最小化该数量。实现此目的的一种方法是按照降序频率对共相元素进行重新排序，以便它们在树的较低级别中重复的次数尽可能少。上面的代码中也做了这样的优化。第二个优化是缓存已经搜索过的树，以避免重做不必要的工作。

编辑：就在我完成输入后，我看到了 btilly 的答案，它对这个问题得出了或多或少相同的结论（模数一些技术挑剔，我已将其移至他的帖子的评论中。）

编辑 2：实现这实际上只是二元决策图的一个特例。现在确实没有足够的精力来修复这篇文章，所以就保持原样吧。也许明天修复它。 http://en.wikipedia.org/wiki/Binary_decision_diagram

The approach suggested by PengOne would work, but it is not very efficient. To see why it fails, consider the following pathological example:

Suppose you have a universe U, which has n distinct elements, and let the set of all the sets you are searching over consist of all subsets of U with exactly k elements. Then it is true that no pair of sets here are strictly contained in one another; and so in the worst case you would have to search over all n choose k possible sets! In other words, using his proposed data structure is no better than a naive linear search in the worst case.

Clearly you can do much better than this, and the correct data structure to use would be a trie: http://en.wikipedia.org/wiki/Trie

To adapt a trie to work for sets instead of just strings, it is sufficient to fix an ordering on the elements of the universal set, then encode each of your subsets as a binary string of finite length, where the ith character is 0 or 1 depending on whether the set contains the ith element. Here is an implementation in python

import math

class SetTree:
    def __init__(self, index, key, left, right):
        self.index = index
        self.key = key
        self.left = left
        self.right = right

cached_trees = { }
cached_index = 2

def get_index(T):
    if isinstance(T, SetTree):
        return T.index
    if T:
        return 1
    return 0        

def make_set_tree(key, left, right):
    global cached_trees, cached_index
    code = (key, get_index(left), get_index(right))
    if not code in cached_trees:
        cached_trees[code] = SetTree(cached_index, key, left, right)
        cached_index += 1
    return cached_trees[code]

def compute_freqs(X):
    freqs, total = {}, 0
    for S in X:
        for a in S:
            if a in freqs:
                freqs[a] += 1
            else:
                freqs[a] = 1
            total += 1
    U = [ (-f, a) for a,f in freqs.items() ]
    U.sort()
    return U

#Constructs the tree recursively
def build_tree_rec(X, U):
    if len(X) == 0:
        return False
    if len(U) == 0:
        return True

    key = U[0][1]

    left_elems = [ S for S in X if key in S]

    if len(left_elems) > 0:
        return make_set_tree(key,
            build_tree_rec(left_elems, U[1:]),
            build_tree_rec([ S for S in X if not key in S ], U[1:]))

    return build_tree_rec(X, U[1:])

#Build a search tree recursively
def build_tree(X):
    U = compute_freqs(X)
    return build_tree_rec(X, U)


#Query a set tree to find all subsets contained in a given set
def query_tree(T, S):
    if not isinstance(T, SetTree):
        return [ [] ] if T else []
    if T.key in S:
        return [ U + [ T.key ] for U in query_tree(T.left, S) ] + query_tree(T.right, S)
    return query_tree(T.right, S)

#Debugging function: Converts a tree to a tuple for printing
def tree_to_tuple(T):
    if isinstance(T, SetTree):
        return (T.key, tree_to_tuple(T.left), tree_to_tuple(T.right))
    return T

Now here is an example usage:

In [15]: search_tree = set_search.build_tree(set_family)

In [16]: set_search.tree_to_tuple(search_tree)
Out[16]: 
(2,
 (4, (5, True, True), (5, True, (3, True, False))),
 (4, (5, True, False), (1, True, False)))

In [17]: set_search.query_tree(search_tree, set([2,3,4,5]))
Out[17]: [[5, 4, 2], [4, 2], [5, 2], [3, 2], [5, 4]]

In [18]: set_search.query_tree(search_tree, set([1,2,3,4,5]))
Out[18]: [[5, 4, 2], [4, 2], [5, 2], [3, 2], [5, 4], [1]]

In [19]: set_search.query_tree(search_tree, set([2,4,5]))
Out[19]: [[5, 4, 2], [4, 2], [5, 2], [5, 4]]

In [20]: set_search.query_tree(search_tree, set([2,5]))
Out[20]: [[5, 2]]

In [21]: set_search.query_tree(search_tree, set([1]))
Out[21]: [[1]]

In [22]: set_search.query_tree(search_tree, set([15]))
Out[22]: []

Note that the amount of work performed by query_tree is proportional to the size of the subtree which represents the set of all results returned by query_tree. Thus our goal is to compute the size of one of the subtries (on average) and then as a secondary goal to minimize this quantity. One way to do this is to reorder the elements of the universal in terms of descending frequency, so that they are repeated as few times as possible in the lower levels of the tree. This optimization is also done in the above code. A secondary optimization is to cache the trees which have already been searched to avoid having to redo unnecessary work.

EDIT: Just after I got done typing this up, I saw btilly's answer, which is comes to more or less the same conclusion about the problem (modulo some technical nitpicks which I have moved into the comments on his post.)

EDIT 2: Realized that this is really just a special case of a binary decision diagram. Don't really have enough energy to fix the write up right now, so will leave it as is. Perhaps fix it tomorrow. http://en.wikipedia.org/wiki/Binary_decision_diagram

回复收藏 0 原文

╰ゝ天使的微笑 2024-11-25 02:20:39

这很有趣。我喜欢 PengOne 建议的哈斯图方法，但我认为你可以使用素数技巧非常快速地构建哈斯图。假设所有集合的并集得到自然数 1 到 N。将这些数字中的每一个与相应的素数映射，例如：

PrimeMap [1] = 2;
PrimeMap [2] = 3;
PrimeMap [3] = 5;

接下来，通过将与中的数字对应的每个素数相乘来计算每个集合的“分数”该集。例如，集合 {1,2,3} 的分数为 2*3*5 = 30。现在，要使集合 A 成为另一个集合 B 的真子集，分数（A）必须除以分数（B）（分数对于 {1,2}、{2,3} 和 {1,3} 分别为 6、15 和 10，每个都除以 30)。使用此分数构建您的哈斯图。

编辑：这似乎是很好的理论解决方案之一。可能不是该走的路。 yi_H 建议的位集同样好，并且不会遇到大整数问题。

This is interesting. I like the Hasse diagram approach PengOne suggests but I think you can build the Hasse diagram really quickly using a prime number trick. Lets say the union of all of the sets results in natural numbers 1 to N. Map each of these numbers with corresponding primes, like:

PrimeMap [1] = 2;
PrimeMap [2] = 3;
PrimeMap [3] = 5;

Next, caluclate a 'score' for each set by multiplying each of the prime numbers corresponding to the number in the set. For instance a set {1,2,3} would have a score 2*3*5 = 30. Now, for a set A to be a proper subset of another set B score(A) must divide score(B) (scores for {1,2}, {2,3} and {1,3} are 6, 15 and 10, each of which divide 30). Use this score to build your Hasse diagram.

Edit: This seems like one of the nice theoretical solutions. Probably not the way to go. Bitsets as suggested by yi_H is just as good and does not suffer from big integer troubles.

回复收藏 0 原文