用于查找严格子集(从给定列表中)的快速数据结构
我有一大组集合,例如 {{2,4,5} , {4,5}, ...}。
给定这些子集之一,我想迭代所有其他子集,这些子集是该子集的严格子集。也就是说,如果我对集合A
感兴趣,例如{2,4,5}
,我想找到所有集合B
,其中相对补B / A = {},
空集。一些可能性可能是 {2,4}
、{2,5}
但不是 {2,3}
我当然可以线性搜索每次都检查,但我正在为较大的集合和子集寻找有效的数据结构(如果重要的话)。子集的数量通常为数十万,但如果它有所不同,我会对它可能达到数亿的情况感兴趣。子集的大小通常为 10 秒。
我正在用 C++ 编程,
谢谢
I have a large set of sets e.g. {{2,4,5} , {4,5}, ...}.
Given one of these subsets, I would like to iterate through all other subsets which are strict subsets of this subset. That is, if I am interested in set A
, e.g. {2,4,5}
, I want to find all sets B
where the relative complement B / A = {},
the empty set. Some possibilities could be {2,4}
, {2,5}
but not {2,3}
I could of course search linearly and check each time, but am looking for an efficient data structure both for the larger set and the subset (if it matters). The number of subsets is typically in the 10s of thousands, but if it makes a difference I would be interested in cases where it could be in the hundreds of millions. The size of the subsets is typically in 10s.
I am programming in C++
Thanks
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
从数学上讲,您应该为您的集合构建哈斯图,这将是带有顶点的部分有序集合你的集合和箭头由遏制给出。本质上,您想要创建一个带有箭头的有向非循环图
A --> ; B
如果A
严格包含B
并且不存在C
使得A
严格包含C
和C
严格包含B
。这实际上将是一个排名偏序集,这意味着您可以根据集合的基数跟踪有向图的“级别”。这有点像创建一个哈希表来跳转到正确的集合。
从
A
开始,只需沿着图进行 BFS 即可找到A
的所有真子集。如何实现:(伪代码)
为了使该子例程和所有子例程更快,您可以将每个集合编码为二进制,其中数字
i
为1i
是C
语言,则为0
,否则为0
。这使得测试遏制和确定等级变得微不足道。如果您拥有所有可能的子集,则上述方法有效。由于您可能会遗漏一些内容,因此您必须检查更多内容。对于伪代码,您需要将
rank(C)-1
更改为最大整数l
l
l <等级(C)
使得HasseDiagram的某些元素具有等级l
,对于rank(C)+1
也类似。然后,当您将集合C
添加到图中时:如果
A
覆盖C
,那么您只需要检查较低的排名集B
也被A
覆盖。如果
C
覆盖B
,那么您只需要检查也被B
覆盖的较高排名的集合A
>.“
X
覆盖Y
” 我的意思是有一个箭头X -> Y
,不仅仅是一条路径。此外,当您使用上述检查之一在
A
和B
之间插入C
时,您将需要删除箭头A - -> B
当您添加A --> 时C
和C --> B。
Mathematically, you should construct the Hasse diagram for your sets, which will be the partially ordered set with vertices your sets and arrows given by containment. Essentially, you want to create a directed, acyclic graph with an arrow
A --> B
ifA
strictly containsB
and there is noC
such thatA
strictly containsC
andC
strictly containsB
.This is actually going to be a ranked poset, meaning that you can keep track of "levels" of the digraph based on the cardinality of the sets. This is sort of like creating a hash table to jump to the right set.
From
A
, just do a BFS down the graph to find all proper subsets ofA
.How to implement this: (in pseudocode)
To make this and all the subroutines fast, you can encode each set an a binary where digit
i
is1
ifi
is inC
and0
otherwise. This makes testing containment and determining rank trivial.The above method works if you have all possible subsets. Since you may be missing some, you'll have to check more things. For the pseudocode, you'll need to change
rank(C)-1
to the largest integerl < rank(C)
such that some element of the HasseDiagram has rankl
, and similarly forrank(C)+1
. Then, when you're adding the setC
to the diagram:If
A
coversC
, then you only need to check lower ranked setsB
that are also covered byA
.If
C
coversB
, then you only need to check higher ranked setsA
that also cover byB
.By "
X
coversY
" I mean there is an arrowX -> Y
, not just a path.Furthermore, when you insert
C
betweenA
andB
using one of the above checks, you will need to remove the arrowA --> B
when you addA --> C
andC --> B
.我建议将所有集合存储在树中。树的每个节点都代表包含指定初始整数列表的所有集合。我会让节点包含以下信息:
给定这棵树和一个子集,您可以通过递归和回溯来搜索该集合的所有子集。在搜索中,您从子集的第一个元素开始,查找包含该元素的所有子集,然后搜索不包含该元素的所有子集。
构建这棵树最多需要时间和空间
O(n * m * k)
,其中n
是子集的数量m
是平均数量每个子集的元素数量,k
是集合中元素范围的大小。使用比您的k
元素的子集可能的宇宙小得多的随机集合,您将无法构建树的大部分内容,并且需要O(n * m)< /code> 为你的树。
理论上,遍历这棵树的时间可能是
O(n)
。 但是实际上,您会很早就修剪树的分支,并且不会遍历大多数其他子集。粗略计算表明,如果您有k
个元素宇宙中的n
个随机集合,并且n << 2k
那么树的搜索是O(n0.5k)
。 (对于每个整数,一半的时间它在你的集合中,你正在搜索它的子集,你将搜索分成 2 个,一半的时间它不在你的集合中,你消除了一半的空间。j
个整数2j/2
搜索大小为2-j 的集合因此,当您完成搜索时。要与单个其他子集进行比较,需要进行
O(n0.5)
次搜索。位图的最终比较是O(k)
。 )注意:通过这种粗略的计算,我确信每个<的平均性能为
o(n0.5+epsilon)
代码>epsilon> 0,但是收敛速度很慢。更准确地说,我怀疑性能的算术平均值为n0.5 + O(sqrt(log(n))))
。但是那个 sqrt(log(n)) 片段需要很长时间才能收敛。请注意,使用树中此时或下方的最小集合中的附加元素数量,可以让您的搜索轻松过滤掉所有太大而无法成为子集的集合。根据您的数据集,这可能会也可能不会带来有用的加速。
I would suggest storing all of the sets in a tree. Each node of the tree would represent all sets that contained a specified initial list of integers. I would have the nodes contain the following pieces of information:
Given this tree, and a subset you can do a search with recursion and backtracking for all subsets of the set. In your search you start with the first element of the subset, look for all subsets that contain that element, then you search for all subsets that don't contain that element.
Building this tree takes time and space at most
O(n * m * k)
wheren
is the number of subsetsm
is the average number of elements per subset, andk
is the size of the universe of elements that can be in the sets. With random sets of sets that are much smaller than the possible universe of subsets of yourk
elements you won't construct most of the tree, and it will takeO(n * m)
for your tree.In theory traversing this tree could be time
O(n)
. But in practice you'll trim branches of the tree fairly early, and won't traverse most of the other subsets. A back of the envelope calculation suggests that if you haven
random sets out of ak
element universe withn << 2k
then a search of the tree isO(n0.5k)
. (At each integer, half the time it is in your set you're searching for subsets of and you split your search into 2, and half the time it isn't in your set and you eliminate half of your space. Afterj
integers you've got2j/2
searches going of sets of sets of size2-jn
. Thus by the time you get the searches down to single other subsets to compare with, there areO(n0.5)
searches going. The final comparison of bitmaps isO(k)
.)Note: I'm convinced by this back of the envelope calculation that the average performance is
o(n0.5+epsilon)
for everyepsilon > 0
, but the convergence is very slow. More precisely I suspect that the arithmetic average of the performance isn0.5 + O(sqrt(log(n))))
. But thatsqrt(log(n))
piece takes a long time to converge.Note that using the number of additional elements in the smallest set at this point or below in the tree lets your search trivially filter out all sets that are too large to be subsets. Depending on your dataset, this may or may not lead to useful speedups.
PengOne建议的方法是可行的,但效率不是很高。要了解它失败的原因,请考虑以下病态示例:
假设您有一个宇宙 U,其中有 n 个不同的元素,并让您正在搜索的所有集合的集合由 U 的所有子集组成,其中恰好有 k 个元素。那么这里确实没有一对集合是严格相互包含的;因此,在最坏的情况下,您将不得不搜索所有 n 选择 k 个可能的集合!换句话说,在最坏的情况下,使用他提出的数据结构并不比简单的线性搜索更好。
显然,您可以做得更好,并且要使用的正确数据结构是 trie: http:// en.wikipedia.org/wiki/Trie
要使 trie 适应集合而不仅仅是字符串,只需修复通用集合元素的排序,然后将每个子集编码为二进制有限长度的字符串,其中第 i 个字符是 0 或 1,取决于集合是否包含第 i 个元素。下面是 python 中的实现
现在这里是一个示例用法:
请注意,query_tree 执行的工作量与代表 query_tree 返回的所有结果集的子树的大小成正比。因此,我们的目标是计算其中一个子尝试的大小(平均),然后作为次要目标来最小化该数量。实现此目的的一种方法是按照降序频率对共相元素进行重新排序,以便它们在树的较低级别中重复的次数尽可能少。上面的代码中也做了这样的优化。第二个优化是缓存已经搜索过的树,以避免重做不必要的工作。
编辑:就在我完成输入后,我看到了 btilly 的答案,它对这个问题得出了或多或少相同的结论(模数一些技术挑剔,我已将其移至他的帖子的评论中。)
编辑 2:实现这实际上只是二元决策图的一个特例。现在确实没有足够的精力来修复这篇文章,所以就保持原样吧。也许明天修复它。 http://en.wikipedia.org/wiki/Binary_decision_diagram
The approach suggested by PengOne would work, but it is not very efficient. To see why it fails, consider the following pathological example:
Suppose you have a universe U, which has n distinct elements, and let the set of all the sets you are searching over consist of all subsets of U with exactly k elements. Then it is true that no pair of sets here are strictly contained in one another; and so in the worst case you would have to search over all n choose k possible sets! In other words, using his proposed data structure is no better than a naive linear search in the worst case.
Clearly you can do much better than this, and the correct data structure to use would be a trie: http://en.wikipedia.org/wiki/Trie
To adapt a trie to work for sets instead of just strings, it is sufficient to fix an ordering on the elements of the universal set, then encode each of your subsets as a binary string of finite length, where the ith character is 0 or 1 depending on whether the set contains the ith element. Here is an implementation in python
Now here is an example usage:
Note that the amount of work performed by query_tree is proportional to the size of the subtree which represents the set of all results returned by query_tree. Thus our goal is to compute the size of one of the subtries (on average) and then as a secondary goal to minimize this quantity. One way to do this is to reorder the elements of the universal in terms of descending frequency, so that they are repeated as few times as possible in the lower levels of the tree. This optimization is also done in the above code. A secondary optimization is to cache the trees which have already been searched to avoid having to redo unnecessary work.
EDIT: Just after I got done typing this up, I saw btilly's answer, which is comes to more or less the same conclusion about the problem (modulo some technical nitpicks which I have moved into the comments on his post.)
EDIT 2: Realized that this is really just a special case of a binary decision diagram. Don't really have enough energy to fix the write up right now, so will leave it as is. Perhaps fix it tomorrow. http://en.wikipedia.org/wiki/Binary_decision_diagram
这很有趣。我喜欢 PengOne 建议的哈斯图方法,但我认为你可以使用素数技巧非常快速地构建哈斯图。假设所有集合的并集得到自然数 1 到 N。将这些数字中的每一个与相应的素数映射,例如:
接下来,通过将与中的数字对应的每个素数相乘来计算每个集合的“分数”该集。例如,集合 {1,2,3} 的分数为 2*3*5 = 30。现在,要使集合 A 成为另一个集合 B 的真子集,分数(A)必须除以分数(B)(分数对于 {1,2}、{2,3} 和 {1,3} 分别为 6、15 和 10,每个都除以 30)。使用此分数构建您的哈斯图。
编辑:这似乎是很好的理论解决方案之一。可能不是该走的路。 yi_H 建议的位集同样好,并且不会遇到大整数问题。
This is interesting. I like the Hasse diagram approach PengOne suggests but I think you can build the Hasse diagram really quickly using a prime number trick. Lets say the union of all of the sets results in natural numbers 1 to N. Map each of these numbers with corresponding primes, like:
Next, caluclate a 'score' for each set by multiplying each of the prime numbers corresponding to the number in the set. For instance a set {1,2,3} would have a score 2*3*5 = 30. Now, for a set A to be a proper subset of another set B score(A) must divide score(B) (scores for {1,2}, {2,3} and {1,3} are 6, 15 and 10, each of which divide 30). Use this score to build your Hasse diagram.
Edit: This seems like one of the nice theoretical solutions. Probably not the way to go. Bitsets as suggested by yi_H is just as good and does not suffer from big integer troubles.
看看这个实现哈斯图的 python 库 python-lattice]1
Have a look at this python library that implements Hasse diagrams python-lattice]1