从一组集合中查找集合子集的最佳方法
首先,对含糊不清的标题表示歉意。
假设我有以下一组集合:
Group 1
s1 = ( x1, y1 )
s2 = ( x2 )
Group 2
m1 = ( x1, y1, y2 )
m2 = ( x1 )
m3 = ( x1 , x2 )
对于 Group 1
中的每个集合 - 调用集合 s
,我需要在 中找到集合>组 2
- 称之为 m
- 这样 m
是 s
的子集。
因此,对于我的示例,答案是:
s1 -> m2
s2 -> nothing
目前,我将值存储在 std:set
中,但如果需要,我可以更改它。此外,集合可能会变大,因此算法需要高效。目前我采用的是蛮力方法,但我对此并不完全满意。
有什么建议吗?
First, sorry for the ambiguous title.
Assume I have the following group of sets:
Group 1
s1 = ( x1, y1 )
s2 = ( x2 )
Group 2
m1 = ( x1, y1, y2 )
m2 = ( x1 )
m3 = ( x1 , x2 )
For each of the sets in Group 1
- call the set s
, I need to find the sets in Group 2
- call it m
- such that m
is a subset of s
.
So, for my example, the answer would be:
s1 -> m2
s2 -> nothing
For now, I'm storing the values in std:set
, but I can change that if needed. Also, the sets can get big, so the algorithm needs to be efficient. For now I have a brute-force approach, which I'm not entirely satisfied with.
Any suggestions?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
第一步是根据基数(即大小)对组 1 进行排序。那么算法的顺序是:
这应该具有时间复杂度 ~O(MSR),其中 M 是“组 2”中的集合数,S 是“组 1”中的集合数,R 是大小“组#1”中最大的集合。
编辑:我只是想到使用
S.find()
可能比调用std::includes()
更有效(按顺序迭代)但我认为只有当 M.size() 远小于 S.size() 时,这才是正确的——O(M+S) vs O(MlogS)。The first step would be to sort Group 1 according to cardinality (i.e. size). Then the algorithm is something on the order of:
This should have time complexity ~O(MSR), where M is the # of sets in "Group 2", S the # of sets in "Group 1", and R is the size of largest set in "Group #1".
Edit: It just occurred to me that it might be more efficient to use
S.find()
rather than callingstd::includes()
(which iterates sequentially) but I think that would only be true if M.size() is much smaller than S.size() -- O(M+S) vs O(MlogS).您没有具体说明您的方法有多暴力。只要您在 std:: 命名空间中使用设置查询函数,那么它们就可能尽可能高效。
例如测试 set_intersection( s1.begin(), s2.end(), m1.begin(), m1.end() ) 是否等于 m1。
您可能比这更有效,因为您不需要匹配元素的副本,只是想知道它们都出现了。这可以通过复制 set_intersection 的代码但更改实现以简单地计算匹配元素的数量而不是将它们复制出来来完成。然后,如果计数与 m 的大小相同,则匹配。
至于容器,对于大型集合,我通常更喜欢排序双端队列而不是集合。内存在堆上的分布要少得多,这有助于缓存。它还避免了底层树的开销。当容器被创建一次但被搜索多次时,这尤其有用。
You are not specific about how brute-force your approach is. As long as you are using the set query functions in the std:: namespace then they are likely to be as efficient as they can be.
For example testing if set_intersection( s1.begin(), s2.end(), m1.begin(), m1.end() ) is equivalent to m1.
You could be more efficient than this, as you do not want a copy of the matching elements, just to know they all appear. This could be done by copying the code of set_intersection but changing the implementation to simply count the number of matching elements rather than copying them out. Then if the count is the same as the size of m then you have a match.
As for containers, I often prefer a sorted deque over a set for large collections. The memory is much less distributed over the heap which helps with caching. It also avoids the overhead of the underlying tree. This is especially beneficial when the containers are created once, but are searched multiple times.
您的集合是否经常修改或者它们是只读的/大部分是只读的?
std::set
可以在修改和排序性能之间取得良好的平衡。std::vector
。排序的成本很高,但实际上比在std::set
中构建整个树要便宜,因此如果您很少进行排序,性能会更好。一旦您完成了排序容器(无论是“自动排序”
std::set
还是手动排序std::vector
),您可以使用 <代码>std::includes。顺便说一句,如果您需要找到正确的子集,您可以随后比较元素计数。Are your sets frequently modified or are they read-only/mostly?
std::set
is a fine balance between modification and sort performance.std::vector
. Sorting is expensive, but is actually cheaper than constructing a whole tree in thestd::set
, so performance is better if you do it rarely enough.Once you have made the sorted containers (be it "auto-sorted"
std::set
or manually-sortedstd::vector
), you can test for subset usingstd::includes
. BTW, if you need to find proper subsets, you can compare element counts afterwards.你可以尝试这样的事情。
步骤:
You can try something like this.
Steps: