从一组集合中查找集合子集的最佳方法

发布于 2025-01-06 16:30:15 字数 558 浏览 4 评论 0原文

首先,对含糊不清的标题表示歉意。

假设我有以下一组集合:

Group 1

s1 = ( x1, y1 )
s2 = ( x2 )

Group 2

m1 = ( x1, y1, y2 )
m2 = ( x1 )
m3 = ( x1 , x2 )

对于 Group 1 中的每个集合 - 调用集合 s,我需要在 中找到集合>组 2 - 称之为 m - 这样 ms 的子集。

因此,对于我的示例,答案是:

s1 -> m2
s2 -> nothing

目前,我将值存储在 std:set 中,但如果需要,我可以更改它。此外,集合可能会变大,因此算法需要高效。目前我采用的是蛮力方法,但我对此并不完全满意。

有什么建议吗?

First, sorry for the ambiguous title.

Assume I have the following group of sets:

Group 1

s1 = ( x1, y1 )
s2 = ( x2 )

Group 2

m1 = ( x1, y1, y2 )
m2 = ( x1 )
m3 = ( x1 , x2 )

For each of the sets in Group 1 - call the set s, I need to find the sets in Group 2 - call it m - such that m is a subset of s.

So, for my example, the answer would be:

s1 -> m2
s2 -> nothing

For now, I'm storing the values in std:set, but I can change that if needed. Also, the sets can get big, so the algorithm needs to be efficient. For now I have a brute-force approach, which I'm not entirely satisfied with.

Any suggestions?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

梦途 2025-01-13 16:30:15

第一步是根据基数(即大小)对组 1 进行排序。那么算法的顺序是:

foreach std::set M in "Group 2" {
  foreach std::set S in "Group 1" and S.size()>=M.size() {  // replace with binary search
     if ( std::includes(S.begin(),S.end(),M.begin(),M.end()) )
       { /* M is a subset of S */ }
    }
  }
}

这应该具有时间复杂度 ~O(MSR),其中 M 是“组 2”中的集合数,S 是“组 1”中的集合数,R 是大小“组#1”中最大的集合。

编辑:我只是想到使用S.find()可能比调用std::includes()更有效(按顺序迭代)但我认为只有当 M.size() 远小于 S.size() 时,这才是正确的——O(M+S) vs O(MlogS)。

The first step would be to sort Group 1 according to cardinality (i.e. size). Then the algorithm is something on the order of:

foreach std::set M in "Group 2" {
  foreach std::set S in "Group 1" and S.size()>=M.size() {  // replace with binary search
     if ( std::includes(S.begin(),S.end(),M.begin(),M.end()) )
       { /* M is a subset of S */ }
    }
  }
}

This should have time complexity ~O(MSR), where M is the # of sets in "Group 2", S the # of sets in "Group 1", and R is the size of largest set in "Group #1".

Edit: It just occurred to me that it might be more efficient to use S.find() rather than calling std::includes() (which iterates sequentially) but I think that would only be true if M.size() is much smaller than S.size() -- O(M+S) vs O(MlogS).

看海 2025-01-13 16:30:15

您没有具体说明您的方法有多暴力。只要您在 std:: 命名空间中使用设置查询函数,那么它们就可能尽可能高效。
例如测试 set_intersection( s1.begin(), s2.end(), m1.begin(), m1.end() ) 是否等于 m1。

您可能比这更有效,因为您不需要匹配元素的副本,只是想知道它们都出现了。这可以通过复制 set_intersection 的代码但更改实现以简单地计算匹配元素的数量而不是将它们复制出来来完成。然后,如果计数与 m 的大小相同,则匹配。

至于容器,对于大型集合,我通常更喜欢排序双端队列而不是集合。内存在堆上的分布要少得多,这有助于缓存。它还避免了底层树的开销。当容器被创建一次但被搜索多次时,这尤其有用。

You are not specific about how brute-force your approach is. As long as you are using the set query functions in the std:: namespace then they are likely to be as efficient as they can be.
For example testing if set_intersection( s1.begin(), s2.end(), m1.begin(), m1.end() ) is equivalent to m1.

You could be more efficient than this, as you do not want a copy of the matching elements, just to know they all appear. This could be done by copying the code of set_intersection but changing the implementation to simply count the number of matching elements rather than copying them out. Then if the count is the same as the size of m then you have a match.

As for containers, I often prefer a sorted deque over a set for large collections. The memory is much less distributed over the heap which helps with caching. It also avoids the overhead of the underlying tree. This is especially beneficial when the containers are created once, but are searched multiple times.

千と千尋 2025-01-13 16:30:15

您的集合是否经常修改或者它们是只读的/大部分是只读的?

  • 如果经常修改,std::set 可以在修改和排序性能之间取得良好的平衡。
  • 如果是只读或主要读取,您可以使用排序的 std::vector。排序的成本很高,但实际上比在 std::set 中构建整个树要便宜,因此如果您很少进行排序,性能会更好。

一旦您完成了排序容器(无论是“自动排序”std::set 还是手动排序 std::vector),您可以使用 <代码>std::includes。顺便说一句,如果您需要找到正确的子集,您可以随后比较元素计数。

Are your sets frequently modified or are they read-only/mostly?

  • If frequently modified, std::set is a fine balance between modification and sort performance.
  • If read-only or read-mostly, you can use a sorted std::vector. Sorting is expensive, but is actually cheaper than constructing a whole tree in the std::set, so performance is better if you do it rarely enough.

Once you have made the sorted containers (be it "auto-sorted" std::set or manually-sorted std::vector), you can test for subset using std::includes. BTW, if you need to find proper subsets, you can compare element counts afterwards.

菊凝晚露 2025-01-13 16:30:15

你可以尝试这样的事情。
步骤:

  • 创建一个包含两个组中所有对象的数组
  • 将每个 s 和 m 转换为位数组,其中如果集合包含 object(i),则 array(i)=1,否则
  • m(k) 是 s( 的子集) 0 j) 如果 m(k) AND s(j) = m(k)

You can try something like this.
Steps:

  • create an array that contains all objects in both group
  • convert every s and m in a bit array, where array(i)=1 if the set contains object(i), 0 otherwise
  • m(k) is a subset of s(j) if m(k) AND s(j) = m(k)
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文