所有相交集的并集

发布于 2024-07-23 08:56:32 字数 1343 浏览 12 评论 0原文

给定具有多个属性的对象列表,我需要找到由所有相交子集的并集创建的集合列表。

具体来说,这些是 Person 对象,每个对象都有许多属性。 我需要根据一些唯一标识符(例如 SSN、DLN 等)创建“主”集列表。

例如,如果人员 A 和人员 B 具有相同的 SSN,他们将创建一个集 i。 然后,如果 B 和 C 具有相同的 DLN,他们将创建一个集合 ii。 人员 D 和 E 具有相同的 SSN,但它(以及所有其他标识符)与人员 A、B 或 C 的任何标识符都不匹配。合并所有相交子集后,我最终会得到人员 A、B、C 的一组另一组是 D、E。

这是我的解决方案的伪代码。 我很好奇是否有人已经想出一种更有效的方法来合并所有可能的相交集。 请记住,集合之间的链接可能有 X 个人长(即 A 通过 SSN 匹配 B,B 通过 DLN 匹配 C,C 通过 SSN 匹配 D,D 通过某些其他标识符匹配 E,将导致一组中出现人员 AE)。 还假设将要实现的语言支持集合运算。

bigSetList = array of all of the uniq Sets
fullyTested = false
while (bigSetList.size() > 1) or (fullyTested is false)
    foreach thisSet in bigSetList  order by size desc
        if count(sets that intersect with thisSet) > 0
            newThisSet = thisSet
            intersectingSets = []
            bigSetList.delete(thisSet)
            foreach testSet in bigSetList
                if thisSet.intersects(testSet)
                    newThisSet.addAll(testSet)
                    intersectingSets.push(testSetID)
                end if
            end
            bigSetList.delete(intersectingSets)
            bigSetList.push(newThisSet)
            bigSetList.sort()
            break
        end if
    end foreach
    fullyTested = true  // have looped through every set in the list and found 0 intersect partners
end

Given a list of objects with multiple attributes I need to find the list of sets created by a union of all intersecting subsets.

Specifically these are Person objects, each with many attributes. I need to create a list of 'master' sets based on a handful of unique identifiers such as SSN, DLN, etc.

For instance, if Person A and Person B have the same SSN they create a set i. Then if Person B and C have the same DLN, they create a set ii. Person D and E have the same SSN but it (and all other identifiers) does not match any of the identifiers of Persons A, B or C. After merging all intersecting subsets I would end up with one set with Persons A,B,C and another set with Persons D, E.

Here is the psuedo-code for my solution. I am curious if anyone has already come up with a more efficient way of merging all possible intersecting sets. Keep in mind that the links between sets could be X Persons long (i.e. A matches B by SSN and B matches C by DLN and C matches D by SSN and D matches E by some other identifier would result in Persons A-E in one set). Also assume that the language this will be implemented in supports set operations.

bigSetList = array of all of the uniq Sets
fullyTested = false
while (bigSetList.size() > 1) or (fullyTested is false)
    foreach thisSet in bigSetList  order by size desc
        if count(sets that intersect with thisSet) > 0
            newThisSet = thisSet
            intersectingSets = []
            bigSetList.delete(thisSet)
            foreach testSet in bigSetList
                if thisSet.intersects(testSet)
                    newThisSet.addAll(testSet)
                    intersectingSets.push(testSetID)
                end if
            end
            bigSetList.delete(intersectingSets)
            bigSetList.push(newThisSet)
            bigSetList.sort()
            break
        end if
    end foreach
    fullyTested = true  // have looped through every set in the list and found 0 intersect partners
end

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

眼中杀气 2024-07-30 08:56:32

为了扩展我在原始帖子中的评论,您想要创建一个集合列表,其中给定集合的每个成员与该集合中的至少一个其他成员共享至少一个属性。

天真地,这可以通过找到共享属性的所有对并迭代地将具有相同伙伴的对合并在一起来解决。 这将是 O(N^3)(N^2 用于迭代对,最多 N 个单独的集合来确定成员资格)。

您还可以将此问题视为确定图的连通分量,其中每个对象和每个唯一的属性值都是一个节点; 每个对象都将连接到它的每个属性值。 设置该图需要线性时间,您可以通过广度或深度优先搜索在线性时间内确定连接的组件。

To expand on my comment in the original post, you want to create a list of sets where each member of a given set shares at least one attribute with at least one other member of that set.

Naively, this can be solved either by finding all pairs that share an attribute and merging pairs together that have the same partner iteratively. This would be O(N^3) (N^2 for iterating over pairs, and up to N separate sets to determine membership).

You can also think of this problem as determining the connected component of a graph, where every object and every unique attribute value is a node; each object would be connected to each of its attribute values. Setting up that graph would take linear time, and you could determine the connected components in linear time with a breadth or depth first search.

灼疼热情 2024-07-30 08:56:32

我猜想您的 Person 对象有一组相对较小的属性(与您正在考虑的 Person 对象的数量相比)。 如果要减少多次遍历 Person 对象列表,可以获取一个 Person,将其属性放入已知可能连接的列表中,然后继续处理下一个 Person。 对于每个连续的 Person,您可以查看它是否与任何先前的连接有关。 如果是这样,那么您可以将其独特的属性添加到可能的连接中。 您应该能够一次性处理所有 Person 对象。 结果中可能会有一些断开连接的集合,因此在创建第一个图后检查未连接的 Person 对象可能是值得的。

I would guess that you have a relatively small set of attributes for the Person object (as compared to the number of Person objects you're considering). If you want to reduce traversing the list of Person objects multiple times, you can take a Person, put its attributes into a list of known possible connections and then move on to the next Person. With each successive Person, you see if it is connected to any prior connection. If so, then you add its unique attributes to the possible connections. You should be able to process all Person objects in one pass. It's possible that you'll have some disconnected sets in the results, so it may be worth examining the unconnected Person objects after you've created the first graph.

属性 2024-07-30 08:56:32
while (!people.isEmpty()) {
    Person first = people.get(0);
    people.remove(first);
    Set<Person> set = makeSet(first);
    for (Person person : people) {
        for (Person other : set) {
            if (person.isRelatedTo(other)) {
                set.add(person);
                people.remove(person);
            }
        }
    }
    sets.add(set);
}
for (Set<Person> a : sets) {
    for (Set<Person> b : sets.except(a)) {
        for (Person person : a)
            for (Person other : b) {
                if (person.isRelatedTo(other)) {
                    a.addAll(b);
                    b.clear();
                    sets.remove(b);
                    break;
                }
            }
    }
}
while (!people.isEmpty()) {
    Person first = people.get(0);
    people.remove(first);
    Set<Person> set = makeSet(first);
    for (Person person : people) {
        for (Person other : set) {
            if (person.isRelatedTo(other)) {
                set.add(person);
                people.remove(person);
            }
        }
    }
    sets.add(set);
}
for (Set<Person> a : sets) {
    for (Set<Person> b : sets.except(a)) {
        for (Person person : a)
            for (Person other : b) {
                if (person.isRelatedTo(other)) {
                    a.addAll(b);
                    b.clear();
                    sets.remove(b);
                    break;
                }
            }
    }
}
荒岛晴空 2024-07-30 08:56:32

首先,标识符中是否存在某种固有的层次结构,并且较高类别的矛盾标识符是否会抵消较低类别的相同标识符? 例如,如果 A 和 B 具有相同的 SSN,B 和 C 具有相同的 DLN,并且 C 和 D 具有相同的 SSN,但与 A 和 B 的 SSN 不匹配,这是否意味着有两个组或一个?

假设矛盾并不重要,您正在处理等价类,正如用户 57368(未知的 Google)所述。 对于等价类,人们经常转向 Union-find 结构。 至于如何执行这些联合,这并不是一件小事,因为我假设当 A 和 B 具有相同的 SSN 时,您没有直接链接 AB。 相反,我们的集合将由两种元素组成。 每个(attribute type, attribute value) = attribute 对都是一个元素。 您还拥有与 object 相对应的元素。 当您迭代对象的属性列表时,请执行联合(object, attribute)

并查数据结构的重要特征之一是结果结构代表集合。 它可以让你查询“A 属于哪个集合?” 如果这还不够,请告诉我们,我们可以改进结果。

但最重要的特征是该算法具有类似于每个联合和查询操作的恒定时间行为。

First, is there some inherent hierarchy in identifiers, and do contradicting identifiers of a higher sort cancel out the same identifier of a lower sort? For example, if A and B have the same SSN, B and C have the same DLN, and C and D have the same SSN which does not match A and B's SSN, does that mean that there are two groups or one?

Assuming contradictions don't matter, you are dealing with equivalence classes, as user 57368 (unknown Google) states. For equivalence classes, people often turn to the Union-find structure. As for how to perform these unions, it's not immediately trivial because I assume you don't have the direct link A-B when both A and B have the same SSN. Instead, our sets will consist of two kinds of elements. Each (attribute type, attribute value) = attribute pair is an element. You also have elements corresponding to objects. When you iterate through the list of attributes for an object, perform the union (object, attribute).

One of the important features of the Union-find data structure is that the resulting structure represents the set. It lets you query "What set is A in?" If this is not enough, let us know and we can improve the result.

But the most important feature is that the algorithm has something which resembles constant-time behavior for each union and query operation.

讽刺将军 2024-07-30 08:56:32

因此,您的集合示例可能如下所示:

A { ss |-> 42, dl |-> 123 }
B { ss |-> 42, dl |-> 456 }
C { ss |-> 23, dl |-> 456 }
D { ss |-> 89, dl |-> 789 }
E { ss |-> 89, dl |-> 432 }

然后我建议使用一种算法,通过增量合并或将每个集合插入到多集合中来构建多集合:

迭代 1。第一个集合成为唯一的多集合:

{A} { ss |-> [42], dl |-> [123] }

迭代 2. 由于 SSN 已存在,因此将下一个集合合并到第一个集合中:

{A,B} { ss |-> [42], dl |-> [123,456] }

迭代 3. 再次合并,因为 DLN 已存在:

{A,B,C} { ss |-> [23,42], dl |-> [123,456] }

迭代 4. 由于没有匹配项,因此插入新的多重集合:

{A,B,C} { ss |-> [23,42], dl |-> [123,456] }
{D}     { ss |-> [89],    dl |-> [789]     }

迭代 5. 与第二个集合合并多集合,因为 SSN 在那里:

{A,B,C} { ss |-> [23,42], dl |-> [123,456] }
{D,E}   { ss |-> [89],    dl |-> [432,789] }

因此,在每次迭代(每个集合一个)中,您必须识别与您正在处理的集合具有共同值的所有个多集合,并合并 >所有这些加在一起。

一般来说,如果有 n 个集合,每个集合都有固定的 k 个属性,则该算法的运行时间为 O(nnk) = O(n2)。 如果所有属性值都不同,则会出现最坏情况的行为。 当属性值之间有更多共享时,插入和确定属性值集(如[23,42])中的成员资格所需的时间将成为主导因素,因此属性值集应该是高效的。

如果您使用最佳不相交集,则每个查找或合并操作将在摊销时间内运行O(α(n))。

因此,对于每次迭代,最多会有 n 个多重集合(到目前为止还没有合并多重集合的情况)。 要将新集合集成到多集合中,您必须对每个多集合 k 集执行查找操作,以识别要合并的所有多集合,这需要 O(nkα(n)) 的时间。 要合并以这种方式找到的最多 k 个多重集合,需要 O(k2α(n))。

因此,对于所有迭代,时间都受 O(n(nkα(n)+k2α(n))) = O(n(nkα(n))) = O(n2kα(n)) = O(n2α(n)) 因为 k 是一个常数。

由于 α(n) 在所有实际用途中也是一个常数,因此总时间受 O(n2) 限制。

So your collection example could look like this:

A { ss |-> 42, dl |-> 123 }
B { ss |-> 42, dl |-> 456 }
C { ss |-> 23, dl |-> 456 }
D { ss |-> 89, dl |-> 789 }
E { ss |-> 89, dl |-> 432 }

Then I would suggest to use an algorithm where you build up multi-collections by incrementally merging or inserting each collection into the multi-collections:

Iteration 1. The first collection becomes the only multi-collection:

{A} { ss |-> [42], dl |-> [123] }

Iteration 2. Merge the next collection into the first since SSN is already present:

{A,B} { ss |-> [42], dl |-> [123,456] }

Iteration 3. Merge again, since the DLN is already there:

{A,B,C} { ss |-> [23,42], dl |-> [123,456] }

Iteration 4. Insert a new multi-collection since there is no match:

{A,B,C} { ss |-> [23,42], dl |-> [123,456] }
{D}     { ss |-> [89],    dl |-> [789]     }

Iteration 5. Merge with second multi collection, since the SSN is there:

{A,B,C} { ss |-> [23,42], dl |-> [123,456] }
{D,E}   { ss |-> [89],    dl |-> [432,789] }

So in each iteration (one for each collection), you must identify all multi-collections that have values in common with the collection you are processing, and merge all these together.

In general, if there are n collections each with a constant k number of attributes, then this algorithm will run in time O(nnk) = O(n2). The worst-case behaviour is exibited if all attribute values are distinct. When there is more sharing between attribute values, the time that it takes to insert and determine membership in the attribute value sets (like [23,42]) gets to be the dominant factor, so the attribute value sets should be efficient.

If you use optimal disjoint sets, then each Find or Merge operation will run in amortized time O(α(n)).

Thus, for each iteration there will be at most n multi-collections (the situation when no multi-collections have been merged so far). To integrate the new collection into the multi-collections, you will have to perform a Find operation on each of the multi-collections k sets to identify all multi-collections to be merged, which takes time bounded by O(nkα(n)). To merge the at most k multi-collections found this way takes O(k2α(n)).

So for all iteration the time is bounded by O(n(nkα(n)+k2α(n))) = O(n(nkα(n))) = O(n2kα(n)) = O(n2α(n)) since k is a constant.

Because α(n) for all practical purposes is also a constant, the total time is bounded by O(n2).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文