Data structure for grouping the elements of equivalence classes
I have to implement a data structure that groups elements into equivalence classes.
The API:
interface Grouper<T> {
    void same(T l, T r);
    Set<EquivalenceClass<T>> equivalenceClasses();
}
interface EquivalenceClass<T> {
    Set<T> members();
}
For example, the grouping behaves like this:
Grouper g;
g.same(a, b);
g.equivalenceClasses() -> [[a,b]]
g.same(b, a);
g.equivalenceClasses() -> [[a,b]]
g.same(b, c);
g.equivalenceClasses() -> [[a,b,c]]
g.same(d, e);
g.equivalenceClasses() -> [[a,b,c], [d,e]]
g.same(c, d);
g.equivalenceClasses() -> [[a,b,c,d,e]]
I'm looking for an implementation that works for up to ~10 million entries. It should be optimized for filling it and then retrieving the equivalence classes once.
Comments (3)
Take a look at Union-Find. The union ("same") can be done trivially in O(log N), and in effectively O(1) amortized time with optimizations such as path compression and union by rank. The "equivalenceClasses" is O(N), which is the cost of visiting everything anyway.
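A minimal sketch of the Union-Find idea, assuming elements are non-null and have sensible equals/hashCode. The class name UnionFindGrouper and the helpers idOf and find are hypothetical, and for brevity it returns Set<Set<T>> rather than the Set<EquivalenceClass<T>> from the question:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: union-find with union by size and path compression.
class UnionFindGrouper<T> {
    private final Map<T, Integer> index = new HashMap<>(); // element -> dense id
    private final List<T> elements = new ArrayList<>();    // dense id -> element
    private int[] parent = new int[16];
    private int[] size = new int[16];

    // Returns the dense id of x, registering x as a singleton class if it is new.
    private int idOf(T x) {
        Integer id = index.get(x);
        if (id != null) return id;
        int n = elements.size();
        if (n == parent.length) {
            parent = Arrays.copyOf(parent, n * 2);
            size = Arrays.copyOf(size, n * 2);
        }
        parent[n] = n;
        size[n] = 1;
        index.put(x, n);
        elements.add(x);
        return n;
    }

    // Finds the root of x, then points every node on the path directly at it.
    private int find(int x) {
        int root = x;
        while (parent[root] != root) root = parent[root];
        while (parent[x] != root) {
            int next = parent[x];
            parent[x] = root;
            x = next;
        }
        return root;
    }

    public void same(T l, T r) {
        int a = find(idOf(l));
        int b = find(idOf(r));
        if (a == b) return;
        if (size[a] < size[b]) { int t = a; a = b; b = t; }
        parent[b] = a;          // attach the smaller tree under the larger one
        size[a] += size[b];
    }

    // One O(N) pass that buckets every element by its root.
    public Set<Set<T>> equivalenceClasses() {
        Map<Integer, Set<T>> byRoot = new HashMap<>();
        for (int i = 0; i < elements.size(); i++) {
            byRoot.computeIfAbsent(find(i), k -> new HashSet<>()).add(elements.get(i));
        }
        return new HashSet<>(byRoot.values());
    }
}
```

With around 10 million entries, the boxed Integer values in the map are likely to dominate memory use; if the elements can be numbered up front, plain int arrays alone would be leaner.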
If you are only going to query the equivalence classes once, the best solution is to build an undirected graph over the elements. Each equivalence is an undirected edge between the two items, and the equivalence classes correspond to the connected components. The time and space complexity will both be linear if you do it right.
Alternatively, you can use a Union-Find data structure, which will give you almost-linear time complexity. It may also be considered simpler, because all the complexities are encapsulated into the data structure. The reason Union-Find is not linear comes down to supporting efficient queries while the classes are growing.
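A rough sketch of the graph approach, assuming non-null elements with sensible equals/hashCode. The class name GraphGrouper is hypothetical, and it returns Set<Set<T>> instead of the question's EquivalenceClass wrapper to keep it short. Each same call only records an edge; the components are found in a single iterative depth-first traversal at the end:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: equivalence classes as connected components of an undirected graph.
class GraphGrouper<T> {
    private final Map<T, List<T>> adjacency = new HashMap<>();

    // Records an undirected edge between l and r; no merging happens here.
    public void same(T l, T r) {
        adjacency.computeIfAbsent(l, k -> new ArrayList<>()).add(r);
        adjacency.computeIfAbsent(r, k -> new ArrayList<>()).add(l);
    }

    // Visits every node and edge once, using an explicit stack to avoid deep recursion.
    public Set<Set<T>> equivalenceClasses() {
        Set<T> visited = new HashSet<>();
        Set<Set<T>> classes = new HashSet<>();
        for (T start : adjacency.keySet()) {
            if (!visited.add(start)) continue; // already part of an earlier component
            Set<T> component = new HashSet<>();
            Deque<T> stack = new ArrayDeque<>();
            stack.push(start);
            while (!stack.isEmpty()) {
                T node = stack.pop();
                component.add(node);
                for (T next : adjacency.get(node)) {
                    if (visited.add(next)) stack.push(next);
                }
            }
            classes.add(component);
        }
        return classes;
    }
}
```

The trade-off is that every edge is stored, including redundant ones, so memory grows with the number of same calls rather than with the number of elements.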
Union-find is the best data structure for your problem, as long as you only care about total running time (some operations may be slow, but the total cost of all operations is guaranteed to be nearly linear). Enumerating the members of each set is not typically supported in the plain textbook version of union-find, though. As the name suggests, union-find typically only supports union (i.e., same) and find, which returns an identifier guaranteed to be the same as the identifier returned by a call to find on any other element in the same set. If you need to enumerate the members of each set, you may have to implement it yourself so you can add, for example, child pointers that let you traverse each tree representing a set.
If you are implementing this yourself, you don't have to implement the full union-find data structure to achieve amortized O(lg n) time per operation. Essentially, in this "light" version of union-find, each set would be a singly linked list with an extra pointer inside each node that points to a set identifier node, which can be used to test whether two nodes belong to the same list. When the same method is executed, you can just append the smaller list to the larger one and update the set identifiers for the elements of the smaller list. The total cost is at most O(lg n) per element, because an element can be a member of the smaller list involved in a same operation at most O(lg n) times: each time that happens, the list it belongs to at least doubles in size.
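A sketch of this "light" variant, with two simplifications that are my own assumptions rather than part of the answer: it uses an ArrayList in place of a hand-rolled singly linked list, and the list object itself plays the role of the set identifier node. The class name ListGrouper and the helper classFor are hypothetical, and the result is a Set<Set<T>> rather than the question's EquivalenceClass type:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: each class is a member list; merging appends the smaller list to the larger.
class ListGrouper<T> {
    // Maps every element to the list holding its class; the list doubles as the set identifier.
    private final Map<T, List<T>> classOf = new HashMap<>();

    // Returns the class list of x, creating a singleton class if x is new.
    private List<T> classFor(T x) {
        return classOf.computeIfAbsent(x, k -> {
            List<T> members = new ArrayList<>();
            members.add(k);
            return members;
        });
    }

    public void same(T l, T r) {
        List<T> a = classFor(l);
        List<T> b = classFor(r);
        if (a == b) return;                       // same identifier: already equivalent
        if (a.size() < b.size()) { List<T> t = a; a = b; b = t; }
        for (T x : b) classOf.put(x, a);          // re-label only the smaller class
        a.addAll(b);
    }

    public Set<Set<T>> equivalenceClasses() {
        // Collect each list once by identity, then copy it into a set of members.
        Set<List<T>> distinct = Collections.newSetFromMap(new IdentityHashMap<List<T>, Boolean>());
        distinct.addAll(classOf.values());
        Set<Set<T>> classes = new HashSet<>();
        for (List<T> members : distinct) classes.add(new HashSet<>(members));
        return classes;
    }
}
```

Because the member lists are maintained as the structure is filled, enumerating the classes needs no tree traversal, at the price of the O(lg n) re-labeling cost per element described above.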