快速检查集合是否是存储集合的超集

发布于 2025-01-07 06:57:22 字数 1312 浏览 1 评论 0原文

问题是

我得到了 N 个 C 布尔值数组。我想将它们组织成一个数据结构,使我能够尽快执行以下操作:给定一个新数组,如果该数组是任何存储数组的“超集”,则返回 true。对于超集,我的意思是:如果 A[i] 对于每个 i 都为真,且 B[i] 为真,则 A 是 B 的超集。如果 B[i] 为假,则 A[i] 可以是任何值。

或者,就集合而不是数组而言:

将 N 个集合(每个集合有 C 个可能的元素)存储到数据结构中,以便您可以快速查找给定集合是否是任何存储集合的超集。< /strong>

构建数据结构可以花费尽可能长的时间,但查找应该尽可能高效,并且数据结构不能占用太多空间。

在某些上下文中,

我认为这本身就是一个有趣的问题,但对于我真正想要解决的问题,您可以假设以下内容:

  • N = 10000
  • C = 1000
  • 存储的数组是稀疏的
  • 查找的数组是随机的(因此不稀疏)

到目前为止我所想出的

  1. 对于 O(NC) 查找:只需迭代所有数组即可。但这太慢了。

  2. 对于 O(C) 查找:我在这里有一个很长的描述,但正如 Amit 在评论中指出的那样,它基本上是一个 BDD。虽然它的查找速度很快,但它的节点数量呈指数级增长。由于 N 和 C 如此之大,这占用了太多空间。

我希望在这个 O(N*C) 和 O(C) 解决方案之间,可能存在一个不需要指数空间的 O(log(N)*C) 解决方案。

编辑:我想出了一个新想法

  • 对于 O(sqrt(N)C) 查找:将数组存储为前缀特里树。查找数组 A 时,如果 A[i]=0,则访问相应的子树;如果 A[i]=1,则访问两个子树。

    我的直觉告诉我,如果您假设存储的数组是随机的,那么这应该使查找的(平均)复杂度为 O(sqrt(N)C)。但是: 1.它们不是,数组很稀疏。 2.这只是直觉,我无法证明这一点。

我将尝试这个新想法和 BDD 方法,看看这两种方法中哪一种效果最好。

但与此同时,这个问题不是更频繁地发生吗?它没有名字吗?之前不是有研究吗?真的感觉就像我在这里重新发明轮子。

The problem

I am given N arrays of C booleans. I want to organize these into a datastructure that allows me to do the following operation as fast as possible: Given a new array, return true if this array is a "superset" of any of the stored arrays. With superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.

Or, in terms of sets instead of arrays:

Store N sets (each with C possible elements) into a datastructure so you can quickly look up if a given set is a superset of any of the stored sets.

Building the datastructure can take as long as possible, but the lookup should be as efficient as possible, and the datastructure can't take too much space.

Some context

I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:

  • N = 10000
  • C = 1000
  • The stored arrays are sparse
  • The looked up arrays are random (so not sparse)

What I've come up with so far

  1. For O(NC) lookup: Just iterate all the arrays. This is just too slow though.

  2. For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.

I hope that in between this O(N*C) and O(C) solution, there's maybe a O(log(N)*C) solution that doesn't require an exponential amount of space.

EDIT: A new idea I've come up with

  • For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.

    My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)C), if you assume that the stored arrays are random. But: 1. they're not, the arrays are sparse. And 2. it's only intuition, I can't prove it.

I will try out both this new idea and the BDD method, and see which of the 2 work out best.

But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

暖树树初阳… 2025-01-14 06:57:22

只是为了向前缀特里解决方案添加一些背景信息,最近我发现了以下论文:

I.Savnik:用于快速子集和超集查询的索引数据结构CD-ARES,IFIP LNCS,2013。

本文提出了该集合-trie数据结构(容器),为使用trie数据结构的集合的高效存储和查询提供支持,支持查找所有集合等操作集合集合中给定集合的超集/子集。

对于任何对实际实现感兴趣的 python 用户,我想出了一个部分基于上述论文的 python3 包。它包含一个基于 trie 的集合容器以及一个其中键为集合的映射容器。您可以在 github 上找到它。

Just to add some background information to the prefix trie solution, recently I found the following paper:

I.Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.

The paper proposes the set-trie data structure (container) which provides support for efficient storage and querying of sets of sets using the trie data structure, supporting operations like finding all the supersets/subsets of a given set from a collection of sets.

For any python users interested in an actual implementation, I came up with a python3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on github.

小红帽 2025-01-14 06:57:22

我认为前缀特里树是一个很好的开始。

由于你的数组很稀疏,我还会对它们进行批量测试。如果 (B1 ∪ B2) ⊂ A,则两者都包含在内。因此,我们的想法是按对对数组进行“或”打包,并重复直到只有一个“根”数组(只需要两倍的空间)。它允许您提前对您的问题回答“是”,这在如果您不需要知道数组实际上是否包含时非常有用。

您可以独立地为每个数组应用一个保留排序的哈希函数。

即:B ⊂ A ⇒ h(B) ≺ h(A)

将位组合在一起就是这样一个函数,但您也可以在数组的适当分区中对每个 1 位进行计数。在这里,您可以更快地消除候选者(对于特定数组回答“否”)。

I think prefix trie is a great start.

Since yours arrays are sparse, I would additionally test them in bulk. If (B1 ∪ B2) ⊂ A, both are included. So the idea is to OR-pack arrays by pairs, and to reiterate until there is only one "root" array (it would take only twice as much space). It allows to answer 'Yes' to your question earlier, which is mainly useful if you don't need to know with array is actually contained.

Independently, you can apply for each array a hash function preserving ordering.

Ie : B ⊂ A ⇒ h(B) ≺ h(A)

ORing bits together is such a function, but you can also count each 1-bit in adequate partitions of the array. Here, you can eliminate candidates faster (answering 'No' for a particular array).

爱冒险 2025-01-14 06:57:22

您可以通过首先将集合列表减少为“最小”集合来简化问题:仅保留那些不是任何其他集合的超集的集合。问题仍然是相同的,因为如果某个输入集 A 是您删除的某个集 B 的超集,那么它也是至少一个“最小”子集的超集< B 的 code>C 未删除。这样做的好处是你倾向于消除大集合,这使得问题的成本更低。

从那里我将使用某种 ID3 或 C4.5 算法。

You can simplify the problem by first reducing your list of sets to "minimal" sets: keep only those sets which are not supersets of any other ones. The problem remains the same because if some input set A is a superset of some set B you removed, then it is also a superset of at least one "minimal" subset C of B which was not removed. The advantage of doing this is that you tend to eliminate large sets, which makes the problem less expensive.

From there I would use some kind of ID3 or C4.5 algorithm.

一刻暧昧 2025-01-14 06:57:22

基于 trie 解决方案和 @mmihaltz 提到的论文,还可以通过使用 Python 现有的高效 trie 实现来实现查找子集的方法。下面我使用包datrie。唯一的缺点是键必须转换为字符串,这可以通过 "".join(chr(i) for i in myset) 来完成。然而,这将元素的范围限制为大约 110000 个。

from datrie import BaseTrie, BaseState

def existsSubset(trie, setarr, trieState=None):

    if trieState is None:
        trieState = BaseState(trie)

    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        if trieState2.walk(elem):
            if trieState2.is_terminal() or existsSubset(trie, setarr[i:], trieState2): 
                return True
            trieState.copy_to(trieState2)
    return False

trie 可以像字典一样使用,但必须在开头提供可能元素的范围:

alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)

for subset in sets:
   trie["".join(chr(i) for i in subset)] = 0 # the assigned value does not matter

请注意,上面的 trie 实现仅有效键大于(且不等于)0。否则,整数到字符的映射将无法正常工作。这个问题可以通过索引移位来解决。

可以找到还涵盖元素转换的 cython 实现 此处

Building on the trie solution and the paper mentioned by @mmihaltz, it is also possible to implement a method to find subsets by using already existing efficient trie implementations for python. Below I use the package datrie. The only downside is that the keys must be converted to strings, which can be done with "".join(chr(i) for i in myset). This, however, limits the range of elements to about 110000.

from datrie import BaseTrie, BaseState

def existsSubset(trie, setarr, trieState=None):

    if trieState is None:
        trieState = BaseState(trie)

    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        if trieState2.walk(elem):
            if trieState2.is_terminal() or existsSubset(trie, setarr[i:], trieState2): 
                return True
            trieState.copy_to(trieState2)
    return False

The trie can be used like dictionary, but the range of possible elements has to be provided at the beginning:

alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)

for subset in sets:
   trie["".join(chr(i) for i in subset)] = 0 # the assigned value does not matter

Note that the trie implementation above works only with keys larger than (and not equal to) 0. Otherwise, the integer to character mapping does not work properly. This problem can be solved with an index shift.

A cython implementation that also covers the conversion of elements can be found here.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文