快速检查集合是否是存储集合的超集

发布于 2025-01-07 06:57:22 字数 1312 浏览 4 评论 0原文

问题是

我得到了 N 个 C 布尔值数组。我想将它们组织成一个数据结构，使我能够尽快执行以下操作：给定一个新数组，如果该数组是任何存储数组的“超集”，则返回 true。对于超集，我的意思是：如果 A[i] 对于每个 i 都为真，且 B[i] 为真，则 A 是 B 的超集。如果 B[i] 为假，则 A[i] 可以是任何值。

或者，就集合而不是数组而言：

将 N 个集合（每个集合有 C 个可能的元素）存储到数据结构中，以便您可以快速查找给定集合是否是任何存储集合的超集。< /strong>

构建数据结构可以花费尽可能长的时间，但查找应该尽可能高效，并且数据结构不能占用太多空间。

在某些上下文中，

我认为这本身就是一个有趣的问题，但对于我真正想要解决的问题，您可以假设以下内容：

N = 10000
C = 1000
存储的数组是稀疏的
查找的数组是随机的（因此不稀疏）

到目前为止我所想出的

对于 O(NC) 查找：只需迭代所有数组即可。但这太慢了。
对于 O(C) 查找：我在这里有一个很长的描述，但正如 Amit 在评论中指出的那样，它基本上是一个 BDD。虽然它的查找速度很快，但它的节点数量呈指数级增长。由于 N 和 C 如此之大，这占用了太多空间。

我希望在这个 O(N*C) 和 O(C) 解决方案之间，可能存在一个不需要指数空间的 O(log(N)*C) 解决方案。

编辑：我想出了一个新想法

对于 O(sqrt(N)C) 查找：将数组存储为前缀特里树。查找数组 A 时，如果 A[i]=0，则访问相应的子树；如果 A[i]=1，则访问两个子树。
我的直觉告诉我，如果您假设存储的数组是随机的，那么这应该使查找的（平均）复杂度为 O(sqrt(N)C)。但是： 1.它们不是，数组很稀疏。 2.这只是直觉，我无法证明这一点。

我将尝试这个新想法和 BDD 方法，看看这两种方法中哪一种效果最好。

但与此同时，这个问题不是更频繁地发生吗？它没有名字吗？之前不是有研究吗？真的感觉就像我在这里重新发明轮子。

原文

The problem

I am given N arrays of C booleans. I want to organize these into a datastructure that allows me to do the following operation as fast as possible: Given a new array, return true if this array is a "superset" of any of the stored arrays. With superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.

Or, in terms of sets instead of arrays:

Store N sets (each with C possible elements) into a datastructure so you can quickly look up if a given set is a superset of any of the stored sets.

Building the datastructure can take as long as possible, but the lookup should be as efficient as possible, and the datastructure can't take too much space.

Some context

I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:

N = 10000
C = 1000
The stored arrays are sparse
The looked up arrays are random (so not sparse)

What I've come up with so far

For O(NC) lookup: Just iterate all the arrays. This is just too slow though.
For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.

I hope that in between this O(N*C) and O(C) solution, there's maybe a O(log(N)*C) solution that doesn't require an exponential amount of space.

EDIT: A new idea I've come up with

For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.
My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)C), if you assume that the stored arrays are random. But: 1. they're not, the arrays are sparse. And 2. it's only intuition, I can't prove it.

I will try out both this new idea and the BDD method, and see which of the 2 work out best.

But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

暖树树初阳… 2025-01-14 06:57:22

只是为了向前缀特里解决方案添加一些背景信息，最近我发现了以下论文：

I.Savnik：用于快速子集和超集查询的索引数据结构。 CD-ARES，IFIP LNCS，2013。

本文提出了该集合-trie数据结构（容器），为使用trie数据结构的集合的高效存储和查询提供支持，支持查找所有集合等操作集合集合中给定集合的超集/子集。

对于任何对实际实现感兴趣的 python 用户，我想出了一个部分基于上述论文的 python3 包。它包含一个基于 trie 的集合容器以及一个其中键为集合的映射容器。您可以在 github 上找到它。

回复收藏 0 原文

小红帽 2025-01-14 06:57:22

我认为前缀特里树是一个很好的开始。

由于你的数组很稀疏，我还会对它们进行批量测试。如果 (B1 ∪ B2) ⊂ A，则两者都包含在内。因此，我们的想法是按对对数组进行“或”打包，并重复直到只有一个“根”数组（只需要两倍的空间）。它允许您提前对您的问题回答“是”，这在如果您不需要知道数组实际上是否包含时非常有用。

您可以独立地为每个数组应用一个保留排序的哈希函数。

即：B ⊂ A ⇒ h(B) ≺ h(A)

将位组合在一起就是这样一个函数，但您也可以在数组的适当分区中对每个 1 位进行计数。在这里，您可以更快地消除候选者（对于特定数组回答“否”）。

回复收藏 0 原文

爱冒险 2025-01-14 06:57:22

您可以通过首先将集合列表减少为“最小”集合来简化问题：仅保留那些不是任何其他集合的超集的集合。问题仍然是相同的，因为如果某个输入集 A 是您删除的某个集 B 的超集，那么它也是至少一个“最小”子集的超集< B 的 code>C 未删除。这样做的好处是你倾向于消除大集合，这使得问题的成本更低。

从那里我将使用某种 ID3 或 C4.5 算法。

回复收藏 0 原文

一刻暧昧 2025-01-14 06:57:22

基于 trie 解决方案和 @mmihaltz 提到的论文，还可以通过使用 Python 现有的高效 trie 实现来实现查找子集的方法。下面我使用包datrie。唯一的缺点是键必须转换为字符串，这可以通过 "".join(chr(i) for i in myset) 来完成。然而，这将元素的范围限制为大约 110000 个。

from datrie import BaseTrie, BaseState

def existsSubset(trie, setarr, trieState=None):

    if trieState is None:
        trieState = BaseState(trie)

    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        if trieState2.walk(elem):
            if trieState2.is_terminal() or existsSubset(trie, setarr[i:], trieState2): 
                return True
            trieState.copy_to(trieState2)
    return False

trie 可以像字典一样使用，但必须在开头提供可能元素的范围：

alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)

for subset in sets:
   trie["".join(chr(i) for i in subset)] = 0 # the assigned value does not matter

请注意，上面的 trie 实现仅有效键大于（且不等于）0。否则，整数到字符的映射将无法正常工作。这个问题可以通过索引移位来解决。

可以找到还涵盖元素转换的 cython 实现此处。

Building on the trie solution and the paper mentioned by @mmihaltz, it is also possible to implement a method to find subsets by using already existing efficient trie implementations for python. Below I use the package datrie. The only downside is that the keys must be converted to strings, which can be done with "".join(chr(i) for i in myset). This, however, limits the range of elements to about 110000.

from datrie import BaseTrie, BaseState

def existsSubset(trie, setarr, trieState=None):

    if trieState is None:
        trieState = BaseState(trie)

    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        if trieState2.walk(elem):
            if trieState2.is_terminal() or existsSubset(trie, setarr[i:], trieState2): 
                return True
            trieState.copy_to(trieState2)
    return False

The trie can be used like dictionary, but the range of possible elements has to be provided at the beginning:

alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)

for subset in sets:
   trie["".join(chr(i) for i in subset)] = 0 # the assigned value does not matter

Note that the trie implementation above works only with keys larger than (and not equal to) 0. Otherwise, the integer to character mapping does not work properly. This problem can be solved with an index shift.

A cython implementation that also covers the conversion of elements can be found here.

回复收藏 0 原文

~没有更多了~

关于作者

铁憨憨

暂无简介

文章

27 人气

关注发私信

友情链接

文江博客

快速检查集合是否是存储集合的超集

问题是

在某些上下文中，

到目前为止我所想出的

编辑：我想出了一个新想法

The problem

Some context

What I've come up with so far

EDIT: A new idea I've come up with

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（4）

关于作者

相关话题

热门标签

推荐作者

十二

飞烟轻若梦

OPleyuhuo

wxb0109

旧城空念

-小熊_

友情链接

快速检查集合是否是存储集合的超集

问题是

在某些上下文中，

到目前为止我所想出的

编辑：我想出了一个新想法

The problem

Some context

What I've come up with so far

EDIT: A new idea I've come up with

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（4）

关于作者

相关话题

热门标签

推荐作者

十二

飞烟轻若梦

OPleyuhuo

wxb0109

旧城空念

-小熊_

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。