Efficient search over a massive list of strings
The Problem: A large, static list of strings is provided, along with a pattern string comprised of data and wildcard elements (* and ?). The idea is to return all the strings that match the pattern - simple enough.
Current Solution: I'm currently using a linear approach of scanning the large list and globbing each entry against the pattern.
My Question: Are there any suitable data structures that I can store the large list into such that the search's complexity is less than O(n)?
Perhaps something akin to a suffix-trie? I've also considered using bi- and tri-grams in a hashtable, but the logic required to evaluate a match based on a merge of the list of words returned and the pattern is a nightmare; furthermore, I'm not convinced it's the correct approach.
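For a concrete baseline, the linear scan described above might look something like this in Python (a minimal sketch, assuming fnmatch supplies the * and ? globbing):

```python
from fnmatch import fnmatchcase

def linear_search(strings, pattern):
    """O(n) baseline: glob every entry in the list against the pattern."""
    return [s for s in strings if fnmatchcase(s, pattern)]

# Example: linear_search(['hello', 'help', 'world'], 'hel*') -> ['hello', 'help']
```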
Comments (6)
I agree that a suffix trie is a good idea to try, except that the sheer size of your dataset might make its construction use up just as much time as its usage would save. They're best if you've got to query them multiple times to amortize the construction cost. Perhaps a few hundred queries.
Also note that this is a good excuse for parallelism. Cut the list in two and give it to two different processors and have your job done twice as fast.
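A rough sketch of that split, assuming Python's multiprocessing and fnmatch for the per-entry globbing:

```python
from fnmatch import fnmatchcase
from functools import partial
from multiprocessing import Pool

def match_chunk(pattern, chunk):
    """Glob one slice of the list against the pattern."""
    return [s for s in chunk if fnmatchcase(s, pattern)]

def parallel_search(strings, pattern, workers=2):
    """Split the list into `workers` chunks and scan them in parallel."""
    size = max(1, (len(strings) + workers - 1) // workers)
    chunks = [strings[i:i + size] for i in range(0, len(strings), size)]
    with Pool(workers) as pool:
        parts = pool.map(partial(match_chunk, pattern), chunks)
    return [s for part in parts for s in part]

if __name__ == '__main__':   # needed when the 'spawn' start method is used (Windows/macOS)
    print(parallel_search(['hello', 'help', 'world'], 'hel*'))   # ['hello', 'help']
```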
You could build a regular trie and add wildcard edges. Then your complexity would be O(n), where n is the length of the pattern. You would have to replace runs of "**" with a single "*" in the pattern first (also an O(n) operation). If the list of words were "I am an ox", the trie would branch from the root on I, a and o, with the a branch splitting into am and an. And here is a sample python program:
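A minimal sketch along those lines, assuming a plain dictionary-based trie and a recursive matcher that branches on * and ?:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def match(node, pattern, prefix='', out=None):
    """Collect every stored word that matches the glob pattern."""
    if out is None:
        out = set()
    if not pattern:
        if node.is_word:
            out.add(prefix)
        return out
    head, rest = pattern[0], pattern[1:]
    if head == '*':
        match(node, rest, prefix, out)               # '*' matches nothing ...
        for ch, child in node.children.items():      # ... or one more char, then retry '*'
            match(child, pattern, prefix + ch, out)
    elif head == '?':
        for ch, child in node.children.items():      # '?' matches any single char
            match(child, rest, prefix + ch, out)
    else:
        child = node.children.get(head)              # literal char must be present
        if child is not None:
            match(child, rest, prefix + head, out)
    return out

trie = build_trie(['I', 'am', 'an', 'ox'])
print(sorted(match(trie, 'a?')))   # ['am', 'an']
print(sorted(match(trie, '*')))    # ['I', 'am', 'an', 'ox']
```

Note that a leading or bare "*" still fans out across the whole trie, so the worst case degrades toward the linear scan.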
If you don't care about memory and you can afford to pre-process the list, create a sorted array of every suffix, pointing to the original word, e.g., for ['hello', 'world'], store this:

    ('d', 'world')
    ('ello', 'hello')
    ('hello', 'hello')
    ('ld', 'world')
    ('llo', 'hello')
    ('lo', 'hello')
    ('o', 'hello')
    ('orld', 'world')
    ('rld', 'world')
    ('world', 'world')

Use this array to build sets of candidate matches using pieces of the pattern.
For instance, if the pattern is "*or*", find the candidate match ('orld', 'world') using a binary chop on the substring "or", then confirm the match using a normal globbing approach. If the wildcard is more complex, e.g. "h*o", build sets of candidates for "h" and "o" and find their intersection before the final linear glob.
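A sketch of that scheme, assuming Python's bisect for the binary chop and fnmatch for the final confirming glob:

```python
from bisect import bisect_left
from fnmatch import fnmatchcase

def build_suffix_array(words):
    """Sorted list of (suffix, original word) pairs."""
    entries = [(w[i:], w) for w in words for i in range(len(w))]
    entries.sort()
    return entries

def candidates(suffix_array, fragment):
    """Words containing `fragment`, found by binary chop on the suffixes."""
    lo = bisect_left(suffix_array, (fragment,))
    out = set()
    while lo < len(suffix_array) and suffix_array[lo][0].startswith(fragment):
        out.add(suffix_array[lo][1])
        lo += 1
    return out

def search(words, suffix_array, pattern):
    # Literal fragments between wildcards, e.g. 'h*o' -> ['h', 'o']
    fragments = [f for f in pattern.replace('?', '*').split('*') if f]
    if not fragments:
        return list(words)                      # pattern was all wildcards
    cand = candidates(suffix_array, fragments[0])
    for frag in fragments[1:]:
        cand &= candidates(suffix_array, frag)  # intersect candidate sets
    return [w for w in cand if fnmatchcase(w, pattern)]  # final linear glob

words = ['hello', 'world']
sa = build_suffix_array(words)
print(search(words, sa, '*or*'))   # ['world']
print(search(words, sa, 'h*o'))    # ['hello']
```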
You say you're currently doing linear search. Does this give you any data on the most frequently performed query patterns? e.g. is "blah*" much more common than "bl?h" (which I'd assume it is) among your current users? With that kind of prior knowledge you can focus your indexing efforts on the commonly used cases and get them down to O(1), rather than trying to solve the much more difficult, and yet much less worthwhile, problem of making every possible query equally fast.
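For instance, if "blah*"-style queries dominate, an index of word prefixes turns them into a single dictionary lookup, with everything else falling back to the linear glob (a sketch; the prefix-length cap is an arbitrary assumption):

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def build_prefix_index(words, max_prefix_len=8):
    """Map every prefix (up to a length cap) to the words that start with it."""
    index = defaultdict(set)
    for w in words:
        for i in range(1, min(len(w), max_prefix_len) + 1):
            index[w[:i]].add(w)
    return index

def search(words, index, pattern, max_prefix_len=8):
    # Fast path: patterns shaped like 'blah*' with a short literal prefix.
    if pattern.endswith('*') and '*' not in pattern[:-1] and '?' not in pattern:
        prefix = pattern[:-1]
        if 0 < len(prefix) <= max_prefix_len:
            return sorted(index.get(prefix, ()))
    # Everything else falls back to the existing linear scan.
    return [w for w in words if fnmatchcase(w, pattern)]
```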
You can achieve a simple speedup by keeping counts of the characters in your strings. A string with no "b"s or a single "b" can never match the query "abba*", so there is no point in testing it. This works much better on whole words, if your strings are made of those, since there are many more words than characters; plus, there are plenty of libraries that can build the indexes for you. On the other hand, it is very similar to the n-gram approach you mentioned.
If you do not use a library that does it for you, you can optimize queries by looking up the most globally infrequent characters (or words, or n-grams) first in your indexes. This allows you to discard more non-matching strings up front.
In general, all speedups will be based on the idea of discarding things that cannot possibly match. What and how much to index depends on your data. For example, if the typical pattern length is near to the string length, you can simply check to see if the string is long enough to hold the pattern.
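A sketch of that pre-filter, assuming collections.Counter: a string is only globbed if it contains at least as many of each literal pattern character as the pattern itself.

```python
from collections import Counter
from fnmatch import fnmatchcase

def build_count_index(strings):
    """Precompute a character count for every string."""
    return [(s, Counter(s)) for s in strings]

def search(count_index, pattern):
    required = Counter(c for c in pattern if c not in '*?')   # literal chars only
    results = []
    for s, counts in count_index:
        # Discard strings that lack enough of any required character.
        if all(counts[c] >= n for c, n in required.items()):
            if fnmatchcase(s, pattern):                       # final confirming glob
                results.append(s)
    return results

index = build_count_index(['abracadabra', 'banana', 'cab'])
print(search(index, 'abba*'))   # [] - only 'abracadabra' passes the filter; the glob rejects it
print(search(index, '*an*a'))   # ['banana']
```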
There are plenty of good algorithms for multi-string search. Google "Navarro string search" and you'll see a good analysis of multi-string options. A number of algorithms are extremely good for "normal" cases (search strings that are fairly long: Wu-Manber; search strings with characters that are modestly rare in the text to be searched: parallel Horspool). Aho-Corasick is an algorithm that guarantees a (tiny) bounded amount of work per input character, no matter how the input text is tuned to create the worst behaviour in the search. For programs like Snort, that's really important in the face of denial-of-service attacks. If you are interested in how a really efficient Aho-Corasick search can be implemented, take a look at ACISM - an Aho-Corasick Interleaved State Matrix.
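To make the bounded-work-per-character idea concrete, here is a basic textbook Aho-Corasick sketch in Python (nothing like the interleaved state matrix the link describes, just the plain goto/fail/output construction):

```python
from collections import deque

class AhoCorasick:
    """Basic Aho-Corasick automaton: bounded work per input character."""
    def __init__(self, patterns):
        self.goto = [{}]      # one transition dict per state
        self.fail = [0]       # failure links
        self.out = [[]]       # patterns ending at each state
        for pat in patterns:  # build the keyword trie
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].append(pat)
        # Breadth-first construction of failure links.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] += self.out[self.fail[nxt]]

    def find(self, text):
        """Yield (end_index, pattern) for every occurrence in text."""
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                yield i, pat

ac = AhoCorasick(['he', 'she', 'his', 'hers'])
print(list(ac.find('ushers')))   # [(3, 'she'), (3, 'he'), (5, 'hers')]
```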