Efficient search over a massive list of strings
The Problem: A large, static list of strings is provided, along with a pattern string comprised of data and wildcard elements (* and ?). The idea is to return all the strings that match the pattern - simple enough.
Current Solution: I'm currently using a linear approach of scanning the large list and globbing each entry against the pattern.
My Question: Are there any suitable data structures that I can store the large list into such that the search's complexity is less than O(n)?
Perhaps something akin to a suffix-trie? I've also considered using bi- and tri-grams in a hashtable, but the logic required to evaluate a match based on a merge of the list of words returned and the pattern is a nightmare; furthermore, I'm not convinced it's the correct approach.
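For a concrete baseline, the linear scan described above might look something like this in Python (a minimal sketch, assuming fnmatch supplies the * and ? globbing):

```python
from fnmatch import fnmatchcase

def linear_search(strings, pattern):
    """O(n) baseline: glob every entry in the list against the pattern."""
    return [s for s in strings if fnmatchcase(s, pattern)]

# Example: linear_search(['hello', 'help', 'world'], 'hel*') -> ['hello', 'help']
```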
Comments (6)
I agree that a suffix trie is a good idea to try, except that the sheer size of your dataset might make its construction use up just as much time as its usage would save. They're best if you've got to query them multiple times to amortize the construction cost. Perhaps a few hundred queries.
Also note that this is a good excuse for parallelism. Cut the list in two and give it to two different processors and have your job done twice as fast.
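A rough sketch of that split, assuming Python's multiprocessing and fnmatch for the per-entry globbing:

```python
from fnmatch import fnmatchcase
from functools import partial
from multiprocessing import Pool

def match_chunk(pattern, chunk):
    """Glob one slice of the list against the pattern."""
    return [s for s in chunk if fnmatchcase(s, pattern)]

def parallel_search(strings, pattern, workers=2):
    """Split the list into `workers` chunks and scan them in parallel."""
    size = max(1, (len(strings) + workers - 1) // workers)
    chunks = [strings[i:i + size] for i in range(0, len(strings), size)]
    with Pool(workers) as pool:
        parts = pool.map(partial(match_chunk, pattern), chunks)
    return [s for part in parts for s in part]

if __name__ == '__main__':   # needed when the 'spawn' start method is used (Windows/macOS)
    print(parallel_search(['hello', 'help', 'world'], 'hel*'))   # ['hello', 'help']
```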
You could build a regular trie and add wildcard edges. Then your complexity would be O(n), where n is the length of the pattern. You would have to replace runs of "**" with a single "*" in the pattern first (also an O(n) operation). If the list of words were "I am an ox", the trie would branch from the root on I, a and o, with the a branch splitting into am and an. And here is a sample python program:
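A minimal sketch along those lines, assuming a plain dictionary-based trie and a recursive matcher that branches on * and ?:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def match(node, pattern, prefix='', out=None):
    """Collect every stored word that matches the glob pattern."""
    if out is None:
        out = set()
    if not pattern:
        if node.is_word:
            out.add(prefix)
        return out
    head, rest = pattern[0], pattern[1:]
    if head == '*':
        match(node, rest, prefix, out)               # '*' matches nothing ...
        for ch, child in node.children.items():      # ... or one more char, then retry '*'
            match(child, pattern, prefix + ch, out)
    elif head == '?':
        for ch, child in node.children.items():      # '?' matches any single char
            match(child, rest, prefix + ch, out)
    else:
        child = node.children.get(head)              # literal char must be present
        if child is not None:
            match(child, rest, prefix + head, out)
    return out

trie = build_trie(['I', 'am', 'an', 'ox'])
print(sorted(match(trie, 'a?')))   # ['am', 'an']
print(sorted(match(trie, '*')))    # ['I', 'am', 'an', 'ox']
```

Note that a leading or bare "*" still fans out across the whole trie, so the worst case degrades toward the linear scan.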
If you don't care about memory and you can afford to pre-process the list, create a sorted array of every suffix, pointing to the original word, e.g., for ['hello', 'world'], store this:

    ('d', 'world')
    ('ello', 'hello')
    ('hello', 'hello')
    ('ld', 'world')
    ('llo', 'hello')
    ('lo', 'hello')
    ('o', 'hello')
    ('orld', 'world')
    ('rld', 'world')
    ('world', 'world')

Use this array to build sets of candidate matches using pieces of the pattern.
For instance, if the pattern is "*or*", find the candidate match ('orld', 'world') using a binary chop on the substring "or", then confirm the match using a normal globbing approach. If the wildcard is more complex, e.g. "h*o", build sets of candidates for "h" and "o" and find their intersection before the final linear glob.
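A sketch of that scheme, assuming Python's bisect for the binary chop and fnmatch for the final confirming glob:

```python
from bisect import bisect_left
from fnmatch import fnmatchcase

def build_suffix_array(words):
    """Sorted list of (suffix, original word) pairs."""
    entries = [(w[i:], w) for w in words for i in range(len(w))]
    entries.sort()
    return entries

def candidates(suffix_array, fragment):
    """Words containing `fragment`, found by binary chop on the suffixes."""
    lo = bisect_left(suffix_array, (fragment,))
    out = set()
    while lo < len(suffix_array) and suffix_array[lo][0].startswith(fragment):
        out.add(suffix_array[lo][1])
        lo += 1
    return out

def search(words, suffix_array, pattern):
    # Literal fragments between wildcards, e.g. 'h*o' -> ['h', 'o']
    fragments = [f for f in pattern.replace('?', '*').split('*') if f]
    if not fragments:
        return list(words)                      # pattern was all wildcards
    cand = candidates(suffix_array, fragments[0])
    for frag in fragments[1:]:
        cand &= candidates(suffix_array, frag)  # intersect candidate sets
    return [w for w in cand if fnmatchcase(w, pattern)]  # final linear glob

words = ['hello', 'world']
sa = build_suffix_array(words)
print(search(words, sa, '*or*'))   # ['world']
print(search(words, sa, 'h*o'))    # ['hello']
```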
You say you're currently doing linear search. Does this give you any data on the most frequently performed query patterns? e.g. is "blah*" much more common than "bl?h" (which I'd assume it is) among your current users? With that kind of prior knowledge you can focus your indexing efforts on the commonly used cases and get them down to O(1), rather than trying to solve the much more difficult, and yet much less worthwhile, problem of making every possible query equally fast.
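For instance, if "blah*"-style queries dominate, an index of word prefixes turns them into a single dictionary lookup, with everything else falling back to the linear glob (a sketch; the prefix-length cap is an arbitrary assumption):

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def build_prefix_index(words, max_prefix_len=8):
    """Map every prefix (up to a length cap) to the words that start with it."""
    index = defaultdict(set)
    for w in words:
        for i in range(1, min(len(w), max_prefix_len) + 1):
            index[w[:i]].add(w)
    return index

def search(words, index, pattern, max_prefix_len=8):
    # Fast path: patterns shaped like 'blah*' with a short literal prefix.
    if pattern.endswith('*') and '*' not in pattern[:-1] and '?' not in pattern:
        prefix = pattern[:-1]
        if 0 < len(prefix) <= max_prefix_len:
            return sorted(index.get(prefix, ()))
    # Everything else falls back to the existing linear scan.
    return [w for w in words if fnmatchcase(w, pattern)]
```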
You can achieve a simple speedup by keeping counts of the characters in your strings. A string with no "b"s or a single "b" can never match the query "abba*", so there is no point in testing it. This works much better on whole words, if your strings are made of those, since there are many more words than characters; plus, there are plenty of libraries that can build the indexes for you. On the other hand, it is very similar to the n-gram approach you mentioned.
If you do not use a library that does it for you, you can optimize queries by looking up the most globally infrequent characters (or words, or n-grams) first in your indexes. This allows you to discard more non-matching strings up front.
In general, all speedups will be based on the idea of discarding things that cannot possibly match. What and how much to index depends on your data. For example, if the typical pattern length is near to the string length, you can simply check to see if the string is long enough to hold the pattern.
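A sketch of that pre-filter, assuming collections.Counter: a string is only globbed if it contains at least as many of each literal pattern character as the pattern itself.

```python
from collections import Counter
from fnmatch import fnmatchcase

def build_count_index(strings):
    """Precompute a character count for every string."""
    return [(s, Counter(s)) for s in strings]

def search(count_index, pattern):
    required = Counter(c for c in pattern if c not in '*?')   # literal chars only
    results = []
    for s, counts in count_index:
        # Discard strings that lack enough of any required character.
        if all(counts[c] >= n for c, n in required.items()):
            if fnmatchcase(s, pattern):                       # final confirming glob
                results.append(s)
    return results

index = build_count_index(['abracadabra', 'banana', 'cab'])
print(search(index, 'abba*'))   # [] - only 'abracadabra' passes the filter; the glob rejects it
print(search(index, '*an*a'))   # ['banana']
```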
There are plenty of good algorithms for multi-string search. Google "Navarro string search" and you'll see a good analysis of multi-string options. A number of algorithms are extremely good for "normal" cases (search strings that are fairly long: Wu-Manber; search strings with characters that are modestly rare in the text to be searched: parallel Horspool). Aho-Corasick is an algorithm that guarantees a (tiny) bounded amount of work per input character, no matter how the input text is tuned to create the worst behaviour in the search. For programs like Snort, that's really important in the face of denial-of-service attacks. If you are interested in how a really efficient Aho-Corasick search can be implemented, take a look at ACISM - an Aho-Corasick Interleaved State Matrix.
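To make the bounded-work-per-character idea concrete, here is a basic textbook Aho-Corasick sketch in Python (nothing like the interleaved state matrix the link describes, just the plain goto/fail/output construction):

```python
from collections import deque

class AhoCorasick:
    """Basic Aho-Corasick automaton: bounded work per input character."""
    def __init__(self, patterns):
        self.goto = [{}]      # one transition dict per state
        self.fail = [0]       # failure links
        self.out = [[]]       # patterns ending at each state
        for pat in patterns:  # build the keyword trie
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].append(pat)
        # Breadth-first construction of failure links.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] += self.out[self.fail[nxt]]

    def find(self, text):
        """Yield (end_index, pattern) for every occurrence in text."""
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                yield i, pat

ac = AhoCorasick(['he', 'she', 'his', 'hers'])
print(list(ac.find('ushers')))   # [(3, 'she'), (3, 'he'), (5, 'hers')]
```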