搜索文本中多个字符串之一的有效算法?
我需要搜索传入的不是很长的文本片段以查找给定字符串的出现情况。字符串在整个会话中是恒定的,并且数量不多(~10)。额外的简化是没有任何字符串包含在任何其他字符串中。
我目前正在使用与 str1 | 匹配的 boost 正则表达式str2 | ...
。这项任务的表现很重要,所以我想知道我是否可以改进它。并不是说我可以比 boost 人员更好地编程,但也许专用的实现比通用的实现更有效。
由于字符串长时间保持不变,我可以预先构建一个数据结构,例如状态转换表。
例如,如果字符串是 abcx
、bcy
和 cz
,并且到目前为止我已阅读 abc
,我应该处于组合状态,这意味着您要么将 3 个字符放入字符串 1,要么将 2 个字符放入字符串 2,要么将 1 个字符放入字符串 1
。然后读取x
接下来会将我移动到string 1 matches
状态等,并且除xyz
之外的任何字符都将移动到初始状态,我不需要退回到b
。
任何想法或参考表示赞赏。
I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). Additional simplification is that none of the strings is contained in any other.
I am currently using boost regex matching with str1 | str2 | ...
. The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.
As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.
e.g., if the strings are abcx
, bcy
and cz
, and I've read so far abc
, I should be in a combined state that means you're either 3 chars into string 1, 2 chars into string 2 or 1 char into string 1
. Then reading x
next will move me to the string 1 matched
state etc., and any char other than xyz
will move to the initial state, and I will not need to retract back to b
.
Any ideas or references are appreciated.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
查看Aho–Corasick 字符串匹配算法!
Check out the Aho–Corasick string matching algorithm!
查看后缀树。
Take a look at Suffix Tree.
看看这个: http: //www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html
递归/非递归区别的存在是一个非常强烈的建议,BOOST不一定是线性时间离散有限状态机。因此,您很有可能可以更好地解决您的特定问题。
最佳答案在很大程度上取决于您有多少个干草堆以及针的最小尺寸。如果最小的针长于几个字符,您可能会比通用正则表达式库做得更好一点。
基本上所有字符串搜索都是通过测试当前位置(光标)是否匹配来进行的,如果没有找到,则将光标进一步向右滑动再次尝试。
Rabin-Karp 从您正在搜索的字符串(或多个字符串)构建 DFSM,以便将测试和光标运动组合在单个操作中。然而,Rabin-Karp 最初是为单针设计的,因此如果一个匹配可能是另一个匹配的正确前缀,则需要支持回溯。 (请记住,当您想重用代码时。)
另一种策略是如果可能的话,将光标向右滑动多个字符。博耶-摩尔就是这样做的。它通常是为单针而设计的。构建一个包含所有字符以及它们在针中出现的最右边位置(如果有的话)的表。现在,将光标定位在 len(needle)-1 处。该表条目将告诉您 (a) 可以找到针距光标的向左偏移量,或者 (b) 您可以将光标 len(needle) 进一步向右移动。
当您有多个针时,工作台的构造和使用会变得更加复杂,但它仍然可能会节省您一个数量级的探头。您可能仍然想要创建 DFSM,但您不调用常规搜索方法,而是调用 does_this_DFSM_match_at_this_offset()。
另一种策略是一次测试 8 位以上。有一个垃圾邮件杀手工具可以一次查看 32 位机器字。然后,它执行一些简单的哈希代码,将结果装入 12 位,然后查看表以查看是否有命中。每个模式有四个条目(从模式开始的偏移量为 0、1、2 和 3),这样,尽管表中有数千个模式,但它们只测试主题中每个 32 位字的一到两个线。
所以总的来说,是的,当针头恒定时,你可以比正则表达式更快。
Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html
The existence of a recursive/non-recursive distinction is a pretty strong suggestion that BOOST is not necessarily a linear-time discrete finite-state machine. Therefore, there's a good chance you can do better for your particular problem.
The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.
Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.
Rabin-Karp builds a DFSM out of the string (or strings) for which you are searching so that the test and the cursor motion are combined in a single operation. However, Rabin-Karp was originally designed for a single needle, so you would need to support backtracking if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)
Another tactic is to slide the cursor more than one character to the right if at all possible. Boyer-Moore does this. It's normally built for a single needle. Construct a table of all characters and the rightmost position that they appear in the needle (if at all). Now, position the cursor at len(needle)-1. The table entry will tell you (a) what leftward offset from the cursor that the needle might be found, or (b) that you can move the cursor len(needle) farther to the right.
When you have more than one needle, the construction and use of your table grows more complicated, but it still may possibly save you an order of magnitude on probes. You still might want to make a DFSM but instead of calling a general search method, you call does_this_DFSM_match_at_this_offset().
Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It then does some simple hash code to fit the result into 12 bits, and then looks in a table to see if there's a hit. You have four entries for each pattern (offsets of 0, 1, 2, and 3 from the start of the pattern) and then this way despite thousands of patterns in the table they only test one or two per 32-bit word in the subject line.
So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.
我一直在查看答案,但似乎没有一个答案非常明确......而且大部分都归结为几个链接。
这里让我感兴趣的是你的问题的独特性,到目前为止暴露的解决方案根本没有利用我们正在大海捞针中同时寻找几根针这一事实。
我肯定会看看 KMP / Boyer Moore,但我不会盲目地应用它们(至少如果你有时间的话),因为它们是为单针量身定制的,我非常相信我们可以利用这样一个事实:我们有多个字符串,并使用自定义状态机(或 BM 的自定义表)一次检查所有字符串。
当然,它不太可能改善大 O(Boyer Moore 每个字符串运行 3n,因此无论如何它都是线性的),但您可能会获得常数因子。
I've been looking at the answers but none seem quite explicit... and mostly boiled down to a couple of links.
What intrigues me here is the uniqueness of your problem, the solutions exposed so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.
I would take a look at KMP / Boyer Moore, for sure, but I would not apply them blindly (at least if you have some time on your hand), because they are tailored for a single needle, and I am pretty convinced we could capitalized on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).
Of course, it's unlikely to improve the big O (Boyer Moore runs in 3n for each string, so it'll be linear anyway), but you could probably gain on the constant factor.
正则表达式引擎初始化预计会有一些开销,
因此,如果不涉及真正的正则表达式,
C - memcmp() 应该没问题。
如果你能告诉文件大小并给出一些
具体用例,我们可以建立一个基准
(我认为这很有趣)。
有趣: memcmp 探索 和 时序差异
问候
rbo
Regex engine initialization is expected to have some overhead,
so if there are no real regular expressions involved,
C - memcmp() should do fine.
If you can tell the file sizes and give some
specific use cases, we could build a benchmark
(I consider this very interesting).
Interesting: memcmp explorations and timing differences
Regards
rbo
总有博耶摩尔
There is always Boyer Moore
除了 Rabin-Karp-Algorithm 和 Knuth-Morris-Pratt-Algorithm 之外,我的算法书建议使用 有限用于字符串匹配的状态机。对于每个搜索字符串,您需要构建这样一个有限状态机。
Beside Rabin-Karp-Algorithm and Knuth-Morris-Pratt-Algorithm, my Algorithm-Book suggests a Finite State Machine for String Matching. For every Search String you need to build such a Finite State Machine.
您可以使用非常流行的 Lex & 来做到这一点。 Yacc 工具,并支持 Flex 和 Bison 工具。
您可以使用 Lex 来获取字符串的标记。
将预定义的字符串与 Lexer 返回的标记进行比较。
找到匹配项后,执行所需的操作。
有很多网站描述了 Lex 和 Yacc。
其中一个网站是 http://epaperpress.com/lexandyacc/
You can do it with very popular Lex & Yacc tools, with the support of Flex and Bison tools.
You can use Lex for getting tokens of the string.
Compare your pre-defined strings with the tokens returned from Lexer.
When match is found, perform the desired action.
There are many sites which describe about Lex and Yacc.
One such site is http://epaperpress.com/lexandyacc/