子字符串与正则表达式匹配更快?

发布于 2024-09-10 16:45:42 字数 224 浏览 3 评论 0原文

在阅读了 RE/NFA 和 DFA 后,似乎使用 RE 查找字符串中的子字符串实际上可能比暴力 O(mn) 查找更快。我的理由是,DFA 实际上会维护状态并避免多次处理“干草堆”中的每个字符。因此,如果使用正则表达式进行长字符串搜索,实际上可能会快得多。

当然,这仅对从 NFA 转换为 DFA 的 RE 匹配器有效。

在现实生活中,是否有人在使用 RE 而不是暴力匹配器时体验到更好的字符串匹配性能?

After having read up on RE/NFA and DFA, it seems that finding a substring within a string might actually be asymptotically faster using an RE rather than a brute force O(mn) find. My reasoning is that a DFA would actually maintain state and avoid processing each character in the "haystack" more than once. Hence, searches in long strings may actually be much faster if done with regular expressions.

Of course, this is valid only for RE matchers that convert from NFA to DFA.

Has anyone experienced better string match performance in real life when using RE rather than a brute force matcher?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

嘴硬脾气大 2024-09-17 16:45:42

首先,我建议您阅读有关几种语言的正则表达式内部原理的文章: Regular表达式匹配可以简单而快速

由于许多语言中的正则表达式不仅用于匹配,还提供了组捕获和反向引用的可能性,因此在执行从给定正则表达式构建的 NFA 时,几乎所有实现都使用所谓的“回溯”。而且这个实现具有指数时间复杂度(在最坏的情况下)。

可以通过 DFA(使用组捕获)实现 RE,但它会产生开销(请参阅 Laurikari 的论文 带有标记转换的 NFA、它们到确定性自动机的转换以及在正则表达式中的应用)。

对于简单的子字符串搜索,您可以使用 Knuth-Morris-Pratt 算法,构建DFA来搜索子串,并且具有最优的O(len(s))复杂度。但它也有开销,如果您在现实世界的单词和短语(不那么重复)上针对此最佳算法测试朴素方法(O(nm)),您可能会发现朴素方法平均而言更好。

对于精确的子字符串搜索,您还可以尝试 Boyer–Moore 算法,该算法最坏情况的复杂度为 O(mn),但在实际数据上的平均效果比 KMP 更好。

First of all, I would recommend you read the article about internals of regular expressions in several languages: Regular Expression Matching Can Be Simple And Fast.

Because regexps in many languages are not just for matching, but also provide possibility of group-capturing and back-referencing, almost all implementations use so called "backtracking" when execute NFA built from the given regexp. And this implementation has exponential time complexity (in worst case).

There could be RE implementation through the DFA (with group capturing), but it has an overhead (see Laurikari's paper NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions).

For simple substring searching you could use Knuth-Morris-Pratt algorithm, which build DFA to search substring, and it has optimal O(len(s)) complexity. But it hase overhead also, and if you test naive approach (O(nm)) against this optimal algorithm on real-world words and phrases (which are not so repetitive), you could find that naive approach is better in average.

For exact substring searching you could also try Boyer–Moore algo, which has O(mn) worst-case complexity, but work better than KMP in average on real-world data.

节枝 2024-09-17 16:45:42

实践中使用的大多数正则表达式是PCRE(Perl兼容正则表达式),它比正则语言更广泛,因此不能用正则语法来表达。 PCRE 具有诸如正向/负向先行/后向断言甚至递归之类的功能,因此解析可能需要多次处理某些字符。当然,这一切都取决于特定的 RE 实现:如果表达式保持在正则语法的范围内,它是否会被优化。

就我个人而言,我还没有对两者进行任何类型的性能比较。然而,根据我的经验,我从来没有遇到过暴力查找和替换的性能问题,而我不得不不止一次地处理 RE 性能瓶颈。

Most regular expressions used in practice are PCRE (Perl-Compatible Regular Expressions), which are wider than regular language and thus cannot be expressed with a regular grammar. PCRE has things like positive/negative lookahead/lookbehind assertions and even recursion, so parsing may require processing some characters more than once. Surely, it all comes down to particular RE implementation: whether it is optimized if the expressions stays within bounds of regular grammar or not.

Personally, I haven't done any sort of performance comparisons between the two. However, in my experience I never ever had performance issues with brute force find-and-replace, while I had to deal with RE performance bottlenecks on more than one occasion.

淑女气质 2024-09-17 16:45:42

如果您查看大多数语言的文档,它会提到,如果您不需要正则表达式的功能,则出于性能原因,您应该使用非正则表达式版本...示例: http://www.php.net/manual/en/function.preg-split.php 声明:“如果您不需要正则表达式的强大功能,您可以选择更快(尽管更简单)的替代方案,例如explode()或str_split()。”

这是一种到处都存在的权衡。也就是说,解决方案越灵活、功能越丰富,其性能就越差。

If you look at documentation for most languages it will mention that if you dont need to power of regex you should use the non-regex version for performance reasons... Example: http://www.php.net/manual/en/function.preg-split.php states: "If you don't need the power of regular expressions, you can choose faster (albeit simpler) alternatives like explode() or str_split()."

This is a trade off that exists everywhere. That is the more flexible and feature rich a solution is the poorer its performance.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文