Why is a regex so slow to finish on a 3000-line XML file?

Published 2024-07-17 22:32:08


I noticed that it is very slow for a regex to finish on an XML file with 3,000 lines [1]:

\(<Annotation\(\s*\w\+="[^"]\{-}"\s\{-}\)*>\)\@<=\(\(<\/Annotation\)\@!\_.\)\{-}"MATCH\_.\{-}\(<\/Annotation>\)\@=

I always thought that Regexes are efficient. Why does it take so long to finish the Regex?

[1] How can I repeatedly match from A until B in VIM?


Comments (4)

坏尐絯℡ 2024-07-24 22:32:08


Whether a regular expression is efficient depends on the expression itself. What makes one inefficient is backtracking, and to avoid it, the regular expression has to be as distinct as possible.

Let's take the regular expression a.*b as an example and apply it to the string abcdef. The algorithm first matches the literal a in a.*b to the a in abcdef. Next the expression .* is processed. In the normal greedy mode, where multipliers are expanded to the maximum, it matches the whole rest bcdef in abcdef. Then the last literal b in a.*b is processed. But as the end of the string has already been reached and a multiplier is in use, the algorithm tries backtracking to match the whole pattern. The match of .* (bcdef) is decreased by one character (bcde) and the algorithm tries to match the rest of the pattern. But the b in a.*b doesn't match the f in abcdef. So .* is decreased by one more character, and so on, until it matches the empty string (thus . is repeated zero times) and the b in a.*b matches the b in abcdef.

As you can see, a.*b applied to abcdef needs 6 backtracking steps for .* until the whole regular expression matches. But if we alter the regular expression and make it distinct by using a[^b]*b instead, no backtracking is necessary and the regular expression matches on the first attempt.

And if you now consider using lazy modifiers instead, I have to tell you that this rule applies to every modifier, both greedy and lazy. The difference is that instead of first expanding the match to the maximum and then backtracking by decreasing it one character at a time (greedy), a lazy modifier first expands to the minimum match and is then increased one character at a time.
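The same behaviour is easy to observe with Python's `re` engine, which also backtracks; this is a minimal sketch using the strings and patterns from the example above:

```python
import re

s = "abcdef"

# Greedy: .* first grabs "bcdef", then backs off one character at a time
# until the trailing b can match -- here all the way back to the empty
# string, so the final match is just "ab".
assert re.match(r"a.*b", s).group(0) == "ab"

# Distinct version: [^b]* can never run past the first b, so the engine
# finds the same match with no backtracking.
assert re.match(r"a[^b]*b", s).group(0) == "ab"

# Lazy: .*? starts from the minimum (empty) match and grows instead.
assert re.match(r"a.*?b", s).group(0) == "ab"
```

All three patterns find the same match; what differs is how much work the engine does to get there.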

把梦留给海 2024-07-24 22:32:08


REs are efficient, but they're not magic :-)

It is not the number of lines in an input file which slows you down (3,000 is a very small number). When an RE is compiled, it basically has to (conceptually) write a program which can parse the input file based on that RE.

The speed of that "program" always depends on the complexity of the RE.

For example, "^$" will be very fast, "[0-9]{1,9}" will be somewhat slower and anything that involves backtracking (i.e., having to back up in the RE, usually anything involving multiple variable-number-of-elements clauses, of which yours is an example) will be slower still.

Anything you can do to minimize the number of lines beforehand will help to some extent but as to optimizing the RE itself, that's often considered a black art. One possibility is to first strip out the lines other than those between lines where the Annotation stops and starts.

I don't tend to worry too much about optimizing my REs (but they're not usually this complex). My view is that they will take as long as they take. If that's too long, I usually look for another solution which is faster but not so adaptable.

In the case of your RE where you wanted to get all Annotation XML where the about attribute contains MATCH, I would do it in Perl (or awk for us old-timers :-) since the input file was reasonably fixed format:

  • "<Annotation " on first line [a].
  • "MATCH" also on first line [a].
  • </Annotation> on last line and on its own [b].

This would be fast as a simple line scanner, turning on echo when the [a] conditions were met (and printing that line), printing any other line when echo was on, and turning echo off when [b] conditions were met (after printing line).

Yes, far less adaptable but almost certainly faster (given your well-formatted input).
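The echo-toggling line scanner described above could be sketched like this (in Python rather than Perl/awk; the function name and the sample input are invented for illustration, and the [a]/[b] conditions assume the fixed format listed above):

```python
def extract_annotations(lines):
    """Yield Annotation blocks whose opening line mentions MATCH."""
    echo = False
    block = []
    for line in lines:
        # [a]: "<Annotation " and "MATCH" together on the first line.
        if not echo and "<Annotation " in line and "MATCH" in line:
            echo = True
        if echo:
            block.append(line)
            # [b]: "</Annotation>" on a line of its own ends the block.
            if line.strip() == "</Annotation>":
                echo = False
                yield block
                block = []

sample = [
    '<Annotation about="r9 MATCH">',
    "  <Body>wanted</Body>",
    "</Annotation>",
    '<Annotation about="other">',
    "  <Body>unwanted</Body>",
    "</Annotation>",
]
matches = list(extract_annotations(sample))
```

A single pass over the input, each line inspected once, no backtracking anywhere.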

硪扪都還晓 2024-07-24 22:32:08


I'm finding it very hard to read your regex with all the backslashes; if I'm interpreting this dialect of regex (vim?) correctly, what you're saying is what verbose pcre would spell:

(?<= <Annotation(\s*\w+="[^"]*?"\s*?)*> )
    ((?! <\/Annotation).)*?
    "MATCH.*?
(?= <\/Annotation> )

An obvious problem that jumps out is that it starts with a variable-length lookbehind assertion. Variable-length lookbehind is troublesome enough as it is (and not supported in many regex dialects), but then it's compounded by being followed by a liberal match (any character with negative lookahead).

This is hard for a processor to match; the combination pretty much kills any possible optimisation it might have been able to achieve. For every character in the file it has to backtrack (potentially to the beginning of file/end of previous match) to see if the lookbehind matches, then step forward (potentially to the end of file) looking for "MATCH.

You could help it by using only a fixed-size lookbehind, or simply forgetting about the lookarounds and including the whole string in the match, which will be more efficient to find, at the expense of needing a bit more post-processing to pick out the groups you're interested in.

<Annotation(\s+(\w+="[^"]*?"))*\s*>
    (.*? "MATCH .*?)
</Annotation>

However, as always with this kind of question, Clippy sez:

It looks like you're parsing XML with regex.

Would you like help?

( ) Use a real XML parser because regex is inherently incapable of parsing XML

( ) Just soldier on making unreadable complex regexes which still don't work

[ ] Don't show me this tip again
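For completeness, the Clippy-approved option with Python's standard-library parser (the element and attribute names are taken from the question's pattern; the sample document is invented):

```python
import xml.etree.ElementTree as ET

doc = """<Annotations>
  <Annotation about="urn:x MATCH"><Body>wanted</Body></Annotation>
  <Annotation about="urn:y"><Body>unwanted</Body></Annotation>
</Annotations>"""

# Parse once, then filter on the attribute -- no backtracking, no lookarounds.
root = ET.fromstring(doc)
hits = [el for el in root.iter("Annotation")
        if "MATCH" in el.get("about", "")]
```

The filter condition is a plain substring test, so the cost is linear in the document size regardless of how the annotations are formatted.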

哭泣的笑容 2024-07-24 22:32:08


"Regexes are efficient" is too wide a characterization to be accurate. Even if you ignore the differences in relative performance offered by different regex engines, how efficient your regex is still depends upon the complexity of the pattern.

A regex is efficient when it fails faster, rather than succeeds to match a portion of text faster. IOW, it needs to be specific enough that text which isn't going to match is rejected early on in the matching process rather than reaching the end. This also means that backtracking needs to be minimized. Note that backtracking can reach catastrophic levels, which would result in even the best regex engine freezing up. In the example provided by Jan Goyvaerts, the size of the file isn't even mentioned, and it is not relevant. It is the size of the string which matters, however.
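Catastrophic backtracking is easy to provoke; this sketch uses nested quantifiers, a textbook trigger (not the question's pattern), to make the cost of failing slowly visible:

```python
import re
import time

evil = re.compile(r"(a+)+$")   # ambiguous nesting: exponentially many ways to split the a's
plain = re.compile(r"a+$")     # matches the same strings, but only one way

subject = "a" * 20 + "b"       # can never match: both patterns must fail

start = time.perf_counter()
assert evil.match(subject) is None   # fails, but only after ~2^20 attempts
evil_time = time.perf_counter() - start

start = time.perf_counter()
assert plain.match(subject) is None  # fails almost immediately
plain_time = time.perf_counter() - start
```

Both patterns reject the string, but on a backtracking engine the nested version takes orders of magnitude longer to reach "no match", and each extra `a` roughly doubles that time.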

Here's a discussion thread I started on RegexAdvice.com a few years ago that attempted to get people discussing about performance of Regular expressions. It does have a few pointers to learn from, I guess.
