如何处理来自用户提交的正则表达式的无休止的匹配

发布于 2024-07-18 09:06:50 字数 555 浏览 8 评论 0原文

让我们考虑 C# 中的以下两行（使用框架 .NET 3.5）

Regex regex = new Regex(@"^((E|e)t )?(M|m)oi (?<NewName>[A-Za-z]\.?\w*((\-|\s)?[A-Za-z]?\w{1,})+)$", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Match m = regex.Match("moi aussi jaimerai etre un ordinateur pour pas m'énnerver ");

（抱歉，这是一个法国程序:)）

执行它们时，进程会卡在 Match() 方法中并且永远不会退出。我猜正则表达式模式中的空格存在一些问题，但我想做的不是更改模式（实际上它是由我的工具的最终用户在程序外部设置的），而是能够停止该过程（例如超时）。

有人知道这是否是 .NET 正则表达式的众所周知的问题，以及是否有一种简单的方法可以解决它，或者我是否必须对这些行进行线程化并在需要时中止它们（我绝对不想这样做））。

原文

Let's consider the two following lines in C# (using framework .NET 3.5)

Regex regex = new Regex(@"^((E|e)t )?(M|m)oi (?<NewName>[A-Za-z]\.?\w*((\-|\s)?[A-Za-z]?\w{1,})+)$", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Match m = regex.Match("moi aussi jaimerai etre un ordinateur pour pas m'énnerver ");

(sorry it's a french program :))

When they are executed, the process gets stuck in the Match() method and never exits. I guess there is some problem with the white space in the regex pattern but what I would like to do is not changing the pattern (actually it is set outside the program, by the end users of my tool) but being able to stop the process (with a timeout for instance).

Does someone know if this is well-known problem with the .NET Regular Expression and if there is an easy way to work around it or do I have to thread these lines and abort them if needed (definitely I wouldn't like to do that).

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

孤独陪着我 2024-07-25 09:06:50

如果我在 Regexbuddy 中输入表达式，它会显示以下消息

比赛尝试提前中止
因为正则表达式太
复杂的。您计划使用的正则表达式引擎
使用它可能无法处理
它根本就崩溃了。抬头
中的“灾难性回溯”
帮助文件以了解如何避免这种情况
情况。

查找灾难性回溯给出了以下解释

失控的正则表达式：灾难性的回溯
考虑正则表达式 (x+x+)+y。
在你惊恐地尖叫并说之前
这个人为的例子应该是
写为 xx+y 以完全匹配
没有那些可怕的嵌套也一样
量词：假设每个“x”
代表更复杂的东西，
与某些字符串匹配
都是“x”。请参阅有关 HTML 的部分
下面的文件是一个真实的例子。
让我们看看您申请后会发生什么
这个正则表达式为 xxxxxxxxxy。首先
x+ 将匹配所有 10 个 x 字符。这
第二个 x+ 失败。第一个 x+ 然后
回溯到 9 场比赛，并且
第二个选取剩余的 x。
该组现已匹配一次。这
小组重复，但第一次失败
x+。由于重复了一次
足够了，小组赛。 y
匹配 y 并且总体匹配是
成立。正则表达式已声明
功能，代码被发送到
顾客，他的电脑爆炸了。
差不多了。
当 y 时，上面的正则表达式变得丑陋
主题字符串中缺失。
当 y 失败时，正则表达式引擎
回溯。集团有一个
它可以回溯到迭代。这
第二个 x+ 只匹配一个 x，所以它
无法走回头路。但第一个 x+ 可以
放弃一个x。第二个 x+ 立即
匹配 xx。团里又来了一个
迭代，下一次失败，并且
y 失败了。再次回溯，
第二个 x+ 现在有一个回溯
位置，减少自身以匹配 x。
该小组尝试进行第二次迭代。
第一个 x+ 匹配，但第二个是
卡在绳子的末端。
再次回溯，第一个x+
该组的第一次迭代减少
本身为 7 个字符。第二个x+
匹配 xxx。如果 y 失败，则第二个 x+
减少到 xx，然后减少到 x。现在
组可以匹配第二次迭代，
每个 x+ 对应一个 x。但是这个
(7,1),(1,1) 组合也失败。所以
它转到 (6,4) 然后 (6,2)(1,1)
然后 (6,1),(2,1) 然后
(6,1),(1,2) 然后我想你开始
获得漂移。
如果您在 10x 字符串上尝试此正则表达式
在 RegexBuddy 的调试器中，需要
2558步计算出最终的y
不见了。对于 11x 弦，它
需要 5118 步。对于12，需要
10238 步。显然我们有一个
这里的指数复杂度为 O(2^n)。
在 21x 时，调试器在 2.8 时退出
万步，诊断不良案例
灾难性的回溯。
RegexBuddy 很宽容，因为它
检测到它在转圈，并且
中止匹配尝试。其他正则表达式
引擎（如.NET）将继续运行
永远，而其他人会崩溃
堆栈溢出（像 Perl 一样，之前
版本 5.10）。堆栈溢出是
在 Windows 上尤其令人讨厌，因为
他们倾向于让你申请
消失得无影无踪，无任何解释。
如果您运行网络，请务必小心
允许用户提供的服务
他们自己的正则表达式。人们
缺乏正则表达式经验
令人惊讶的想出技巧
指数复数正则
表达式。

我假设你必须用代码来处理它。我建议您联系 Regexbuddy 的作者并询问检测这种情况需要什么。

If I enter the expression in Regexbuddy, it presents following message

The match attempt was aborted early
because the regular expression is too
complex. The regex engine you plan to
use it with may not be able to handle
it at all and crash. Look up
"catastrophic backtracking" in the
help file to learn how to avoid this
situation.

Looking up catastrophic backtracking gives the following explanation

Runaway Regular Expressions: Catastrophic Backtracking
Consider the regular expression (x+x+)+y.
Before you scream in horror and say
this contrived example should be
written as xx+y to match exactly the
same without those terribly nested
quantifiers: just assume that each "x"
represents something more complex,
with certain strings being matched by
both "x". See the section on HTML
files below for a real example.
Let's see what happens when you apply
this regex to xxxxxxxxxxy. The first
x+ will match all 10 x characters. The
second x+ fails. The first x+ then
backtracks to 9 matches, and the
second one picks up the remaining x.
The group has now matched once. The
group repeats, but fails at the first
x+. Since one repetition was
sufficient, the group matches. y
matches y and an overall match is
found. The regex is declared
functional, the code is shipped to the
customer, and his computer explodes.
Almost.
The above regex turns ugly when the y
is missing from the subject string.
When y fails, the regex engine
backtracks. The group has one
iteration it can backtrack into. The
second x+ matched only one x, so it
can't backtrack. But the first x+ can
give up one x. The second x+ promptly
matches xx. The group again has one
iteration, fails the next one, and the
y fails. Backtracking again, the
second x+ now has one backtracking
position, reducing itself to match x.
The group tries a second iteration.
The first x+ matches but the second is
stuck at the end of the string.
Backtracking again, the first x+ in
the group's first iteration reduces
itself to 7 characters. The second x+
matches xxx. Failing y, the second x+
is reduced to xx and then x. Now, the
group can match a second iteration,
with one x for each x+. But this
(7,1),(1,1) combination fails too. So
it goes to (6,4) and then (6,2)(1,1)
and then (6,1),(2,1) and then
(6,1),(1,2) and then I think you start
to get the drift.
If you try this regex on a 10x string
in RegexBuddy's debugger, it'll take
2558 steps to figure out the final y
is missing. For an 11x string, it
needs 5118 steps. For 12, it takes
10238 steps. Clearly we have an
exponential complexity of O(2^n) here.
At 21x the debugger bows out at 2.8
million steps, diagnosing a bad case
of catastrophic backtracking.
RegexBuddy is forgiving in that it
detects it's going in circles, and
aborts the match attempt. Other regex
engines (like .NET) will keep going
forever, while others will crash with
a stack overflow (like Perl, before
version 5.10). Stack overflows are
particularly nasty on Windows, since
they tend to make your application
vanish without a trace or explanation.
Be very careful if you run a web
service that allows users to supply
their own regular expressions. People
with little regex experience have
surprising skill at coming up with
exponentially complex regular
expressions.