我们研究的另一件事是修补 re 模块,以便在回溯次数过多时引发异常。这是可能的,但需要更改 Python C 源代码并重新编译,因此不可移植。我们还提交了一个补丁,以在与 python 字符串匹配时释放 GIL,但我认为它没有被核心接受(python 仅保存 GIL,因为正则表达式可以针对可变缓冲区运行)。
I have worked on a program that allows users to enter their own regex and you are right - they can (and do) enter regex that can take a long time to finish - sometimes longer than than the lifetime of the universe. What is worse, while processing a regex Python holds the GIL, so it will not only hang the thread that is running the regex, but the entire program.
Limiting the length of the regex will not work, since the problem is backtracking. For example, matching the regex r"(\S+)+x" on a string of length N that does not contain an "x" will backtrack 2**N times. On my system this takes about a second to match against "a"*21 and the time doubles for each additional character, so a string of 100 characters would take approximately 19167393131891000 years to complete (this is an estimate, I have not timed it).
For more information read the O'Reilly book "Mastering Regular Expressions" - this has a couple of chapters on performance.
edit To get round this we wrote a regex analysing function that tried to catch and reject some of the more obvious degenerate cases, but it is impossible to get all of them.
Another thing we looked at was patching the re module to raise an exception if it backtracks too many times. This is possible, but requires changing the Python C source and recompiling, so is not portable. We also submitted a patch to release the GIL when matching against python strings, but I don't think it was accepted into the core (python only holds the GIL because regex can be run against mutable buffers).
对于临时用户来说,为他们提供子集语言要简单得多。例如,shell 的通配规则位于 fnmatch 中。 SQL LIKE 条件规则是另一个示例。
将用户的语言翻译成正确的正则表达式以便在运行时执行。
It's much simpler for casual users to give them a subset language. The shell's globbing rules in fnmatch, for example. The SQL LIKE condition rules are another example.
Translate the user's language into a proper regex for execution at runtime.
Compiling the regular expression should be reasonably safe. Although what it compiles into is not strictly an NFA (backreferences mean it's not quite as clean) it should still be sort of straightforward to compile into.
Now as to performance characteristics, this is another problem entirely. Even a small regular expression can have exponential time characteristics because of backtracking. It might be better to define a certain subset of features and only support very limited expressions that you translate yourself.
If you really want to support general regular expressions you either have to trust your users (sometimes an option) or limit the amount of space and time used. I believe that space used is determined only by the length of the regular expression.
edit: As Dave notes, apparently the global interpreter lock is held during regex matching, which would make setting that timeout harder. If that is the case, your only option to set a timeout is to run the match in a separate process. While not exactly ideal it is doable. I completely forgot about multiprocessing. Point of interest is this section on sharing objects. If you really need the hard constraints, separate processes are the way to go here.
It's not necessary to use compile() except when you need to reuse a lot of different regular expressions. The module already caches the last expressions.
The point 2 (at execution) could be a very difficult one if you allow the user to input any regular expression. You can make a complex regexp with few characters, like the famous (x+x+)+y one. I think it's a problem yet to be resolved in a general way. A workaround could be launching a different thread and monitor it, if it exceeds the allowed time, kill the thread and return with an error.
I really don't think it is possible to execute code simply by passing it into an re.compile. The way I understand it, re.compile (or any regex system in any language) converts the regex string into a finite automaton (DFA or NFA), and despite the ominous name 'compile' it has nothing to do with the execution of any code.
You technically don't need to use re.compile() to perform a regular expression operation on a string. In fact, the compile method can often be slower if you're only executing the operation once since there's overhead associated with the initial compiling.
If you're worried about the word "compile" then avoid it all together and simply pass the raw expression to match, search, etc. You may wind up improving the performance of your code slightly anyways.
发布评论
评论(6)
我开发了一个程序,允许用户输入自己的正则表达式,你是对的 - 他们可以(并且确实)输入可能需要很长时间才能完成的正则表达式 - 有时比宇宙的生命周期还要长。更糟糕的是,在处理正则表达式时,Python 持有 GIL,因此它不仅会挂起运行正则表达式的线程,还会挂起整个程序。
限制正则表达式的长度是行不通的,因为问题是回溯。例如,在长度为 N 且不包含“x”的字符串上匹配正则表达式
r"(\S+)+x"
将回溯 2**N 次。在我的系统上,这大约需要一秒钟来匹配"a"*21
,并且每增加一个字符,时间就会加倍,因此 100 个字符的字符串将需要大约 19167393131891000 年才能完成(这是估计值) ,我没有计时)。有关更多信息,请阅读 O'Reilly 的书“掌握正则表达式”——其中有几个关于性能的章节。
编辑
为了解决这个问题,我们编写了一个正则表达式分析函数,试图捕获并拒绝一些更明显的退化情况,但不可能获得所有这些情况。
我们研究的另一件事是修补 re 模块,以便在回溯次数过多时引发异常。这是可能的,但需要更改 Python C 源代码并重新编译,因此不可移植。我们还提交了一个补丁,以在与 python 字符串匹配时释放 GIL,但我认为它没有被核心接受(python 仅保存 GIL,因为正则表达式可以针对可变缓冲区运行)。
I have worked on a program that allows users to enter their own regex and you are right - they can (and do) enter regex that can take a long time to finish - sometimes longer than than the lifetime of the universe. What is worse, while processing a regex Python holds the GIL, so it will not only hang the thread that is running the regex, but the entire program.
Limiting the length of the regex will not work, since the problem is backtracking. For example, matching the regex
r"(\S+)+x"
on a string of length N that does not contain an "x" will backtrack 2**N times. On my system this takes about a second to match against"a"*21
and the time doubles for each additional character, so a string of 100 characters would take approximately 19167393131891000 years to complete (this is an estimate, I have not timed it).For more information read the O'Reilly book "Mastering Regular Expressions" - this has a couple of chapters on performance.
edit
To get round this we wrote a regex analysing function that tried to catch and reject some of the more obvious degenerate cases, but it is impossible to get all of them.
Another thing we looked at was patching the re module to raise an exception if it backtracks too many times. This is possible, but requires changing the Python C source and recompiling, so is not portable. We also submitted a patch to release the GIL when matching against python strings, but I don't think it was accepted into the core (python only holds the GIL because regex can be run against mutable buffers).
对于临时用户来说,为他们提供子集语言要简单得多。例如,shell 的通配规则位于 fnmatch 中。 SQL LIKE 条件规则是另一个示例。
将用户的语言翻译成正确的正则表达式以便在运行时执行。
It's much simpler for casual users to give them a subset language. The shell's globbing rules in fnmatch, for example. The SQL LIKE condition rules are another example.
Translate the user's language into a proper regex for execution at runtime.
编译正则表达式应该相当安全。虽然它编译成的内容并不是严格意义上的 NFA(反向引用意味着它不太干净),但它仍然应该很容易编译成。
至于性能特征,这完全是另一个问题。由于回溯,即使很小的正则表达式也可能具有指数时间特征。最好定义功能的某个子集,并且仅支持您自己翻译的非常有限的表达式。
如果您确实想支持通用正则表达式,您要么必须信任您的用户(有时是一个选项),要么限制使用的空间和时间量。我相信所使用的空间仅由正则表达式的长度决定。
编辑:正如戴夫所指出的,显然全局解释器锁是在正则表达式匹配期间保持的,这将使设置超时变得更加困难。如果是这种情况,设置超时的唯一选择是在单独的进程中运行匹配。虽然不完全理想,但它是可行的。我完全忘记了
multiprocessing
。感兴趣的地方是关于共享的本节对象。如果您确实需要硬性约束,那么可以采用单独的流程。Compiling the regular expression should be reasonably safe. Although what it compiles into is not strictly an NFA (backreferences mean it's not quite as clean) it should still be sort of straightforward to compile into.
Now as to performance characteristics, this is another problem entirely. Even a small regular expression can have exponential time characteristics because of backtracking. It might be better to define a certain subset of features and only support very limited expressions that you translate yourself.
If you really want to support general regular expressions you either have to trust your users (sometimes an option) or limit the amount of space and time used. I believe that space used is determined only by the length of the regular expression.
edit: As Dave notes, apparently the global interpreter lock is held during regex matching, which would make setting that timeout harder. If that is the case, your only option to set a timeout is to run the match in a separate process. While not exactly ideal it is doable. I completely forgot about
multiprocessing
. Point of interest is this section on sharing objects. If you really need the hard constraints, separate processes are the way to go here.除非您需要重用许多不同的正则表达式,否则没有必要使用compile()。该模块已经缓存了最后的表达式。
如果允许用户输入任何正则表达式,第 2 点(执行时)可能会非常困难。您可以使用很少的字符创建一个复杂的正则表达式,例如著名的
(x+x+)+y
。我认为这是一个尚未得到普遍解决的问题。解决方法可以是启动不同的线程并监视它,如果超过允许的时间,则终止该线程并返回错误。
It's not necessary to use compile() except when you need to reuse a lot of different regular expressions. The module already caches the last expressions.
The point 2 (at execution) could be a very difficult one if you allow the user to input any regular expression. You can make a complex regexp with few characters, like the famous
(x+x+)+y
one. I think it's a problem yet to be resolved in a general way.A workaround could be launching a different thread and monitor it, if it exceeds the allowed time, kill the thread and return with an error.
我真的不认为仅仅通过将代码传递到 re.compile 中就可以执行代码。据我理解, re.compile (或任何语言的任何正则表达式系统)将正则表达式字符串转换为 有限自动机(DFA 或 NFA),尽管有一个不祥的名字“编译”,但它与任何代码的执行无关。
I really don't think it is possible to execute code simply by passing it into an re.compile. The way I understand it, re.compile (or any regex system in any language) converts the regex string into a finite automaton (DFA or NFA), and despite the ominous name 'compile' it has nothing to do with the execution of any code.
从技术上讲,您不需要使用
re.compile()
对字符串执行正则表达式操作。事实上,如果您只执行一次操作,则编译方法通常会更慢,因为存在与初始编译相关的开销。如果您担心“编译”这个词,那么就避免使用它,只需将原始表达式传递给
match
、search
等。您最终可能会提高性能无论如何,你的代码稍微有点。You technically don't need to use
re.compile()
to perform a regular expression operation on a string. In fact, the compile method can often be slower if you're only executing the operation once since there's overhead associated with the initial compiling.If you're worried about the word "compile" then avoid it all together and simply pass the raw expression to
match
,search
, etc. You may wind up improving the performance of your code slightly anyways.