Python: how to interrupt a regex match
I iterate over the lines in a large number of downloaded text files and do a regex match on each line. Usually, the match takes less than a second. At times, however, a match takes several minutes; sometimes the match does not finish at all and the code just hangs (I waited an hour a couple of times, then gave up). Therefore, I need to introduce some kind of timeout and somehow tell the regex matching code to stop after 10 seconds or so. I can live with the fact that I will lose the data the regex was supposed to return.
I tried the following (which, admittedly, shows two different thread-based solutions in one code sample):
from threading import Thread, Timer

def timeout_handler():
    print 'timeout_handler called'

if __name__ == '__main__':
    timer_thread = Timer(8.0, timeout_handler)
    # args must be a tuple - note the trailing comma
    parse_thread = Thread(target=parse_data_files, args=(my_args,))
    timer_thread.start()
    parse_thread.start()
    parse_thread.join(12.0)
    print 'do we ever get here ?'
but I get neither the timeout_handler called line nor the do we ever get here ? line in the output; the code is just stuck in parse_data_files.
Even worse, I can't even stop the program with CTRL-C; instead, I need to look up the Python process number and kill that process. Some research showed that the Python developers are aware of runaway regex C code: http://bugs.python.org/issue846388
I did achieve some success using signals:
from signal import signal, alarm, SIGALRM

signal(SIGALRM, timeout_handler)
alarm(8)   # deliver SIGALRM after 8 seconds
data_sets = parse_data_files(config(), data_provider)
alarm(0)   # cancel the alarm once parsing is done
This gets me the timeout_handler called line in the output, and I can still stop my script using CTRL-C. If I now modify the timeout_handler like this:
class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException()
and enclose the actual call to re.match(...) in a try ... except TimeoutException clause, the regex match actually does get interrupted. Unfortunately, this only works in the simple, single-threaded sandbox script I'm using to try things out (see the sketch after the list below). There are a few things wrong with this solution:
- the signal triggers only once; if there is more than one problematic line, I'm stuck on the second one
- the timer starts counting right away, not when the actual parsing starts
- because of the GIL, I have to do all the signal setup in the main thread, and signals are only received in the main thread; this clashes with the fact that multiple files are meant to be parsed simultaneously in separate threads
- there is also only one global timeout exception raised, and I don't see how to tell in which thread I need to react to it
- I've read several times now that threads and signals do not mix very well
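For reference, the single-threaded sandbox version that does interrupt a runaway match looks roughly like this (a condensed sketch, not my full script; pattern and data_file stand in for the real objects, and the alarm is re-armed per line to work around the first two points):

from signal import signal, alarm, SIGALRM

class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException()

signal(SIGALRM, timeout_handler)

for line in data_file:
    alarm(8)                      # (re-)arm the timer for every line
    try:
        match = pattern.match(line)
    except TimeoutException:
        match = None              # give up on this line and move on
    alarm(0)                      # cancel the alarm once the line is done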
I have also considered doing the regex match in a separate process, but before I get into that, I thought I'd better check here if anyone has come across this problem before and could give me some hints on how to solve it.
Update
The regex looks like this (well, one of them anyway; the problem occurs with other regexes, too - this is the simplest one):
'^(\d{5}), .+?, (\d{8}), (\d{4}), .+?, .+?,' + 37 * ' (.*?),' + ' (.*?)$'
Sample data:
95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -
As mentioned, the regex usually performs OK - I can parse several hundred files with several hundred lines each in less than a minute. That's when the files are complete, though - the code seems to hang on files that have incomplete lines, such as:
95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8, 23.7, 1004.7, 20.6, 0.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999
I also get cases where the regex seems to return right away and report a non-match.
Update 2
I have only quickly read through the article on catastrophic backtracking, but as far as I can tell so far, that's not the cause - I do not nest any repetition operators.
I'm on Mac OSX, so I can't use RegexBuddy to analyze my regex. I tried RegExhibit (which apparently uses a Perl RegEx engine internally) - and that runs away, too.
4 Answers
You are running into catastrophic backtracking; not because of nested quantifiers but because your quantified characters also can match the separators, and since there are a lot of them, you'll get exponential time in certain cases.
Aside from the fact that it looks more like a job for a CSV parser, try something along these lines:
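# a sketch of the fix described below - none of the quantified parts
# may match a comma, so the engine cannot shuffle field boundaries
# around while backtracking:
r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,' + 37 * r' ([^,]*),' + r' ([^,]*)$'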
By explicitly disallowing the comma to match between separators, you'll speed up the regex enormously.
If, for example, commas may be present inside quoted strings, then just exchange [^,]+ (in places where you'd expect this) with something like (?:"[^"]*"|[^,]+). To illustrate:
Using your regex against the first example, RegexBuddy reports a successful match after 793 steps of the regex engine. For the second (incomplete line) example, it reports a match failure after 1,000,000 steps of the regex engine (this is where RegexBuddy gives up; Python will keep on churning).
Using my regex, the successful match happens in 173 steps, the failure in 174.
You can't do it with threads. Go ahead with your idea of doing the match in a separate process.
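A minimal sketch of that idea (the helper names are illustrative; the point is that a child process, unlike a thread, can be killed from the outside once the timeout expires):

import re
from multiprocessing import Process, Queue

def _match_worker(pattern, line, queue):
    # runs in the child process; a runaway match only ever blocks this child
    m = re.match(pattern, line)
    queue.put(m.groups() if m else None)

def match_with_timeout(pattern, line, timeout=10.0):
    queue = Queue()
    child = Process(target=_match_worker, args=(pattern, line, queue))
    child.start()
    child.join(timeout)
    if child.is_alive():      # still churning after `timeout` seconds
        child.terminate()     # kill it; the result for this line is lost
        child.join()
        return None
    return queue.get()

Spawning one process per line is expensive, of course; in practice you would hand each worker a whole file or a batch of lines, but the timeout-and-terminate pattern stays the same.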
Instead of trying to solve the regexp hangup issue with timeouts, maybe it would be worthwhile to consider a completely different kind of approach. If your data really is just comma-separated values, you should get much better performance with the csv module or just using line.split(",").
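A minimal sketch of the csv variant (the file name is a placeholder, and the field count of 44 is what the regex above implies):

import csv

with open('data.txt', 'rb') as f:        # binary mode for the Python 2 csv module
    for row in csv.reader(f, skipinitialspace=True):
        if len(row) != 44:               # incomplete line - skip it instead of hanging
            continue
        station_id, name, date, time = row[:4]   # illustrative field names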
Threading in Python is a weird beast. The Global Interpreter Lock is essentially one big lock around the interpreter, which means only one thread at a time gets to execute within the interpreter.
Thread scheduling is delegated to the OS. Python essentially signals the OS that another thread may take the lock after a certain number of 'instructions'. So if Python is busy due to a run-away regular expression, it never gets the chance to signal the OS that it may try to take the lock for another thread. Hence the reason for using signals; they are the only way to interrupt.
I'm with Nosklo, go ahead and use separate processes. Or, try to rewrite the regular expression so that it doesn't run away. See the problems associated with backtracking. This may or may not be the cause for the poor regex performance, and changing your regex may not be possible. But if this is the cause and it can be changed, you'll save yourself a whole lot of headache by avoiding multiple processes.