Python: how to interrupt a regex match
I iterate over the lines in a large number of downloaded text files and do a regex match on each line. Usually, the match takes less than a second. At times, however, a match takes several minutes; sometimes the match does not finish at all and the code just hangs (I waited an hour a couple of times, then gave up). Therefore, I need to introduce some kind of timeout and somehow tell the regex matching code to stop after 10 seconds or so. I can live with the fact that I will lose the data the regex was supposed to return.
I tried the following (which, admittedly, shows two different thread-based solutions in one code sample):
from threading import Thread, Timer

def timeout_handler():
    print 'timeout_handler called'

if __name__ == '__main__':
    timer_thread = Timer(8.0, timeout_handler)
    # args must be a tuple - note the trailing comma
    parse_thread = Thread(target=parse_data_files, args=(my_args,))
    timer_thread.start()
    parse_thread.start()
    parse_thread.join(12.0)
    print 'do we ever get here ?'
but I get neither the timeout_handler called line nor the do we ever get here ? line in the output; the code is just stuck in parse_data_files.
Even worse, I can't even stop the program with CTRL-C; instead, I need to look up the Python process number and kill that process. Some research showed that the Python developers are aware of runaway regex C code: http://bugs.python.org/issue846388
I did achieve some success using signals:
from signal import signal, alarm, SIGALRM

signal(SIGALRM, timeout_handler)
alarm(8)   # deliver SIGALRM after 8 seconds
data_sets = parse_data_files(config(), data_provider)
alarm(0)   # cancel the alarm once parsing is done
This gets me the timeout_handler called line in the output, and I can still stop my script using CTRL-C. If I now modify the timeout_handler like this:
class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException()
and enclose the actual call to re.match(...) in a try ... except TimeoutException clause, the regex match actually does get interrupted. Unfortunately, this only works in the simple, single-threaded sandbox script I'm using to try things out (see the sketch after the list below). There are a few things wrong with this solution:
- the signal triggers only once; if there is more than one problematic line, I'm stuck on the second one
- the timer starts counting right away, not when the actual parsing starts
- because of the GIL, I have to do all the signal setup in the main thread, and signals are only received in the main thread; this clashes with the fact that multiple files are meant to be parsed simultaneously in separate threads
- there is also only one global timeout exception raised, and I don't see how to tell in which thread I need to react to it
- I've read several times now that threads and signals do not mix very well
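For reference, the single-threaded sandbox version that does interrupt a runaway match looks roughly like this (a condensed sketch, not my full script; pattern and data_file stand in for the real objects, and the alarm is re-armed per line to work around the first two points):

from signal import signal, alarm, SIGALRM

class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException()

signal(SIGALRM, timeout_handler)

for line in data_file:
    alarm(8)                      # (re-)arm the timer for every line
    try:
        match = pattern.match(line)
    except TimeoutException:
        match = None              # give up on this line and move on
    alarm(0)                      # cancel the alarm once the line is done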
I have also considered doing the regex match in a separate process, but before I get into that, I thought I'd better check here if anyone has come across this problem before and could give me some hints on how to solve it.
Update
The regex looks like this (well, one of them anyway; the problem occurs with other regexes, too - this is the simplest one):
'^(\d{5}), .+?, (\d{8}), (\d{4}), .+?, .+?,' + 37 * ' (.*?),' + ' (.*?)$'
Sample data:
95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -
As mentioned, the regex usually performs OK - I can parse several hundred files with several hundred lines each in less than a minute. That's when the files are complete, though - the code seems to hang on files that have incomplete lines, such as:
95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8, 23.7, 1004.7, 20.6, 0.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999
I also get cases where the regex seems to return right away and report a non-match.
Update 2
I have only quickly read through the article on catastrophic backtracking, but as far as I can tell so far, that's not the cause - I do not nest any repetition operators.
I'm on Mac OSX, so I can't use RegexBuddy to analyze my regex. I tried RegExhibit (which apparently uses a Perl RegEx engine internally) - and that runs away, too.
4 Answers
You are running into catastrophic backtracking; not because of nested quantifiers but because your quantified characters also can match the separators, and since there are a lot of them, you'll get exponential time in certain cases.
Aside from the fact that it looks more like a job for a CSV parser, try something along these lines:
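# a sketch of the fix described below - none of the quantified parts
# may match a comma, so the engine cannot shuffle field boundaries
# around while backtracking:
r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,' + 37 * r' ([^,]*),' + r' ([^,]*)$'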
By explicitly disallowing the comma to match between separators, you'll speed up the regex enormously.
If, for example, commas may be present inside quoted strings, then just exchange [^,]+ (in places where you'd expect this) with something like (?:"[^"]*"|[^,]+). To illustrate:
Using your regex against the first example, RegexBuddy reports a successful match after 793 steps of the regex engine. For the second (incomplete line) example, it reports a match failure after 1,000,000 steps of the regex engine (this is where RegexBuddy gives up; Python will keep on churning).
Using my regex, the successful match happens in 173 steps, the failure in 174.
You can't do it with threads. Go ahead with your idea of doing the match in a separate process.
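A minimal sketch of that idea (the helper names are illustrative; the point is that a child process, unlike a thread, can be killed from the outside once the timeout expires):

import re
from multiprocessing import Process, Queue

def _match_worker(pattern, line, queue):
    # runs in the child process; a runaway match only ever blocks this child
    m = re.match(pattern, line)
    queue.put(m.groups() if m else None)

def match_with_timeout(pattern, line, timeout=10.0):
    queue = Queue()
    child = Process(target=_match_worker, args=(pattern, line, queue))
    child.start()
    child.join(timeout)
    if child.is_alive():      # still churning after `timeout` seconds
        child.terminate()     # kill it; the result for this line is lost
        child.join()
        return None
    return queue.get()

Spawning one process per line is expensive, of course; in practice you would hand each worker a whole file or a batch of lines, but the timeout-and-terminate pattern stays the same.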
Instead of trying to solve the regexp hangup issue with timeouts, maybe it would be worthwhile to consider a completely different kind of approach. If your data really is just comma-separated values, you should get much better performance with the csv module or just using line.split(",").
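A minimal sketch of the csv variant (the file name is a placeholder, and the field count of 44 is what the regex above implies):

import csv

with open('data.txt', 'rb') as f:        # binary mode for the Python 2 csv module
    for row in csv.reader(f, skipinitialspace=True):
        if len(row) != 44:               # incomplete line - skip it instead of hanging
            continue
        station_id, name, date, time = row[:4]   # illustrative field names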
Threading in Python is a weird beast. The Global Interpreter Lock is essentially one big lock around the interpreter, which means only one thread at a time gets to execute within the interpreter.
Thread scheduling is delegated to the OS. Python essentially signals the OS that another thread may take the lock after a certain number of 'instructions'. So if Python is busy due to a run-away regular expression, it never gets the chance to signal the OS that it may try to take the lock for another thread. Hence the reason for using signals; they are the only way to interrupt.
I'm with Nosklo, go ahead and use separate processes. Or, try to rewrite the regular expression so that it doesn't run away. See the problems associated with backtracking. This may or may not be the cause for the poor regex performance, and changing your regex may not be possible. But if this is the cause and it can be changed, you'll save yourself a whole lot of headache by avoiding multiple processes.