UNIX-style RegExp Replace running extremely slowly under Windows. Help? EDIT: Negative lookahead assertion impacting performance
I'm trying to run a Unix regexp on every log file in a 1.12 GB directory, then replace the matched pattern with ''. A test run on a 4 MB file took about 10 minutes, but it worked. Obviously something is damaging performance by several orders of magnitude.
UPDATE: I am noticing that searching for ^(155[0-2]).*$ takes ~7 seconds in a 5.6 MB file with 77 matches. Adding the negative lookahead assertion, ?!, so that the regexp becomes ^(?!155[0-2]).*$ makes it take at least 5-10 minutes; granted, there will be thousands and thousands of matches.
Should the negative lookahead assertion be extremely detrimental to performance when there are many matches?
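For readers who want to see what the two expressions from the update actually select, here is a minimal sketch. It assumes Python's re module in multiline mode, which is not necessarily the tool used in the question, and the sample log lines are made up for illustration.

```python
import re

# Made-up sample lines; only the first four characters matter here.
SAMPLE = "1550 alpha\n1553 beta\n1552 gamma\n0001 delta"

# The two patterns from the question, with ^ and $ anchored per line.
POSITIVE = re.compile(r"^(155[0-2]).*$", re.MULTILINE)
NEGATIVE = re.compile(r"^(?!155[0-2]).*$", re.MULTILINE)

# Lines that start with 1550, 1551 or 1552.
positive_hits = [m.group(0) for m in POSITIVE.finditer(SAMPLE)]

# Lines that do NOT start with 1550, 1551 or 1552.
negative_hits = [m.group(0) for m in NEGATIVE.finditer(SAMPLE)]

# Replacing the negative matches with '' blanks those lines but keeps
# their line breaks, which is the replacement described in the question.
blanked = NEGATIVE.sub("", SAMPLE)

if __name__ == "__main__":
    print(positive_hits)
    print(negative_hits)
    print(repr(blanked))
```

The negative pattern matches every line the positive one rejects, so on a typical log it produces far more matches and replacements than the positive one.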
If you can get rid of that .* at the beginning it would help. What can be before it, just whitespace? If so, try \s* in its place. If it really can be anything, try making it non-greedy: .*?
Note: in both cases, I removed the second .*, since the third one would match the same thing as well.
It helps to think about what the regex engine will actually be doing.
1. ^ (beginning of line). No problem.
2. .* This means it grabs the entire line.
3. Does the next character match 1? If not, make the .* match one fewer character and repeat until it does match a 1.
You can see that this means for every line that doesn't match, it will backtrack through the entire line. Now, if you just use \s* at the beginning, then that will only grab whitespace, not the entire line. If it really can be anything, .*? will be faster on lines that do match the 155 pattern, and it will be about the same on lines that don't. (On lines that don't match, it will keep growing the .* until it has grabbed the whole line.)
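The backtracking walk-through above can be made concrete with a toy model. This is only a sketch: it counts how many split points a backtracking engine would try for a prefix followed by a literal 1, and it ignores the rest of the pattern. The function names are hypothetical and it does not reproduce any real engine's internals.

```python
def greedy_attempts(line: str, target: str = "1") -> int:
    """Positions tried when a greedy .* is followed by `target`."""
    attempts = 0
    # .* first grabs the whole line, then gives back one character at a time.
    for taken in range(len(line), -1, -1):
        attempts += 1
        if line.startswith(target, taken):
            break
    return attempts


def lazy_attempts(line: str, target: str = "1") -> int:
    """Positions tried when a non-greedy .*? is followed by `target`."""
    attempts = 0
    # .*? starts with nothing and grows one character at a time.
    for taken in range(len(line) + 1):
        attempts += 1
        if line.startswith(target, taken):
            break
    return attempts


def whitespace_attempts(line: str, target: str = "1") -> int:
    """Positions tried when \\s* is followed by `target`."""
    # \s* can only consume the leading whitespace, so there is little to give back.
    leading = len(line) - len(line.lstrip())
    attempts = 0
    for taken in range(leading, -1, -1):
        attempts += 1
        if line.startswith(target, taken):
            break
    return attempts


if __name__ == "__main__":
    for sample in ("1551 GET /index.html", "2000 GET /index.html"):
        print(sample, greedy_attempts(sample), lazy_attempts(sample),
              whitespace_attempts(sample))
```

In this model the greedy and non-greedy prefixes both walk the whole line when there is no 1 to find, while the whitespace prefix gives up almost immediately, which is the difference the answer describes.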
Basically: the regex implementation you are using is non-linear and can handle only a subset of the regular expression language efficiently. See my question about a regex implementation that can handle machine-generated regexes efficiently for more background.
If you can select another implementation, you're in luck; back when I was looking, these were scarce. Two reasonable options are RE2 and TRE, but both are libraries, not standalone executables.
Another option is to use the Unix utility (grep?) you've used in the past; grep certainly has a Windows port, as do many other Unix utilities.
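In the same spirit as falling back on a line-oriented tool, here is a rough sketch that avoids the negative lookahead altogether: test each line with the positive pattern and blank the line when it does not match. It assumes the goal is the one stated in the question (empty out every line that does not start with 1550-1552); the function names and file paths are hypothetical, and no timing claims are made.

```python
import re
from typing import Iterable, Iterator

# Positive test only; re.match anchors at the start of the line.
KEEP = re.compile(r"155[0-2]")


def blank_unwanted(lines: Iterable[str]) -> Iterator[str]:
    """Yield wanted lines unchanged and unwanted lines as bare line endings."""
    for line in lines:
        body = line.rstrip("\r\n")
        ending = line[len(body):]  # preserve the original line break
        yield line if KEEP.match(body) else ending


def process_file(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path one line at a time (paths are placeholders)."""
    # newline="" keeps \r\n line endings intact on Windows.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        dst.writelines(blank_unwanted(src))
```

Because the file is read one line at a time, memory use stays flat regardless of log size, and each line is examined once by a pattern with no leading wildcard.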