UNIX-style RegExp replace runs extremely slowly on Windows. Help? EDIT: negative lookahead assertion is hurting performance

Posted on 2024-08-27 11:50:30

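I am trying to run a Unix-style regexp against every log file in a 1.12 GB directory and replace each match with ''. A test run on a 4 MB file took about 10 minutes, but it did succeed. Clearly something is hurting performance by several orders of magnitude.

UPDATE: I have noticed that searching for ^(155[0-2]).*$ in a 5.6 MB file that contains 77 matches takes about 7 seconds. Adding the negative lookahead assertion ?!, so that the regexp becomes ^(?!155[0-2]).*$, makes it take at least 5-10 minutes; granted, there will be thousands and thousands of matches.

Are negative lookahead assertions extremely detrimental to performance when there are many matches?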

懵少女 2024-09-03 11:50:30

If you can get rid of that .* at the beginning it would help. What can be before it, just whitespace? If so, try:

^(?!\s*155[0-2][0-9]{4}\s).*$

If it really can be anything, try making it non-greedy:

^(?!.*?155[0-2][0-9]{4}\s).*$

Note: in both examples, I removed the second .*, since the third one would match the same thing as well.
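
As a rough illustration of how the two suggested patterns differ, here is a minimal Python sketch. The patterns are copied from above; the sample log lines, the `strip_matches` helper, and the choice of Python's `re` module are assumptions made for this example, not details from the question. Lines are processed one at a time so that `\s*` cannot run across a newline.

```python
import re

# Patterns copied from the answer; everything else here is illustrative.
ANCHORED = re.compile(r"^(?!\s*155[0-2][0-9]{4}\s).*$")
ANYWHERE = re.compile(r"^(?!.*?155[0-2][0-9]{4}\s).*$")


def strip_matches(lines, pattern):
    """Replace every line the pattern matches with '' (the question's replace)."""
    return [pattern.sub("", line) for line in lines]


# Invented sample lines: code at the start, after whitespace, mid-line, absent.
sample = [
    "15501234 login ok",
    "  15519999 retry",
    "GET /index 15520000 done",
    "unrelated line",
]

anchored_result = strip_matches(sample, ANCHORED)
anywhere_result = strip_matches(sample, ANYWHERE)

for original, a, b in zip(sample, anchored_result, anywhere_result):
    print(repr(original), "->", repr(a), "|", repr(b))
```

With these sample lines, the whitespace-only variant blanks the third line because the code is not at the start, while the non-greedy variant keeps it.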

It helps to think about what the regex engine will actually be doing.

1. Match ^ (beginning of line). No problem.
2. Try to match the negative look-ahead assertion:
   a. Grab as much as possible with .*. This means it grabs the entire line.
   b. Is the next character 1? If not, make the .* match one fewer character and repeat until it does match a 1.

You can see that this means for every line that doesn't contain the 155 pattern, the engine will backtrack through the entire line before the look-ahead gives up. Now, if you just use \s* at the beginning, then that will only grab whitespace, not the entire line. If it really can be anything, .*? will be faster on lines that do match the 155 pattern, and it will be about the same on lines that don't. (On lines that don't match, it will keep growing the .*? until it has grabbed the whole line.)
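
The cost difference described above can be approximated with a small hand-written model. This is only a sketch: it counts the offsets at which the tail of the look-ahead is tried, in the order a simple backtracking engine would visit them, and ignores any optimizations a real engine may apply. The `attempts` function and the sample lines are hypothetical names and data chosen for this example.

```python
import re

# The part of the look-ahead that follows the leading .* or .*?
TAIL = re.compile(r"155[0-2][0-9]{4}\s")


def attempts(line, greedy):
    """Count the offsets tried before TAIL matches, or before all of them fail.

    A greedy .* starts at the end of the line and gives back one character
    at a time; a lazy .*? starts at offset 0 and grows one character at a time.
    """
    offsets = range(len(line), -1, -1) if greedy else range(len(line) + 1)
    for tried, offset in enumerate(offsets, start=1):
        if TAIL.match(line, offset):
            return tried
    return len(line) + 1


hit = "15501234 " + "x" * 91   # 100 characters, code at the very start
miss = "x" * 100               # 100 characters, no code anywhere

print("hit,  greedy:", attempts(hit, greedy=True))
print("hit,  lazy:  ", attempts(hit, greedy=False))
print("miss, greedy:", attempts(miss, greedy=True))
print("miss, lazy:  ", attempts(miss, greedy=False))
```

In this model the lazy form finds the code at the start of a line on the first try, the greedy form has to give back the whole line first, and both visit every offset when the code is absent.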
