正则表达式优化问题

发布于 2024-09-08 15:01:10 字数 625 浏览 7 评论 0原文

作为个人学习练习，我编写了这个正则表达式，将一元字符串拆分为长度为 2 的递增幂的部分（另请参阅ideone.com）：

    for (String s :
       new String(new char[500])
       .split("(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2\\2.\\1)^.*)")
    ) {
       System.out.printf("%s ", s.length());
    }
    // prints "1 2 4 8 16 32 64 128 245 "

它使用了环视期间捕获、嵌套环视、反向引用匹配和无限长度环视（Java 中未正式支持，但无论如何都可以）的组合。还利用了二的幂和的属性以及字符串具有一元字母表的事实。

这个解决方案既不可读，而且性能也很糟糕。

我的问题是，你将如何“优化”这个正则表达式？

你能让正则表达式更具可读性吗（如果它表现更差也没关系）
你能让正则表达式表现更好吗（如果它可读性较差也没关系）

原文

As a personal learning exercise, I wrote this regex to split a unary string into parts whose length is increasing powers of two (see also on ideone.com):

    for (String s :
       new String(new char[500])
       .split("(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2\\2.\\1)^.*)")
    ) {
       System.out.printf("%s ", s.length());
    }
    // prints "1 2 4 8 16 32 64 128 245 "

This uses a combination of capturing during lookarounds, nested lookarounds, matching on backreferences, and infinite length lookbehind (which isn't officially supported in Java but works anyway). Properties of sums of powers of two and the fact that the string has a unary alphabet is also taken advantage of.

This solution is both unreadable and has a horrible performance.

My question is, how would you "optimize" this regex?

Can you make the regex more readable (it's okay if it performs worse)
Can you make the regex performs better (it's okay if it's less readable)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

妄想挽回 2024-09-15 15:01:10

我不是 Java 人员，因此我的答案基于 .NET Regex 实现。我使用的

"(?<=(^.*)\\G.*)(?<=\\G\\1.)"

基础是 \sum_{i=0}^{n} 2^n = 2^{n+1} - 1。它基本上是“匹配每个位置，其中最后一场比赛之后的部分比最后一场比赛之前的部分长一个”。

它的速度大约是原始版本的两倍（再次在 .NET 上），只需不到 2 秒即可分割 10,000 个字符，而且我认为它的可读性更强。嗯...不太难读。 =）

干杯！好问题！ =)

编辑：再次查看您的正则表达式，在我看来您正在使用相同的方法，但以更复杂的方式。我承认在尝试制定自己的解决方案之前我没有尝试阅读您的解决方案，既因为我喜欢挑战，又因为您的正则表达式非常难以阅读。 =)
由于 Java 正则表达式引擎，这些嵌套环视是否是必要的？

I am not a Java guy, therefore my answers is based on the .NET Regex implementation. I used

"(?<=(^.*)\\G.*)(?<=\\G\\1.)"

based on the fact that \sum_{i=0}^{n} 2^n = 2^{n+1} - 1. It basically reads "Match every position, for which the part after the last match is one longer than the part before the last match."

Its about twice as fast as your original (on .NET, again), taking less than 2 seconds to split 10.000 characters, and I'd argue that its a little more readable. Well... less unreadable. =)

Cheers! Nice question! =)

Edit: Looking at your Regex again, it seems to me that you are using the same approach, but in a more complicated manner. I admit that I have not tried to read yours before trying to make my own solution, both because I like a challenge and because your regex is quite unreadable. =)
Are these nested lookarounds neccessary because of the Java regex engine?

回复收藏 0 原文

回忆躺在深渊里 2024-09-15 15:01:10

我真的不会。我会把整个东西扔掉，然后将其重做为可读的程序代码。

有些事情你确实不应该用正则表达式来做。这是其中之一。我完全赞成自学，但你真的认为这在某些时候会派上用场吗？

也许你最好学习一些实际上可用和可维护的东西:-)

回复收藏 0 原文

岁月打碎记忆 2024-09-15 15:01:10

这些模式在 Java 中对我有用。我最终会将所有内容修改为一个带有完整解释的综合答案。这些都是 Java 字符串文字表示形式。

P000: <代码>"(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2\\2 .\\1)^.*)"
- 原始图案
- 2.23 秒内 500 个 (ideone.com)，4.1 秒内 666 个 (ideone.com)
P001: "(?=(.*))(?<=(?<; =(^.*))\\G.*)(?<=(?=\\2.\\1)\\G.*)"
- 本质上与 P000 相同，但我们不是将 ^\2\G\2.\1 剪掉，而是将头部剪掉为 \G\2.\1 >
- 2.21 秒内达到 500 个 (ideone.com)
P002: "(?=(.*) )(?<=(?<=(^.*))(?=\\2.\\1)\\G.*)"
- 比 P000 慢得多，但更短
- 本质上是 P001 的“重构”，现在两个lookbehind都锚定在\G
- 3.05 秒内 400 (ideone.com)