Making a small regular expression more readable
I've got a working regular expression, but I'd like to make it a tad more readable, and I'm far from a regex guru, so I was humbly hoping for some tips.
This is designed to scrape the output of several different compilers, linkers, and other build tools, and is used to build a nice little summary report. It does its job well, but I'm left feeling like I wrote it in a clunky fashion, and I'd sooner learn than keep it the wrong way.
(.*?)\s?:?\s?(informational|warning|error|fatal error)?\s([A-Z]+[0-9][0-9][0-9][0-9]):\s(.*)$
Which, broken down simply, is as follows:
(.*?) # non-greedily match up until...
\s?:?\s? # we come across a possible " : "
(informational|warning|error|fatal error)? # possibly followed by one of these
\s([A-Z]+[0-9][0-9][0-9][0-9]):\s # but 100% followed by this alphanum
(.*)$ # and then capture the rest
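For reference, here is a minimal Python sketch of what the four groups in the breakdown capture. The sample line is invented for illustration and is not real tool output; other engines may differ in small details.

```python
import re

# The pattern exactly as posted in the question.
pattern = re.compile(
    r"(.*?)\s?:?\s?(informational|warning|error|fatal error)?"
    r"\s([A-Z]+[0-9][0-9][0-9][0-9]):\s(.*)$"
)

# Hypothetical build-tool line, made up for this example.
line = "main.c(12) : warning C4996: 'strcpy' is deprecated"

match = pattern.match(line)
source, severity, code, text = match.groups()
print(source, severity, code, text, sep=" | ")
```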
I'm mostly interested in making the 2nd and 4th entry above more... beautiful. For some reason, the regex tester I was using (The Regulator) didn't match plain spaces, so I had to use the \s... but it is not meant to match any other whitespace.
Any schooling will be greatly appreciated.
The easiest way to make a long regex more readable is to use the "free-spacing" (or x) modifier, written /x in Perl or (?x) inline, which would let you write your regex just like you did in the second block of code -- whitespace in the pattern is ignored. This isn't supported by all engines, however (according to the page linked above, .NET, Java, Perl, PCRE, Python, Ruby and XPath support it).
Note also that in free-spacing mode, you can use [ ] instead of \s if you want to match only a space character (unless you're using Java, in which case you have to use a backslash followed by a space, which is an escaped space).
There's not really anything you can do for the second line if you want each element to be optional independently of the other elements, but the fourth can be shortened to \s([A-Z]+\d{4}):\s. Here \d is a shorthand class equivalent to [0-9], and {4} specifies that it should appear exactly four times.
The third line can be slightly shortened as well ((?:…) specifies a non-capturing group). From an efficiency standpoint, unless you actually need to capture subpatterns each time you use brackets, you can remove all of them except on the third line, where the group is needed for the alternation -- but that one can be made non-capturing. Putting this all together, you'd get a free-spaced pattern that writes the digits as \d{4}, makes the alternation group non-capturing, and keeps only the capturing groups you actually need.
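As a rough illustration of those suggestions, here is one way the pattern might look in Python's free-spacing mode (re.VERBOSE). This is a sketch, not the answer's original code: the name BUILD_MESSAGE and the sample line are made up, and keeping three capturing groups assumes the report needs the source, the code, and the message text.

```python
import re

# Free-spaced rewrite of the pattern from the question.
# In re.VERBOSE mode, whitespace and "#" comments in the pattern are
# ignored, so every literal space must be written as [ ], including
# the one inside "fatal error".
# re.ASCII makes \d mean exactly [0-9] for str patterns.
BUILD_MESSAGE = re.compile(
    r"""
    (.*?)                                            # non-greedy: where the message came from
    [ ]?:?[ ]?                                       # a possible ' : '
    (?:informational|warning|error|fatal[ ]error)?   # severity, non-capturing
    [ ]([A-Z]+\d{4}):[ ]                             # message code such as C4996
    (.*)$                                            # the rest of the line
    """,
    re.VERBOSE | re.ASCII,
)

# Hypothetical build-tool line, made up for this example.
match = BUILD_MESSAGE.match("main.c(12) : warning C4996: 'strcpy' is deprecated")
print(match.groups())
```

If the summary report needs the severity as well, dropping the ?: from the third line turns it back into a capturing group.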
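Line 2
I think your regular expression doesn't match the comment. \s?:?\s? makes each of the three characters optional on its own, so it also accepts a lone space or a lone colon. You probably want the whole " : " sequence to be optional as a unit instead, which means wrapping it in a group followed by ?. To avoid capturing it, make that group non-capturing with (?:…).
You should be able to use a literal space instead of \s. This must be a restriction in the tool you are using.
Line 4
[0-9][0-9][0-9][0-9] can be replaced with [0-9]{4}. In some languages [0-9] is equivalent to \d.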
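The Line 2 point can be checked with a short Python sketch. The two pattern names are made up for illustration, and the stricter form assumes the tools always print the separator as space, colon, space, which is worth verifying against real output before switching.

```python
import re

# Each piece optional on its own: also accepts a lone space or a lone colon.
piecewise = re.compile(r"\s?:?\s?")
# The whole " : " optional as a unit, without capturing it.
as_unit = re.compile(r"(?: : )?")

for text in [" : ", ":", " ", ""]:
    print(repr(text), bool(piecewise.fullmatch(text)), bool(as_unit.fullmatch(text)))

# Line 4: the {4} quantifier is equivalent to four repeated [0-9] classes.
print(bool(re.fullmatch(r"[A-Z]+[0-9]{4}", "LNK1104")))
```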
Perhaps you can build the RE from named sub-expressions, so that your end RE reads as a short sequence of those names rather than one long string.
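A minimal Python sketch of that idea follows. The variable names and the sample line are hypothetical, chosen only to mirror the breakdown in the question; the pieces concatenate to the same pattern the question posted, with [0-9]{4} for the digits.

```python
import re

# Named building blocks (names are illustrative, not from the original answer).
SOURCE = r"(.*?)"                                           # non-greedy prefix
SEPARATOR = r"\s?:?\s?"                                     # a possible " : "
SEVERITY = r"(informational|warning|error|fatal error)?"    # optional severity
CODE = r"([A-Z]+[0-9]{4})"                                  # e.g. LNK1104
MESSAGE = r"(.*)"                                           # rest of the line

# The final expression is just the pieces in order.
BUILD_MESSAGE = re.compile(
    SOURCE + SEPARATOR + SEVERITY + r"\s" + CODE + r":\s" + MESSAGE + "$"
)

# Hypothetical linker line, made up for this example.
match = BUILD_MESSAGE.match("LINK : fatal error LNK1104: cannot open file 'foo.lib'")
print(match.groups())
```

Each piece can then be tested or changed on its own, which is most of the readability gain.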