How can I find out which group matched in a Java regex without a linear search?
I have some huge, programmatically assembled regexes, like this:
(A)|(B)|(C)|...
Each sub-pattern is in its own capturing group. When I get a match, how do I figure out which group matched without linearly testing each group(i) to see whether it returns a non-null string?
If your regex is programmatically generated, why not programmatically generate n separate regexes and test each of them in turn? Unless they share a common prefix and the Java regex engine is clever, all alternatives get tested anyway.
Update: I just looked through the Sun Java source, in particular java.util.regex.Pattern$Branch.match(), and it simply does a linear search over all alternatives, trying each in turn. The other places where Branch is used do not suggest any kind of optimization of common prefixes.
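A minimal sketch of the "n separate regexes" idea, assuming each sub-pattern is compiled once and tried in order at a given offset. The class name, the method name, and the three sub-patterns are hypothetical placeholders for A, B, C, not anything from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SeparatePatterns {
    // Hypothetical sub-patterns standing in for A, B, C, ...
    static final Pattern[] PATTERNS = {
        Pattern.compile("\\d+"),
        Pattern.compile("[a-z]+"),
        Pattern.compile("\\s+")
    };

    /** Returns the index of the first pattern that matches at offset pos, or -1 if none does. */
    static int firstMatchingAt(CharSequence input, int pos) {
        for (int i = 0; i < PATTERNS.length; i++) {
            Matcher m = PATTERNS[i].matcher(input);
            m.region(pos, input.length());
            // lookingAt() anchors the attempt at the start of the region.
            if (m.lookingAt()) {
                return i;
            }
        }
        return -1;
    }
}
```

The loop index doubles as the answer to "which alternative matched", so no group inspection is needed afterwards. This is still a linear search over the alternatives, just one performed in your own code rather than inside the regex engine.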
You can use non-capturing groups. Instead of:
(A)|(B)|(C)|...
replace with
((?:A)|(?:B)|(?:C))
The non-capturing groups (?:) will not be included in the group count, but the result of the branch will be captured in the outer () group.
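As a rough illustration of this form, with hypothetical sub-patterns standing in for A, B, C (the class and method names are also made up for the example), the matcher reports a single group that holds whatever text the alternation matched:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonCapturingDemo {
    // Hypothetical alternatives standing in for A, B, C.
    static final Pattern P = Pattern.compile("((?:\\d+)|(?:[a-z]+)|(?:\\s+))");

    /** Returns the text captured by the single outer group, or null if nothing matches. */
    static String firstMatch(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    /** Number of capturing groups in the pattern; non-capturing groups are not counted. */
    static int groupCount() {
        return P.matcher("").groupCount();
    }
}
```

Note that group(1) gives the matched text only. It does not say which alternative produced it, so this suits cases where only the text is needed.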
Break up your regex into three:
The alternative is:
I don't think you can get around the linear search, but you can make it a lot more efficient by using start(int) instead of group(int). This way, instead of generating a substring for every group, you just query an int value representing its starting index.
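A short sketch of that scan, assuming the sub-patterns contain no capturing groups of their own, so that group i corresponds to the i-th alternative. The class name, method names, and sub-patterns are hypothetical. Matcher.start(int) returns -1 for a group that did not take part in the match, which is what the loop checks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupScan {
    // Hypothetical alternatives standing in for (A)|(B)|(C).
    static final Pattern P = Pattern.compile("(\\d+)|([a-z]+)|(\\s+)");

    /** Returns the 1-based index of the first group that took part in the current match, or -1. */
    static int matchedGroup(Matcher m) {
        for (int i = 1; i <= m.groupCount(); i++) {
            // start(i) is -1 when group i did not participate; no substring is created.
            if (m.start(i) != -1) {
                return i;
            }
        }
        return -1;
    }

    /** Finds the first match in input and reports which group produced it, or -1 if none. */
    static int firstMatchedGroup(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? matchedGroup(m) : -1;
    }
}
```

matchedGroup must only be called after a successful find(), matches(), or lookingAt(), since start(int) throws IllegalStateException otherwise.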
From the various comments, it seems that the simple answer is "no", and that using separate regexes is a better idea. To improve on that approach, you might need to figure out the common pattern prefixes when you generate them, or use your own regex (or other) pattern matching engine. But before you go to all of that effort, you need to be sure that this is a significant bottleneck in your system. In other words, benchmark it and see whether the performance is acceptable for realistic input data, and if not, then profile it to see where the real bottlenecks are.