How to find out which group matched in a Java regex without a linear search?

Posted 2024-07-29 05:00:27

I have a huge, programmatically assembled regex of this form:

(A)|(B)|(C)|...

Each sub-pattern is in its own capturing group. When I get a match, how do I figure out which group matched without linearly testing each group(i) to see whether it returns a non-null string?

究竟谁懂我的在乎 2024-08-05 05:00:27

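If your regex is generated programmatically, why not generate n separate regexes programmatically and test each of them in turn? Unless the alternatives share a common prefix and the Java regex engine is clever about it, all of the alternatives get tested anyway.

Update: I just looked through the Sun Java source code, in particular java.util.regex.Pattern$Branch.match(), and it also simply does a linear search over all the alternatives, trying each one in turn. The other places where Branch is used do not suggest any kind of optimization for common prefixes.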
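The separate-regex idea could look roughly like the sketch below. It assumes the sub-patterns are still available as individual strings before they are joined with `|`; the class name `SeparatePatterns`, the method `firstMatching`, and the sample sub-patterns are hypothetical and not from the question. Each pattern is compiled once up front, so only the matching cost is paid per input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: tests each sub-pattern in turn instead of one big alternation.
final class SeparatePatterns {
    private final List<Pattern> patterns = new ArrayList<>();

    SeparatePatterns(List<String> regexes) {
        // Compile once; compiled Pattern objects are immutable and can be reused.
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
    }

    // Returns the 0-based index of the first sub-pattern that matches the
    // whole input, or -1 if none of them does.
    int firstMatching(CharSequence input) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(input).matches()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sample sub-patterns, chosen only for illustration.
        SeparatePatterns sp =
            new SeparatePatterns(Arrays.asList("\\d+", "[a-z]+", "\\s+"));
        System.out.println(sp.firstMatching("hello"));
        System.out.println(sp.firstMatching("42"));
        System.out.println(sp.firstMatching("?"));
    }
}
```

This is still a linear search; it only moves the loop from the regex engine into your own code, where the index of the matching alternative is known directly.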

梦行七里 2024-08-05 05:00:27

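You can use non-capturing groups. Instead of:

(A)|(B)|(C)|...

use

((?:A)|(?:B)|(?:C))

The non-capturing groups (?:) are not included in the group count, but the result of whichever branch matched is captured in the outer () group.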
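A small sketch of what this rewrite gives you, using made-up alternatives (`foo`, `bar`, `baz`) and a hypothetical class name `NonCapturingDemo`. Note that the outer group yields the matched text, not the position of the alternative that produced it, so this helps only when the text itself is enough to tell the branches apart.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of wrapping non-capturing alternatives in one outer group.
final class NonCapturingDemo {
    // Sample alternatives, chosen only for illustration.
    static final Pattern PATTERN = Pattern.compile("((?:foo)|(?:bar)|(?:baz))");

    // Returns the text captured by the single outer group, or null if
    // no alternative is found anywhere in the input.
    static String matchedText(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Number of capturing groups in the pattern; (?:...) groups are not counted.
    static int capturingGroups() {
        return PATTERN.matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(capturingGroups());
        System.out.println(matchedText("xx bar xx"));
    }
}
```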

甜心 2024-08-05 05:00:27

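Break up your regex into three:

String[] regexes = new String[] { "pattern1", "pattern2", "pattern3" };

for (int i = 0; i < regexes.length; i++) {
  // Compile each sub-pattern on its own (cache the Pattern objects if they are reused)
  Pattern pattern = Pattern.compile(regexes[i]);

  // matches() requires the whole input to match; use find() to search within it
  Matcher matcher = pattern.matcher(inputStr);
  if (matcher.matches()) {
     // sub-pattern i matched: process, optionally break out of loop
  }
}

The alternative is to keep the combined regex and scan its groups linearly after a successful match, either in a helper method:

public int getMatchedGroupIndex(Matcher matcher) {
  // Group 0 is the whole match, so capturing groups run from 1 to groupCount() inclusive
  for (int i = 1; i <= matcher.groupCount(); i++) {
    // group(i) returns null when group i did not take part in the match
    if (matcher.group(i) != null) {
      return i;
    }
  }

  return -1; // no capturing group took part
}

or inline:

for (int i = 1; i <= matcher.groupCount(); i++) {
  if (matcher.group(i) != null) {
     // group i matched: process, optionally break out of loop
  }
}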

浊酒尽余欢 2024-08-05 05:00:27

I don't think you can get around the linear search, but you can make it a lot more efficient by using start(int) instead of group(int).

static int getMatchedGroupIndex(Matcher m)
{
  // Capturing groups are numbered 1..groupCount(); group 0 is the whole match
  for (int i = 1, n = m.groupCount(); i <= n; i++)
  {
    // start(i) returns -1 when group i did not take part in the match
    if (m.start(i) != -1)
    {
      return i; // return the group number, not its start offset
    }
  }
  return -1; // no capturing group took part
}

This way, instead of generating a substring for every group, you just query an int value representing its starting index.
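As a usage illustration, the sketch below applies the start(int) scan to each successive match of a small combined pattern. The class name `StartBasedLookup`, the method `matchedGroup`, and the three sample alternatives are assumptions made for the example. The helper must only be called after a successful find() or matches(), because start(int) throws IllegalStateException otherwise.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: report which top-level group produced each match.
final class StartBasedLookup {
    // Returns the number of the first group that took part in the current
    // match, or -1 if none did. Call only after a successful match.
    static int matchedGroup(Matcher m) {
        for (int i = 1, n = m.groupCount(); i <= n; i++) {
            if (m.start(i) != -1) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sample alternatives: digits, lowercase letters, whitespace.
        Pattern p = Pattern.compile("(\\d+)|([a-z]+)|(\\s+)");
        Matcher m = p.matcher("abc 123");
        while (m.find()) {
            System.out.println("group " + matchedGroup(m)
                + " matched \"" + m.group() + "\"");
        }
    }
}
```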

享受孤独 2024-08-05 05:00:27

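From the various comments, it seems that the simple answer is "no", and that using separate regexes is a better idea. To improve on that approach, you might need to work out the common pattern prefixes when you generate the regexes, or use your own regex (or other) pattern matching engine. But before you go to all of that effort, you need to be sure that this is a significant bottleneck in your system. In other words, benchmark it and see whether the performance is acceptable for realistic input data, and if it is not, profile it to see where the real bottlenecks are.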
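A very rough way to start measuring is sketched below. Everything in it is an assumption for illustration: the class name `RoughTiming`, the three sample alternatives, the inputs, and the round count. Timing with System.nanoTime() in a plain loop is sensitive to JIT warm-up and other noise, so treat the numbers as a first hint only; a dedicated harness such as JMH and realistic input data would give more trustworthy results.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, rough comparison of one combined regex vs. separate regexes.
final class RoughTiming {
    static final String[] PARTS = { "\\d+", "[a-z]+", "\\s+" };
    static final Pattern COMBINED = Pattern.compile("(\\d+)|([a-z]+)|(\\s+)");
    static final Pattern[] SEPARATE = new Pattern[PARTS.length];
    static {
        for (int i = 0; i < PARTS.length; i++) {
            SEPARATE[i] = Pattern.compile(PARTS[i]);
        }
    }

    // 1-based number of the alternative matching the whole input, or -1.
    static int viaCombined(String input) {
        Matcher m = COMBINED.matcher(input);
        if (!m.matches()) {
            return -1;
        }
        for (int i = 1, n = m.groupCount(); i <= n; i++) {
            if (m.start(i) != -1) {
                return i;
            }
        }
        return -1;
    }

    // Same result, computed by trying each separate pattern in turn.
    static int viaSeparate(String input) {
        for (int i = 0; i < SEPARATE.length; i++) {
            if (SEPARATE[i].matcher(input).matches()) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] inputs = { "12345", "hello", "   ", "?" };
        int rounds = 200_000;
        long sink = 0; // accumulate results so the loops are not optimized away

        long t0 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String s : inputs) sink += viaCombined(s);
        }
        long t1 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String s : inputs) sink += viaSeparate(s);
        }
        long t2 = System.nanoTime();

        System.out.println("combined: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("separate: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("checksum: " + sink);
    }
}
```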
