Capture groups return only the last occurrence of each group

Posted 2024-11-04 06:48:45

I have a string like this:

String s = "word=PS1,p1,p2,p3=q1,q2|word2=PS3,p4,p5,p6=q3";

or like this:

String s2 = "word3=PS2,p7,p8=q4,q5,q6|=PS3,p9=";

or like this:

String s3 = "=PS3=";

Formally, the string contains word definitions from a dictionary, separated by the "|" symbol.

Here:

  • word - the word in the dictionary (optional, as in s2 or s3)

  • PS1, PS2, PS3 - part-of-speech tag (required)

  • p1, p2, ... - some parameters (optional)

  • q1, q2, q3, ... - some other parameters (also optional)

I want to build a regex that finds all occurrences of such strings in a text and gives me these groups:

  • group1 - word
  • group2 - part of speech tag
  • group3, group4, ... - parameters p
  • group(k), group(k+1), ... - the other parameters (q)

I don't care which group index the last p parameter or the first q parameter ends up at. I only need to know that the first group is the word (possibly null), the second group is the part of speech, and the remaining groups are the p and q parameters.

Now I have this regex:

"([a-z]*)?=([A-Z]+)(,?[a-z]+)*=(,?[a-z]+)*"

But it doesn't work correctly. It gives me only the last p and q parameters. For example, for s2:

  • group1 = word3 - OK
  • group2 = PS2 - OK
  • group3 = p8 - NOT OK (only last p-parameter)
  • group4 = q6 - NOT OK (also last q-parameter)

Could you help me?

UPDATE:
The "=" character is only the separator between the p-parameters and the q-parameters. That distinction is not essential to my problem, so you can treat p-parameters and q-parameters as the same kind of thing.

Example of real input:

String s = "bread=NOUN,plur,link=form|=VERB=";

Comments (2)

愿得七秒忆 2024-11-11 06:48:45

You can't have a variable number of capture groups in a regex: the number of groups is fixed by the pattern itself. In .NET a group can hold multiple captures, but not in Java. The problem for you is that the regex engine stores only the last successful match for each group. The best you can do is to match all p- and q-parameters into two big groups and then split them up.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern1 = Pattern.compile(
    "([^|=,]*)" +                // Group 1: the word. Zero or more characters.
    "=([^|=,]*)" +               // Group 2: the part of speech.
    ",?([^|=,]*(?:,[^|=,]*)*)" + // Group 3: all p-params as one comma-separated string.
    "=([^|=,]*(?:,[^|=,]*)*)"    // Group 4: all q-params as one comma-separated string.
);
Matcher matcher = pattern1.matcher("word=PS1,p1,p2,p3=q1,q2|word2=PS3,p4,p5,p6=q3");
while (matcher.find()) {
  String word = matcher.group(1);
  String partOfSpeech = matcher.group(2);
  String pParamString = matcher.group(3);
  String qParamString = matcher.group(4);
  // "".split(",") yields one empty string, so treat an empty group as "no parameters".
  String[] pParams = pParamString.isEmpty() ? new String[0] : pParamString.split(",");
  String[] qParams = qParamString.isEmpty() ? new String[0] : qParamString.split(",");
  // Do something with the above variables...
}

I used [^|=,]* to match any run of characters other than the delimiters |, = and ,.
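
Following the question's UPDATE, where p- and q-parameters need not be told apart, the same match-then-split idea can be sketched with a single parameter group. This is only a sketch: the class name EntryParser, the pattern, and the choice to return each entry as a list are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntryParser {
    // Group 1: word (may be empty); group 2: part of speech (required);
    // group 3: everything up to the next '|', i.e. all p- and q-params together.
    private static final Pattern ENTRY =
        Pattern.compile("([^|=,]*)=([^|=,]+)([^|]*)");

    // Returns one list per entry: [word, partOfSpeech, param1, param2, ...].
    static List<List<String>> parse(String input) {
        List<List<String>> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            List<String> entry = new ArrayList<>();
            entry.add(m.group(1));
            entry.add(m.group(2));
            // Parameters are separated by ',' or '='; skip the empty pieces
            // produced by a leading separator or by adjacent separators.
            for (String param : m.group(3).split("[,=]")) {
                if (!param.isEmpty()) {
                    entry.add(param);
                }
            }
            entries.add(entry);
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(parse("bread=NOUN,plur,link=form|=VERB="));
    }
}
```

For the real input from the question this prints [[bread, NOUN, plur, link, form], [, VERB]]: the second entry has an empty word and no parameters. The variable-length part is handled by split rather than by capture groups, which is why the number of groups in the pattern can stay fixed.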

物价感观 2024-11-11 06:48:45

When I have problems like this, I look at the modifiers on the quantifiers. You may want to modify some of the quantifiers to be greedy, e.g.

(,?[a-z]+)+*

The difference above is that the final zero-or-more quantifier now grabs as much as it can. This is just an example, and I'm not at all sure that this particular modifier is what you need; but given that your expression works as you reported, it seems likely that these modifiers will get you the rest of the way.
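
It may be worth checking this suggestion before relying on it. Java quantifiers are greedy by default, and the modifiers available are reluctant (*?) and possessive (*+). The sketch below assumes the possessive form was the one intended, and uses a hypothetical class name and a character class widened to allow the digits in the sample parameters. It compares what group 1 holds under each mode.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierModeDemo {
    // Matches the whole input against the regex and returns group 1,
    // or null if the input does not match.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "p7,p8,p9";
        System.out.println(group1("(,?[a-z0-9]+)*", input));   // greedy (default)
        System.out.println(group1("(,?[a-z0-9]+)*+", input));  // possessive
        System.out.println(group1("(,?[a-z0-9]+)*?", input));  // reluctant
    }
}
```

All three calls return the same value, ",p9": the quantifier mode affects how the engine searches for a match, not how many captures a repeated group keeps, so group 1 still holds only the last iteration.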
