Capturing groups return only the last occurrence of each group

Posted on 2024-11-04 06:48:45

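I have a string like this:

String s = "word=PS1,p1,p2,p3=q1,q2|word2=PS3,p4,p5,p6=q3";

or like this:

String s2 = "word3=PS2,p7,p8=q4,q5,q6|=PS3,p9=";

or like this:

String s3 = "=PS3=";

Formally, the string contains word definitions from a dictionary, separated by the "|" character.

Here:

word - the word in the dictionary (optional, as in s2 or s3)
PS1, PS2, PS3 - the part-of-speech tag (required)
p1, p2, ... - some parameters (optional)
q1, q2, q3, ... - some other parameters (also optional)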

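I want to build a regex that finds all occurrences of such strings in a text and gives me these groups:

group1 - the word
group2 - the part-of-speech tag
group3, group4, ... - the p parameters
group(k), group(k+1), ... - the other (q) parameters

I don't care about the group index of the last p parameter or the first q parameter. I only need to know that the first group is the word (which may be null), the second group is the part of speech, and the remaining groups are the p and q parameters.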

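Right now I have this regex:

"([a-z]*)?=([A-Z]+)(,?[a-z]+)*=(,?[a-z]+)*"

But it doesn't work correctly: it only gives me the last p and q parameters. For example, for s2:

group1 = word3 - OK
group2 = PS2 - OK
group3 = p8 - NOT OK (only the last p parameter)
group4 = q6 - NOT OK (again only the last q parameter)

Could you help me?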

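UPDATE: The "=" character is only the separator between the p parameters and the q parameters. It isn't essential to my problem; you can treat the p parameters and the q parameters as the same kind of thing.

Example of real input:

String s = "bread=NOUN,plur,link=form|=VERB=";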



愿得七秒忆 2024-11-11 06:48:45

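A regex can't have a variable number of capturing groups. In .NET each group can hold multiple captures, but Java has no equivalent. The problem you're running into is that the regex engine stores only the last successful match for each group, so a group inside a repetition is overwritten on every iteration. The best you can do is match all the p parameters and all the q parameters as two big groups, then split them up afterwards.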
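The last-match behavior is easy to reproduce in isolation. The following is a minimal sketch, not part of the original answer: the class name `LastCaptureDemo`, the helper `lastCapture`, and the small pattern are all made up for illustration. It puts a single capturing group inside a repetition and reads it back after the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // Hypothetical helper: returns what group 1 holds after the match,
    // or null when the input contains no ",param" run at all.
    static String lastCapture(String input) {
        // Group 1 sits inside a repeated non-capturing group, so each
        // iteration of the "+" overwrites the previous capture.
        Matcher m = Pattern.compile("(?:,([a-z0-9]+))+").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The whole run ",p7,p8,p9" is matched, but only "p9" survives in group 1.
        System.out.println(lastCapture(",p7,p8,p9"));
    }
}
```

The pattern has exactly one capturing group no matter how many parameters the input contains, which is why the earlier values cannot be retrieved through `group(n)`.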

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern1 = Pattern.compile(
    "([^|=,]*)" +                // Group 1: the word. Zero or more characters.
    "=([^|=,]*)" +               // Group 2: the part of speech.
    ",?([^|=,]*(?:,[^|=,]*)*)" + // Group 3: the p-params as one comma-separated string.
    "=([^|=,]*(?:,[^|=,]*)*)"    // Group 4: the q-params as one comma-separated string.
);
Matcher matcher = pattern1.matcher("word=PS1,p1,p2,p3=q1,q2|word2=PS3,p4,p5,p6=q3");
while (matcher.find()) {
  String word = matcher.group(1);
  String partOfSpeech = matcher.group(2);
  String pParamString = matcher.group(3);
  String qParamString = matcher.group(4);
  // "".split(",") yields one empty string, so handle empty parameter lists explicitly.
  String[] pParams = pParamString.isEmpty() ? new String[0] : pParamString.split(",");
  String[] qParams = qParamString.isEmpty() ? new String[0] : qParamString.split(",");
  // Do something with the above variables...
}

I used [^|=,]* to match any run of non-special characters, that is, anything other than "|", "=" and ",".
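Since the question's update says the p and q parameters don't need to be told apart, the same match-then-split idea can be reduced to one "tail" group. The sketch below is one possible variation, not the code from the answer above: the class `DefinitionParser`, the method `parse`, and the simplified pattern are assumptions made for illustration. Each definition is returned as a list laid out the way the question asks for its groups: word first, part-of-speech tag second, then every parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefinitionParser {
    // Group 1: the word (may be empty). Group 2: the tag (at least one character).
    // Group 3: everything after the tag up to the next '|', separators included.
    private static final Pattern ENTRY =
        Pattern.compile("([^|=,]*)=([^|=,]+)([^|]*)");

    // Hypothetical helper: one list per definition, [word, tag, param, param, ...].
    static List<List<String>> parse(String text) {
        List<List<String>> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            List<String> fields = new ArrayList<>();
            fields.add(m.group(1)); // empty string (not null) when the word is absent
            fields.add(m.group(2));
            // Split the tail on ',' or '=' and skip the empty pieces
            // produced by leading or adjacent separators.
            for (String piece : m.group(3).split("[,=]")) {
                if (!piece.isEmpty()) {
                    fields.add(piece);
                }
            }
            entries.add(fields);
        }
        return entries;
    }

    public static void main(String[] args) {
        for (List<String> entry : parse("bread=NOUN,plur,link=form|=VERB=")) {
            System.out.println(entry);
        }
    }
}
```

If the distinction between p and q parameters matters after all, the four-group pattern from the answer is the better fit, because this variation deliberately discards the position of the second "=".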

