Capture groups return only the last occurrence of each group
I have a string like this:
String s = "word=PS1,p1,p2,p3=q1,q2|word2=PS3,p4,p5,p6=q3";
or like this:
String s2 = "word3=PS2,p7,p8=q4,q5,q6|=PS3,p9=";
or like this:
String s3 = "=PS3=";
Formally, the string contains word definitions from a dictionary, separated by the "|" symbol.
Here:
word - a word in the dictionary (optional, as in s2 or s3)
PS1, PS2, PS3 - part-of-speech tag (required)
p1, p2, ... - some parameters (optional)
q1, q2, q3, ... - some other parameters (also optional)
I want to build a regex that finds all occurrences of such strings in the text and gives me these groups:
- group1 - word
- group2 - part-of-speech tag
- group3, group4, ... - parameters p
- group(k), group(k+1), ... - the other parameters (q)
I don't care about the group index of the last p parameter or the first q parameter. I only need to know that the first group is the word (may be null), the second group is the part of speech, and the remaining groups are the p and q parameters.
Now I have this regex:
"([a-z]*)?=([A-Z]+)(,?[a-z]+)*=(,?[a-z]+)*"
But it doesn't work correctly. It gives me only the last p and q parameters, i.e. (for s2):
- group1 = word3 - OK
- group2 = PS2 - OK
- group3 = p8 - NOT OK (only last p-parameter)
- group4 = q6 - NOT OK (also last q-parameter)
Could you help me?
UPDATE:
The "=" character is only the separator between the p parameters and the q parameters. It isn't essential to my problem; you can treat p parameters and q parameters as the same kind of thing.
Example of real input:
String s = "bread=NOUN,plur,link=form|=VERB=";
Comments (2)
You can't have a variable number of capture groups in a regex. In .NET a group can hold multiple captures, but not in Java. The problem is that the regex engine stores only the last successful match for each group. The best you can do is match all the p and q parameters into two big groups and then split them up.
I used
[^|=,]*
to match any non-special character.
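The "match each parameter list as one big group, then split it" idea could look roughly like the sketch below. The full pattern is not given in the thread, so the regex here is an assumption built around the `[^|=,]` character class; the class name `DictionaryEntryParser` and the `parse` method are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DictionaryEntryParser {

    // Assumed pattern (not from the thread): optional word, '=', tag,
    // optional ",p1,p2,...", '=', optional "q1,q2,...".
    // Each parameter list is captured as ONE group and split afterwards.
    private static final Pattern ENTRY = Pattern.compile(
            "([^|=,]*)=([^|=,]+)((?:,[^|=,]+)*)=([^|=,]*(?:,[^|=,]+)*)");

    /** Returns one list per entry: [word, tag, parameters...]. */
    public static List<List<String>> parse(String input) {
        List<List<String>> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            List<String> fields = new ArrayList<>();
            fields.add(m.group(1));        // word (empty string if absent)
            fields.add(m.group(2));        // part-of-speech tag
            addParams(fields, m.group(3)); // p parameters, e.g. ",plur,link"
            addParams(fields, m.group(4)); // q parameters, e.g. "form"
            entries.add(fields);
        }
        return entries;
    }

    // Splits a comma-separated list and skips the empty pieces that a
    // leading comma or an empty list produces.
    private static void addParams(List<String> out, String commaList) {
        for (String p : commaList.split(",")) {
            if (!p.isEmpty()) {
                out.add(p);
            }
        }
    }

    public static void main(String[] args) {
        // Prints [bread, NOUN, plur, link, form] and then [, VERB]
        for (List<String> entry : parse("bread=NOUN,plur,link=form|=VERB=")) {
            System.out.println(entry);
        }
    }
}
```

Since the update to the question says p and q parameters need not be told apart, both lists go into the same result list; keeping them in two separate lists would only need a second collection.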
When I have problems like this, I look at the modifiers on the quantifiers. You may want to modify some of the quantifiers to be greedy, e.g.
(,?[a-z]+)+*
The difference above is that the final zero-or-more quantifier now grabs as much as it can. This is just an example, and I'm not at all sure that this particular modifier is what you need, but given that your expression works as you reported, it seems likely that these modifiers will get it the rest of the way.
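One way to check whether a quantifier modifier helps is to try each mode on the real input. The sketch below assumes `java.util.regex`, where quantifiers are greedy by default, `*?` is reluctant and `*+` is possessive. It uses a shortened form of the question's regex, and the `QuantifierCheck` class and `lastCapture` method are hypothetical names. In this check all three modes report the same value for the repeated group, which suggests the modifier alone does not recover the earlier parameters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierCheck {

    /**
     * Applies the given quantifier to the repeated p-parameter group and
     * returns what group 3 holds after the first match (null if no match).
     */
    public static String lastCapture(String quantifier, String input) {
        Pattern p = Pattern.compile(
                "([a-z]*)=([A-Z]+)(,?[a-z]+)" + quantifier + "=");
        Matcher m = p.matcher(input);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String input = "bread=NOUN,plur,link=form";
        // greedy, reluctant, possessive
        for (String q : new String[] {"*", "*?", "*+"}) {
            // Each line ends in ",link": only the last iteration is kept.
            System.out.println(q + " -> " + lastCapture(q, input));
        }
    }
}
```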