Java regex needs a fresh pair of eyes: it's too greedy

Posted 2024-11-24 22:26:20

I have a string of the form:

canonical_class_name[key1="value1",key2="value2",key3="value3",...] 

The purpose is to capture the canonical_class_name in a group and then alternating key=value groups. Currently it does not match a test string (in the following program, testString).

There must be at least one key/value pair, but there may be many such pairs.

Question: Currently the regex grabs the canonical class name and the first key correctly, but then it gobbles up everything until the last double quote. How do I make it grab the key/value pairs lazily?

Here is the regular expression which the following program puts together:

(\S+)\[\s*(\S+)\s*=\s*"(.*)"\s*(?:\s*,\s*(\S+)\s*=\s*"(.*)"\s*)*\]

Depending on your preference, you may find the program's version easier to read.

If my program is passed the String:

org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]

...these are the groups I get:

Group1 contains: org.myobject
Group2 contains: key1
Group3 contains: value1", key2="value2", key3="value3

One more note: using String.split() I could simplify the expression, but I'm treating this as a learning exercise to improve my understanding of regex, so I don't want to take that shortcut.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BasicORMParser {
     String regex =
            "canonicalName\\[ map (?: , map )*\\]"
            .replace("canonicalName", "(\\S+)")
            .replace("map", "key = \"value\"")
            .replace("key", "(\\S+)")
            .replace("value", "(.*)")
            .replace(" ", "\\s*"); 

    List<String> getGroups(String ormString){
        List<String> values = new ArrayList();
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(ormString);
        if (matcher.matches() == false){
            String msg = String.format("String failed regex validation. Required: %s , found: %s", regex, ormString);
            throw new RuntimeException(msg);
        }
        if(matcher.groupCount() < 2){
            String msg = String.format("Did not find Class and at least one key value.");
            throw new RuntimeException(msg);
        }
        for(int i = 1; i <= matcher.groupCount(); i++){ // groups are numbered 1..groupCount(), inclusive
            values.add(matcher.group(i));
        }
        return values;
    }
}


Comments (2)

踏月而来 2024-12-01 22:26:21

For non-greedy matching, append a ? after the quantifier. For example, .*? matches as few characters as possible.
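To make the difference concrete, here is a minimal sketch that runs a greedy and a reluctant version of the same quoted-value pattern over a shortened form of the question's input. The class name GreedyVsReluctant and the helper firstGroup are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "key1=\"value1\", key2=\"value2\"";

        // Greedy: .* runs to the end of the input, then backs up to the LAST quote.
        System.out.println(firstGroup("\"(.*)\"", input));  // value1", key2="value2

        // Reluctant: .*? expands one character at a time and stops at the FIRST closing quote.
        System.out.println(firstGroup("\"(.*?)\"", input)); // value1
    }
}
```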

萌酱 2024-12-01 22:26:20

You practically answered the question yourself: make them lazy. That is, use lazy (a.k.a. non-greedy or reluctant) quantifiers. Just change each (\S+) to (\S+?), and each (.*) to (.*?). But if it were me, I'd change those subexpressions so they can never match too much, regardless of greediness. For example, you could use ([^\s\[]+) for the class name, ([^\s=]+) for the key, and "([^"]*)" for the value.
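As a rough sketch of that suggestion, the question's regex could be rewritten with the three negated character classes while keeping its overall shape. The class name NegatedClassRegex and the method firstPair are invented for this example, and only groups 1 to 3 are inspected here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassRegex {
    // Same overall shape as the question's regex, but each wildcard is replaced
    // by a negated character class that cannot run past its own delimiter.
    static final Pattern ORM = Pattern.compile(
        "([^\\s\\[]+)\\[\\s*([^\\s=]+)\\s*=\\s*\"([^\"]*)\"\\s*"
        + "(?:,\\s*([^\\s=]+)\\s*=\\s*\"([^\"]*)\"\\s*)*\\]");

    // Hypothetical helper: returns groups 1-3 (class name, first key, first value).
    static List<String> firstPair(String input) {
        Matcher m = ORM.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return Arrays.asList(m.group(1), m.group(2), m.group(3));
    }

    public static void main(String[] args) {
        String s = "org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]";
        System.out.println(firstPair(s)); // [org.myobject, key1, value1]
    }
}
```

Because [^"]* can never cross a double quote, group 3 stops at the first closing quote whether the quantifier is greedy or not.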

I don't think that's going to solve your real problem, though. Once you've got it so it correctly matches all the key/value pairs, you'll find that it only captures the first pair (groups #2 and #3) and the last pair (groups #4 and #5). That's because, each time (?:\s*,\s*(\S+)\s*=\s*"(.*)"\s*)* gets repeated, those two groups get their contents overwritten, and whatever they captured on the previous iteration is lost. There's no getting around it, this is at least a two-step operation. For example, you could match all of the key/value pairs as a block, then break out the individual pairs.
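One possible shape for that two-step approach is sketched below: the first regex validates the whole string and captures the class name plus the entire pair list as one block, and a second regex is then applied to that block with find(). The names TwoStepParser, WHOLE, ONE_PAIR and parse are assumptions made for this sketch, and the key class additionally excludes commas, which goes slightly beyond the classes suggested above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepParser {
    // One key="value" pair with no capturing groups; used only for validation in step 1.
    private static final String PAIR = "[^\\s=,]+\\s*=\\s*\"[^\"]*\"";

    // Step 1: group 1 = class name, group 2 = the whole block of pairs.
    private static final Pattern WHOLE = Pattern.compile(
        "([^\\s\\[]+)\\[\\s*(" + PAIR + "(?:\\s*,\\s*" + PAIR + ")*)\\s*\\]");

    // Step 2: the same pair, this time capturing the key and the value.
    private static final Pattern ONE_PAIR = Pattern.compile(
        "([^\\s=,]+)\\s*=\\s*\"([^\"]*)\"");

    // Returns the class name followed by alternating keys and values.
    static List<String> parse(String input) {
        Matcher whole = WHOLE.matcher(input);
        if (!whole.matches()) {
            throw new IllegalArgumentException("not a valid ORM string: " + input);
        }
        List<String> values = new ArrayList<>();
        values.add(whole.group(1));
        Matcher pair = ONE_PAIR.matcher(whole.group(2));
        while (pair.find()) {        // each find() picks up the next pair in the block
            values.add(pair.group(1));
            values.add(pair.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        String s = "org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]";
        System.out.println(parse(s));
        // [org.myobject, key1, value1, key2, value2, key3, value3]
    }
}
```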

One more thing. This line:

if(matcher.groupCount() < 2){

...probably isn't doing what you think it does. The value returned by groupCount() is a fixed property of the Pattern: it tells how many capturing groups there are in the regex. Whether the match succeeds or fails, groupCount() will always return the same value, in this case five. If the match succeeds, some of the capture groups may be null (indicating that they didn't participate in the match), but there will always be five of them.
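A small check along these lines illustrates the point, using the regex from the question unchanged. The class name GroupCountDemo and both helper methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // The regex from the question, unchanged: it contains five capturing groups.
    static final Pattern P = Pattern.compile(
        "(\\S+)\\[\\s*(\\S+)\\s*=\\s*\"(.*)\"\\s*(?:\\s*,\\s*(\\S+)\\s*=\\s*\"(.*)\"\\s*)*\\]");

    // groupCount() can be read without attempting a match at all.
    static int groupCountFor(String input) {
        return P.matcher(input).groupCount();
    }

    // Returns group i after a successful matches(), or null if it did not participate.
    static String groupAfterMatch(String input, int i) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return m.group(i);
    }

    public static void main(String[] args) {
        System.out.println(groupCountFor("this does not match"));  // 5
        System.out.println(groupCountFor("a[k=\"v\"]"));           // 5
        // With a single pair, the repeated part runs zero times, so group 4 is null.
        System.out.println(groupAfterMatch("a[k=\"v\"]", 4));      // null
    }
}
```

To find out whether a particular group took part in a match, test group(i) against null rather than looking at groupCount().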


EDIT: I suspect this is what you were trying for initially:

Pattern p = Pattern.compile(
    "(?:([^\\s\\[]+)\\[|\\G)([^\\s=]+)=\"([^\"]*)\"[,\\s]*");

String s = "org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]";
Matcher m = p.matcher(s);
while (m.find())
{
  if (m.group(1) != null)
  {
    System.out.printf("class : %s%n", m.group(1));
  }
  System.out.printf("key : %s, value : %s%n", m.group(2), m.group(3));
}

output:

class : org.myobject
key : key1, value : value1
key : key2, value : value2
key : key3, value : value3

The key to understanding the regex is this part: (?:([^\s\[]+)\[|\G). On the first pass it matches the class name and the opening square bracket. After that, \G takes over, anchoring the next match to the position where the previous match ended.
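One consequence of that anchoring is that the loop stops as soon as something other than a pair follows the previous match. The sketch below wraps the same pattern in a hypothetical scan method (the class name ContinuationAnchor is also made up) and feeds it a string with stray text between two pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContinuationAnchor {
    // The pattern from the answer above, unchanged.
    static final Pattern P = Pattern.compile(
        "(?:([^\\s\\[]+)\\[|\\G)([^\\s=]+)=\"([^\"]*)\"[,\\s]*");

    // Collects everything the find() loop sees, in order.
    static List<String> scan(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                found.add("class " + m.group(1));
            }
            found.add(m.group(2) + "=" + m.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        // Well-formed input: every pair starts exactly where the previous match ended.
        System.out.println(scan("org.myobject[key1=\"value1\", key2=\"value2\"]"));
        // [class org.myobject, key1=value1, key2=value2]

        // "junk" breaks the chain: \G no longer lines up with a pair, so key2 is never reached.
        System.out.println(scan("org.myobject[key1=\"value1\" junk key2=\"value2\"]"));
        // [class org.myobject, key1=value1]
    }
}
```

Note that this pattern never checks for the closing square bracket, so it extracts pairs rather than validating the whole string.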
