Regex trouble, escaped quotes

Posted on 2024-11-07 13:54:56

Basically, I'm being passed a string and I need to tokenise it in much the same manner as command line options are tokenised by a *nix shell

Say I have the following string

"Hello\" World" "Hello Universe" Hi

How could I turn it into a 3-element list?

  • Hello" World
  • Hello Universe
  • Hi

The following is my first attempt, but it's got a number of problems

  • It leaves the quote characters
  • It doesn't catch the escaped quote

Code:

public void test() {
    String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
    List<String> list = split(str);
}

public static List<String> split(String str) {
    Pattern pattern = Pattern.compile(
        "\"[^\"]*\"" + /* double quoted token*/
        "|'[^']*'" + /*single quoted token*/
        "|[A-Za-z']+" /*everything else*/
    );

    List<String> opts = new ArrayList<String>();
    Scanner scanner = new Scanner(str).useDelimiter(pattern);

    String token;
    while ((token = scanner.findInLine(pattern)) != null) {
        opts.add(token);
    }
    return opts;
}

The code above produces this incorrect output:

  • "Hello\"
  • World
  • " "
  • Hello
  • Universe
  • Hi

EDIT: I'm totally open to a non-regex solution. Regex was just the first approach that came to mind.

Comments (5)

葬心 2024-11-14 13:54:56

If you decide to forgo regex and do parsing instead, there are a couple of options. If you are willing to have just a double quote or a single quote (but not both) as your quote character, you can use StreamTokenizer to solve this easily:

public static List<String> tokenize(String s) throws IOException {
    List<String> opts = new ArrayList<String>();
    StreamTokenizer st = new StreamTokenizer(new StringReader(s));
    st.quoteChar('\"');
    while (st.nextToken() != StreamTokenizer.TT_EOF) {
        opts.add(st.sval);
    }

    return opts;
}
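One thing worth knowing before relying on the snippet above: a freshly constructed StreamTokenizer starts with a default syntax table that parses numbers (so a token like 42 comes back as TT_NUMBER with sval == null), treats '/' as a comment character, and already registers both '"' and '\'' as quote characters. The following is only a sketch of one way to reset those defaults so that every non-whitespace run is returned as a word. The class name StreamTokenizerSketch is made up for illustration, and wrapping IOException in UncheckedIOException is my choice, not part of the answer.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is an assumption for this sketch.
final class StreamTokenizerSketch {
    static List<String> tokenize(String s) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        st.resetSyntax();            // drop defaults: number parsing, '/' comments
        st.wordChars(33, 255);       // treat every printable char as part of a word
        st.whitespaceChars(0, ' ');  // control chars and space separate tokens
        st.quoteChar('"');           // re-register both quote characters
        st.quoteChar('\'');
        List<String> opts = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                opts.add(st.sval);   // set for both word tokens and quoted strings
            }
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return opts;
    }
}
```

Inside quoted strings StreamTokenizer interprets C-style backslash escapes itself, so \" becomes a plain quote. Outside quotes, a backslash is just another word character in this configuration, and a bare apostrophe in an unquoted word starts a quoted string.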

If you must support both quotes, here is a naive implementation that should work (caveat that a string like '"blah \" blah"blah' will yield something like 'blah " blahblah'. If that isn't OK, you will need to make some changes):

   public static List<String> splitSSV(String in) throws IOException {
        ArrayList<String> out = new ArrayList<String>();

        StringReader r = new StringReader(in);
        StringBuilder b = new StringBuilder();
        int inQuote = -1;
        boolean escape = false;
        int c;
        // read each character
        while ((c = r.read()) != -1) {
            if (escape) {  // if the previous char is escape, add the current char
                b.append((char)c);
                escape = false;
                continue;
            }
            switch (c) {
            case '\\':   // deal with escape char
                escape = true;
                break;
            case '\"':
            case '\'':  // deal with quote chars
                if (inQuote == -1) {  // not in a quote
                    inQuote = c;  // now we are
                } else if (inQuote == c) {
                    inQuote = -1;  // the matching quote closes the quoted run
                } else {
                    b.append((char)c);  // the other kind of quote is literal inside quotes
                }
                break;
            case ' ':
                if (inQuote != -1) {
                    b.append((char)c); // inside quotes: the space belongs to the token
                } else if (b.length() > 0) {  // outside quotes: end the token, skipping runs of spaces
                    out.add(b.toString());
                    b.setLength(0);
                }
                break;
            default:
                b.append((char)c);  // append all other chars to current token
            }
        }
        if (b.length() > 0) {
            out.add(b.toString()); // add final token to list
        }
        return out;
    }
夜空下最亮的亮点 2024-11-14 13:54:56

I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

There will be open source parsers which can do what you want, although I don't know of any. You should also check out the StreamTokenizer class.

窝囊感情。 2024-11-14 13:54:56

To recap, you want to split on whitespace, except when surrounded by double quotes, which are not preceded by a backslash.

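Step 1: tokenize the input: /([ \t]+)|(\\")|(")|([^ \t"\\]+|\\)/

(TEXT has to exclude the backslash; otherwise a run such as Hello\ swallows it and ESCAPED_QUOTE never matches. A backslash that is not followed by a quote is matched on its own as TEXT.)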

This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens.

Step 2: build a finite state machine matching and reacting to the tokens:

State: START

  • SPACE -> return empty string
  • ESCAPED_QUOTE -> Error (?)
  • QUOTE -> State := WITHIN_QUOTES
  • TEXT -> return text

State: WITHIN_QUOTES

  • SPACE -> add value to accumulator
  • ESCAPED_QUOTE -> add quote to accumulator
  • QUOTE -> return and clear accumulator; State := START
  • TEXT -> add text to accumulator

Step 3: Profit!!
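The two steps above can be sketched in Java roughly as follows. This is an illustration under a few assumptions rather than a definitive implementation: the class name QuoteStateMachine is made up, the TEXT alternative excludes the backslash so that \" is seen as a single ESCAPED_QUOTE token, "SPACE -> return empty string" in the START state is read as "emit nothing", and the "Error (?)" case is turned into an IllegalArgumentException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch of the tokenizer plus two-state machine.
final class QuoteStateMachine {
    // Groups: 1 = SPACE, 2 = ESCAPED_QUOTE, 3 = QUOTE, 4 = TEXT.
    // Assumption: TEXT excludes '\' so that \" is tokenized as ESCAPED_QUOTE;
    // a backslash not followed by a quote is matched on its own as TEXT.
    private static final Pattern TOKEN =
        Pattern.compile("([ \\t]+)|(\\\\\")|(\")|([^ \\t\"\\\\]+|\\\\)");

    static List<String> split(String input) {
        List<String> out = new ArrayList<>();
        StringBuilder acc = new StringBuilder();
        boolean withinQuotes = false;        // false = START, true = WITHIN_QUOTES
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (withinQuotes) {
                if (m.group(3) != null) {        // QUOTE: return and clear accumulator
                    out.add(acc.toString());
                    acc.setLength(0);
                    withinQuotes = false;
                } else if (m.group(2) != null) { // ESCAPED_QUOTE: add a bare quote
                    acc.append('"');
                } else {                         // SPACE or TEXT: add verbatim
                    acc.append(m.group());
                }
            } else {
                if (m.group(3) != null) {        // QUOTE: enter WITHIN_QUOTES
                    withinQuotes = true;
                } else if (m.group(2) != null) { // ESCAPED_QUOTE outside quotes
                    throw new IllegalArgumentException(
                        "escaped quote outside quotes at index " + m.start());
                } else if (m.group(4) != null) { // TEXT: return it as a token
                    out.add(m.group(4));
                }                                // SPACE: separator, nothing emitted
            }
        }
        if (withinQuotes) {
            throw new IllegalArgumentException("unterminated quote");
        }
        return out;
    }
}
```

Like the state table it follows, this sketch only knows double quotes, and it treats a quoted run that touches unquoted text (for example "foo"bar) as two separate tokens.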

枯叶蝶 2024-11-14 13:54:56

I think if you use pattern like this:

Pattern pattern = Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");

Then it will give you desired output. When I ran with your input data I got this list:

["Hello\" World", "Hello Universe", Hi]

I used [A-Za-z']+ from your own question, but shouldn't it be just [A-Za-z]+?

EDIT

Change your opts.add(token); line to:

opts.add(token.replaceAll("^\"|\"$|^'|'$", ""));
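Putting the pattern and the replaceAll call together, a complete method might look like the sketch below. It uses a plain Matcher loop instead of the Scanner from the question, and it adds one step that is not in this answer: turning \" and \' back into bare quotes, since the question asks for Hello" World without the backslash. The class name LookbehindSketch is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch built around the lookbehind pattern above.
final class LookbehindSketch {
    private static final Pattern TOKEN =
        Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");

    static List<String> split(String str) {
        List<String> opts = new ArrayList<>();
        Matcher m = TOKEN.matcher(str);
        while (m.find()) {
            String token = m.group()
                .replaceAll("^\"|\"$|^'|'$", "") // strip the enclosing quotes
                .replace("\\\"", "\"")           // assumption: unescape \" ...
                .replace("\\'", "'");            // ... and \' (not in the answer)
            opts.add(token);
        }
        return opts;
    }
}
```

A limitation to keep in mind: the lookbehind only checks the single character before a quote, so a quoted token that ends in an escaped backslash, such as "a\\", is not recognised as closed at that quote.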
还如梦归 2024-11-14 13:54:56

The first thing you need to do is stop thinking of the job in terms of split(). split() is meant for breaking down simple strings like this/that/the other, where / is always a delimiter. But you're trying to split on whitespace, unless the whitespace is within quotes, except if the quotes are escaped with backslashes (and if backslashes escape quotes, they probably escape other things, like other backslashes).

With all those exceptions-to-exceptions, it's just not possible to create a regex to match all possible delimiters, not even with fancy gimmicks like lookarounds, conditionals, reluctant and possessive quantifiers. What you want to do is match the tokens, not the delimiters.

In the following code, a token that's enclosed in double-quotes or single-quotes may contain whitespace as well as the quote character if it's preceded by a backslash. Everything except the enclosing quotes is captured in group #1 (for double-quoted tokens) or group #2 (single-quoted). Any character may be escaped with a backslash, even in non-quoted tokens; the "escaping" backslashes are removed in a separate step.

public static void test()
{
  String str = "\"Hello\\\" World\" 'Hello Universe' Hi";
  List<String> commands = parseCommands(str);
  for (String s : commands)
  {
    System.out.println(s);
  }
}

public static List<String> parseCommands(String s)
{
  String rgx = "\"((?:[^\"\\\\]++|\\\\.)*+)\""  // double-quoted
             + "|'((?:[^'\\\\]++|\\\\.)*+)'"    // single-quoted
             + "|\\S+";                         // not quoted
  Pattern p = Pattern.compile(rgx);
  Matcher m = p.matcher(s);
  List<String> commands = new ArrayList<String>();
  while (m.find())
  {
    String cmd = m.start(1) != -1 ? m.group(1) // strip double-quotes
               : m.start(2) != -1 ? m.group(2) // strip single-quotes
               : m.group();
    cmd = cmd.replaceAll("\\\\(.)", "$1");  // remove escape characters
    commands.add(cmd);
  }
  return commands;
}

Output:

Hello" World
Hello Universe
Hi

This is about as simple as it gets for a regex-based solution--and it doesn't really deal with malformed input, like unbalanced quotes. If you're not fluent in regexes, you might be better off with a purely hand-coded solution or, even better, a dedicated command-line interpreter (CLI) library.
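For the unbalanced-quote case specifically, one cheap check is possible with the same regex: when neither quoted alternative matches, the \S+ fallback picks up the stray quote, so a fallback token that begins with a quote character signals a quote that was never closed. The sketch below assumes that rejecting such input with an exception is acceptable; the names StrictCommands and parseStrict are made up for illustration, and this is not a full validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; same regex as parseCommands, plus one sanity check.
final class StrictCommands {
    private static final Pattern TOKEN = Pattern.compile(
          "\"((?:[^\"\\\\]++|\\\\.)*+)\""  // double-quoted
        + "|'((?:[^'\\\\]++|\\\\.)*+)'"    // single-quoted
        + "|\\S+");                        // not quoted

    static List<String> parseStrict(String s) {
        List<String> commands = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            String cmd;
            if (m.start(1) != -1) {
                cmd = m.group(1);          // strip double-quotes
            } else if (m.start(2) != -1) {
                cmd = m.group(2);          // strip single-quotes
            } else {
                cmd = m.group();
                char first = cmd.charAt(0);
                // The fallback matched something starting with a quote, which
                // means the quoted alternatives failed: no closing quote found.
                if (first == '"' || first == '\'') {
                    throw new IllegalArgumentException(
                        "unbalanced quote at index " + m.start());
                }
            }
            commands.add(cmd.replaceAll("\\\\(.)", "$1")); // remove escape characters
        }
        return commands;
    }
}
```

This only catches a quote that opens a token and never closes. A stray quote in the middle of an unquoted token (abc"def) still passes through unchanged.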
