Creating a Guava Splitter that supports quoted strings

Posted on 2024-11-02 14:10:50, 512 characters, 1 view, 0 comments

I would like to create a Guava Splitter for Java that handles a quoted Java string as one block. For instance, I would like the following assertion to be true:

@Test
public void testSplitter() {
  String toSplit = "a,b,\"c,d\\\"\",e";
  List<String> expected = ImmutableList.of("a", "b", "c,d\"","e");

  Splitter splitter = Splitter.onPattern(...);
  List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));

  assertEquals(expected, actual);
}

I can write a regex that finds all the elements and ignores the ',' but I can't find a regex that would act as a separator for use with a Splitter.

If it's impossible, please just say so, then I'll build the list from the findAll regex.
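
For reference, the find-all fallback mentioned above could look roughly like the sketch below. It uses only java.util.regex, not Guava, and the class name FindAllFields and the exact pattern are illustrative assumptions rather than anything from the original post. It also assumes that empty unquoted fields can be skipped and that a backslash only matters inside quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Guava): collects fields with a find-all regex.
class FindAllFields {
    // Group 1: body of a double-quoted field, where a backslash escapes the next char.
    // Group 2: an unquoted field, i.e. a run of characters without commas or quotes.
    private static final Pattern FIELD =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"|([^,\"]+)");

    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                // Quoted field: drop the backslash in front of each escaped character.
                result.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                result.add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The input from the test above: a,b,"c,d\"",e
        System.out.println(fields("a,b,\"c,d\\\"\",e"));
    }
}
```

Unlike a separator-based split, this approach strips the surrounding quotes and unescapes the content, which is what the expected list in the test asks for.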

Comments (5)

内心旳酸楚 2024-11-09 14:10:50

This seems like a job for a CSV library such as opencsv. Separating values and handling cases like quoted blocks is exactly what such libraries are for.

余生共白头 2024-11-09 14:10:50

I have the same problem (except that I don't need to support escaping of the quote character). I don't like to include another library for such a simple thing. Then it occurred to me that I need a mutable CharMatcher. As with Bart Kiers's solution, it keeps the quote characters.

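public static Splitter quotableComma() {
    return Splitter.on(new CharMatcher() {
        // Mutable state: true while the scan is inside a double-quoted section.
        private boolean inQuotes = false;

        @Override
        public boolean matches(char c) {
            if ('"' == c) {
                inQuotes = !inQuotes;
            }
            if (inQuotes) {
                // Commas inside quotes are not separators.
                return false;
            }
            return (',' == c);
        }
    });
}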

@Test
public void testQuotableComma() throws Exception {
    String toSplit = "a,b,\"c,d\",e";
    List<String> expected = ImmutableList.of("a", "b", "\"c,d\"", "e");
    Splitter splitter = Splitters.quotableComma();
    List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));
    assertEquals(expected, actual);
}
拧巴小姐 2024-11-09 14:10:50

You could split on the following pattern:

\s*,\s*(?=((\\["\\]|[^"\\])*"(\\["\\]|[^"\\])*")*(\\["\\]|[^"\\])*$)

which might look (a bit) friendlier with the (?x) flag:

(?x)            # enable comments, ignore space-literals
\s*,\s*         # match a comma optionally surrounded by space-chars
(?=             # start positive look ahead
  (             #   start group 1
    (           #     start group 2
      \\["\\]   #       match an escaped quote or backslash
      |         #       OR
      [^"\\]    #       match any char other than a quote or backslash
    )*          #     end group 2, and repeat it zero or more times
    "           #     match a quote
    (           #     start group 3
      \\["\\]   #       match an escaped quote or backslash
      |         #       OR
      [^"\\]    #       match any char other than a quote or backslash
    )*          #     end group 3, and repeat it zero or more times
    "           #     match a quote
  )*            #   end group 1, and repeat it zero or more times
  (             #   open group 4
    \\["\\]     #     match an escaped quote or backslash
    |           #     OR
    [^"\\]      #     match any char other than a quote or backslash
  )*            #   end group 4, and repeat it zero or more times
  $             #   match the end-of-input
)               # end positive look ahead

But even in this commented version, it is still a monster. In plain English, this regex could be explained as follows:

Match a comma, optionally surrounded by whitespace, only if the rest of the string after that comma (all the way to the end!) contains zero or an even number of quotes, not counting escaped quotes or escaped backslashes.

So, after seeing this, you might agree with ColinD (I do!) that using some sort of a CSV parser is the way to go in this case.

Note that the regex above will leave the quotes around the tokens, i.e., the string a,b,"c,d\"",e (as a literal: "a,b,\"c,d\\\"\",e") will be split as follows:

a
b
"c,d\""
e
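
As a rough sketch, the pattern can be tried with the JDK alone before wiring it into a Splitter. The class name QuoteAwareSplit and the CH helper constant are illustrative assumptions, and the capturing groups have been swapped for non-capturing ones, which should not change what the pattern matches. The same pattern string can presumably be handed to Splitter.onPattern as well.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class: splits on commas that are outside double quotes.
class QuoteAwareSplit {
    // One "field character": an escaped quote or backslash,
    // or any character other than a quote or backslash.
    private static final String CH = "(?:\\\\[\"\\\\]|[^\"\\\\])";

    // A comma (with optional surrounding whitespace) followed, up to the end
    // of the input, by an even number of unescaped quotes.
    static final Pattern SEPARATOR = Pattern.compile(
            "\\s*,\\s*(?=(?:" + CH + "*\"" + CH + "*\")*" + CH + "*$)");

    static List<String> split(String input) {
        return Arrays.asList(SEPARATOR.split(input));
    }

    public static void main(String[] args) {
        // Prints each token on its own line; the quotes stay around the third one.
        for (String token : split("a,b,\"c,d\\\"\",e")) {
            System.out.println(token);
        }
    }
}
```

Because the lookahead rescans the rest of the input for every comma, this is likely to be slow on long lines, which is one more argument for a real parser.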
写给空气的情书 2024-11-09 14:10:50

Improving on @Rage-Steel's answer a bit.

static final CharMatcher notQuoted = new CharMatcher() {
    // Mutable state: true while the scan is inside a double-quoted section.
    private boolean inQuotes = false;

    @Override
    public boolean matches(char c) {
        if ('"' == c) {
            inQuotes = !inQuotes;
        }
        return !inQuotes;
    }
};

// notQuoted comes first in and() so that it is evaluated for every character.
static final Splitter SPLITTER = Splitter.on(notQuoted.and(CharMatcher.anyOf(" ,;|")))
        .trimResults()
        .omitEmptyStrings();

And then,

public static void main(String[] args) {
    final String toSplit = "a=b c=d,kuku=\"e=f|g=h something=other\"";

    List<String> sputnik = SPLITTER.splitToList(toSplit);
    for (String s : sputnik)
        System.out.println(s);
}

Pay attention to thread safety: the matcher holds mutable state, so there isn't any. An input with an unbalanced quote also leaves the flag set for the next split.
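
If thread safety matters, one option is to keep the quote flag in a local variable instead of inside a shared matcher. The sketch below is a plain-Java stand-in, not a Guava Splitter: the class name QuotedFieldParser is hypothetical, it splits on commas only, and it assumes that a backslash inside quotes escapes the next character. It also strips the quotes, unlike the matcher-based versions above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: all parsing state is local, so concurrent calls do not interfere.
class QuotedFieldParser {
    static List<String> parse(String input) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // local, so nothing leaks between calls
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\' && inQuotes && i + 1 < input.length()) {
                // Escaped character inside quotes: keep it, drop the backslash.
                current.append(input.charAt(++i));
            } else if (c == '"') {
                // Quotes toggle the state and are not copied to the output.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                // Unquoted comma: the current field is complete.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("a,b,\"c,d\\\"\",e"));
    }
}
```

For the input from the original question this yields the four fields a, b, c,d" and e, which matches the expected list in that test.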
