Help with a better way of parsing digits from a String in Java

Posted 2024-07-24 09:40:17

I have a string which contains digits and letters. I wish to split the string into contiguous chunks of digits and contiguous chunks of letters.

Consider the String "34A312O5M444123A".

I would like to output:
["34", "A", "312", "O", "5", "M", "444123", "A"]

I have code which works and looks like:

List<String> digitsAsElements(String str){
  StringBuilder digitCollector = new StringBuilder();

  List<String> output = new ArrayList<String>();

  for (int i = 0; i < str.length(); i++){
    char cChar = str.charAt(i);

    if (Character.isDigit(cChar))
       digitCollector.append(cChar);
    else{
      output.add(digitCollector.toString());
      output.add(""+cChar);

      digitCollector = new StringBuilder();
    }         
  }

  return output;
}

I considered splitting str twice to get one array containing all the digit chunks and another containing all the letter chunks, then merging the results. I shied away from this as it would harm readability.

I have intentionally avoided solving this with a regex pattern as I find regex patterns to be a major impediment to readability.

  • Debuggers don't handle them well.
  • They interrupt the flow of someone reading source code.
  • Over time, regexes grow organically and become monsters.
  • They are deeply non-intuitive.

My questions are:

  • How could I improve the readability of the above code?
  • Is there a better way to do this? Perhaps a Util class that solves this problem elegantly?
  • Where do you draw the line between using a regex and coding something similar to what I've written above?
  • How do you increase the readability/maintainability of regexes?

Comments (8)

哀由 2024-07-31 09:40:17

For this particular task I'd always use a regex instead of hand-writing something similar. The code you have given above is, at least to me, less readable than a simple regular expression (which would be (\d+|[^\d]+) in this case, as far as I can see).

You may want to avoid writing regular expressions that exceed a few lines. Those can be and usually are unreadable and hard to understand, but so is the code they can be replaced with! Parsers are almost never pretty and you're usually better off reading the original grammar than trying to make sense of the generated (or handwritten) parser. Same goes (imho) for regexes which are just a concise description of a regular grammar.

So, in general I'd say banning regexes in favor of code like you've given in your question sounds like a terribly stupid idea. Regular expressions are just a tool, nothing less, nothing more. If something else does a better job of text parsing (say, a real parser, some substring magic, etc.) then use it. But don't throw away possibilities just because you feel uncomfortable with them: others may have fewer problems coping with them, and all people are able to learn.

EDIT: Updated regex after comment by mmyers.

⊕婉儿 2024-07-31 09:40:17

For a utility class, check out java.util.Scanner. There are a number of options in there as to how you might go about solving your problem. I have a few comments on your questions.

Debuggers don't handle them (regular expressions) well

Whether a regex works or not depends on what's in your data. There are some nice plugins you can use to help you build a regex, like QuickREx for Eclipse. Does a debugger actually help you write the right parser for your data?

They interrupt the flow of someone reading source code.

I guess it depends on how comfortable you are with them. Personally, I'd rather read a reasonable regex than 50 more lines of string parsing code, but maybe that's a personal thing.

Over time, regexes grow organically and become monsters.

I guess they might, but that's probably a problem with the code they live in becoming unfocused. If the complexity of the source data is increasing, you probably need to keep an eye on whether you need a more expressive solution (maybe a parser generator like ANTLR).

They are deeply non-intuitive.

They're a pattern matching language. I would say they're pretty intuitive in that context.

How could I improve the readability of the above code?

Not sure, apart from using a regex.

Is there a better way to do this? A Util class that solves this problem elegantly.

Mentioned above, java.util.Scanner.
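As a rough illustration of the Scanner route, here is a minimal sketch. It assumes the input is a single line, and the class and method names (ScannerChunks, chunks) are made up for this example. Note that it still hands a small pattern to Scanner.findInLine, so it does not avoid regexes entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class ScannerChunks {

    // Hypothetical helper (not from the original post): repeatedly asks the
    // Scanner for the next run of digits or the next run of non-digits.
    static List<String> chunks(String str) {
        List<String> output = new ArrayList<String>();
        Scanner scanner = new Scanner(str);
        String chunk;
        // findInLine ignores delimiters and returns null once nothing is left
        // on the current line (assumption: the input has no line breaks).
        while ((chunk = scanner.findInLine("\\d+|\\D+")) != null) {
            output.add(chunk);
        }
        scanner.close();
        return output;
    }

    public static void main(String[] args) {
        // prints [34, A, 312, O, 5, M, 444123, A]
        System.out.println(chunks("34A312O5M444123A"));
    }
}
```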

Where do you draw the line between using a regex and coding something similar to what I've written above?

Personally I use regex for anything reasonably simple.

How do you increase the readability/maintainability of regexes?

Think carefully before extending, and take extra care to comment the code and the regex in detail so that it's clear what you're doing.
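One way to comment a regex in detail is Java's Pattern.COMMENTS flag, which lets the pattern itself carry whitespace and # comments. The following is only a sketch of that idea; the class name, constant name and chunks method are illustrative and not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentedRegex {

    // With Pattern.COMMENTS, whitespace in the pattern is ignored and
    // everything from '#' to the end of the line is treated as a comment.
    private static final Pattern DIGITS_OR_NON_DIGITS = Pattern.compile(
            "  \\d+   # a run of one or more digits\n"
          + "|        # or\n"
          + "  \\D+   # a run of one or more non-digits\n",
            Pattern.COMMENTS);

    // Hypothetical helper: collects each successive match into a list.
    static List<String> chunks(String str) {
        List<String> output = new ArrayList<String>();
        Matcher matcher = DIGITS_OR_NON_DIGITS.matcher(str);
        while (matcher.find()) {
            output.add(matcher.group());
        }
        return output;
    }

    public static void main(String[] args) {
        // prints [34, A, 312, O, 5, M, 444123, A]
        System.out.println(chunks("34A312O5M444123A"));
    }
}
```

The trade-off is that a literal space or # inside such a pattern has to be escaped, so this style probably pays off mainly for patterns longer than this one.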

阿楠 2024-07-31 09:40:17

Would you be willing to use regexes if it meant solving the problem in one line of code?

// Split at any position that's either:
// preceded by a digit and followed by a non-digit, or
// preceded by a non-digit and followed by a digit.
String[] parts = str.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");

With the comment to explain the regex, I think that's more readable than any of the non-regex solutions (or any of the other regex solutions, for that matter).
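For anyone who wants to try the one-liner, here is a small runnable wrapper around it. The class name, constant name and chunks method are assumptions made for this sketch; only the pattern itself comes from the answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class BoundarySplit {

    // Zero-width boundary between a digit and a non-digit, in either order.
    private static final Pattern DIGIT_NON_DIGIT_BOUNDARY =
            Pattern.compile("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");

    // Hypothetical helper: splits at every digit/non-digit boundary.
    static List<String> chunks(String str) {
        return Arrays.asList(DIGIT_NON_DIGIT_BOUNDARY.split(str));
    }

    public static void main(String[] args) {
        // prints [34, A, 312, O, 5, M, 444123, A]
        System.out.println(chunks("34A312O5M444123A"));
    }
}
```

One behavior worth knowing: for an empty input, split returns a one-element array holding the empty string, so callers that expect an empty list in that case would need an explicit check.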

夏有森光若流苏 2024-07-31 09:40:17

I would use something like this (warning, untested code). For me this is a lot more readable than trying to avoid regexps. Regexps are a great tool when used in the right place.

Commenting methods and providing examples of input and output values in comments also helps.

List<String> digitsAsElements(String str){
    Pattern p = Pattern.compile("(\\d+|\\w+)*");
    Matcher m = p.matcher(str);

    List<String> output = new ArrayList<String>();
    for(int i = 1; i <= m.groupCount(); i++) {
       output.add(m.group(i));
    }
    return output;
}

故事灯 2024-07-31 09:40:17

I'm not overly crazy about regex myself, but this seems like a case where they will really simplify things. What you might want to do is put them into the smallest method you can devise, name it aptly, and then put all the control code in another method.

For instance, if you coded a "Grab block of numbers or letters" method, the caller would be a very simple, straightforward loop just printing the results of each call. The method you were calling would be well-defined, so the intention of the regex would be clear even if you didn't know anything about the syntax, and it would be bounded, so people wouldn't be likely to muck it up over time.

The problem with this is that the regex tools are so simple and well-adapted to this use that it's hard to justify a method call for this.
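A minimal sketch of what that split might look like, assuming a small wrapper class. BlockGrabber and grabNextBlock are names invented for this illustration, and returning null at the end of the input is just one possible convention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlockGrabber {

    // The only place the regex appears: a run of digits or a run of non-digits.
    private static final Pattern BLOCK = Pattern.compile("\\d+|\\D+");

    private final Matcher matcher;

    BlockGrabber(String input) {
        this.matcher = BLOCK.matcher(input);
    }

    /** Returns the next block of digits or non-digits, or null when none is left. */
    String grabNextBlock() {
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // The control code lives in the caller and never touches the regex.
        BlockGrabber grabber = new BlockGrabber("34A312O5M444123A");
        for (String block = grabber.grabNextBlock();
                block != null;
                block = grabber.grabNextBlock()) {
            System.out.println(block);
        }
    }
}
```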

卸妝后依然美 2024-07-31 09:40:17

Since no one seems to have posted correct code yet, I'll give it a shot.

First the non-regex version. Note that I use the StringBuilder for accumulating whichever type of character was seen last (digit or non-digit). If the state changes, I dump its contents into the list and start a new StringBuilder. This way consecutive non-digits are grouped just like consecutive digits are.

static List<String> digitsAsElements(String str) {
    StringBuilder collector = new StringBuilder();

    List<String> output = new ArrayList<String>();
    boolean lastWasDigit = false;
    for (int i = 0; i < str.length(); i++) {
        char cChar = str.charAt(i);

        boolean isDigit = Character.isDigit(cChar);
        if (isDigit != lastWasDigit) {
            if (collector.length() > 0) {
                output.add(collector.toString());
                collector = new StringBuilder();
            }
            lastWasDigit = isDigit;
        }
        collector.append(cChar);
    }
    if (collector.length() > 0)
        output.add(collector.toString());

    return output;
}

Now the regex version. This is basically the same code that was posted by Juha S., but the regex actually works.

private static final Pattern DIGIT_OR_NONDIGIT_STRING =
        Pattern.compile("(\\d+|[^\\d]+)");
static List<String> digitsAsElementsR(String str) {
    // Match a consecutive series of digits or non-digits
    final Matcher matcher = DIGIT_OR_NONDIGIT_STRING.matcher(str);
    final List<String> output = new ArrayList<String>();
    while (matcher.find()) {
        output.add(matcher.group());
    }
    return output;
}

One way I try to keep my regexes readable is their names. I think DIGIT_OR_NONDIGIT_STRING conveys pretty well what I (the programmer) think it does, and testing should make sure that it really does what it's meant to do.

public static void main(String[] args) {
    System.out.println(digitsAsElements( "34A312O5MNI444123A"));
    System.out.println(digitsAsElementsR("34A312O5MNI444123A"));
}

prints:

[34, A, 312, O, 5, MNI, 444123, A]
[34, A, 312, O, 5, MNI, 444123, A]
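To make the "testing should make sure" part concrete, here is one possible self-check without any test framework. It repeats the regex version so that it compiles on its own; the check helper and the extra inputs (empty string, digits only, letters only) are assumptions added for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DigitsAsElementsCheck {

    private static final Pattern DIGIT_OR_NONDIGIT_STRING =
            Pattern.compile("(\\d+|[^\\d]+)");

    // Same regex version as above, repeated so this file stands alone.
    static List<String> digitsAsElementsR(String str) {
        final Matcher matcher = DIGIT_OR_NONDIGIT_STRING.matcher(str);
        final List<String> output = new ArrayList<String>();
        while (matcher.find()) {
            output.add(matcher.group());
        }
        return output;
    }

    // Hypothetical helper: fails loudly when the chunks differ from the expectation.
    static void check(String input, String... expected) {
        List<String> actual = digitsAsElementsR(input);
        if (!actual.equals(Arrays.asList(expected))) {
            throw new AssertionError("\"" + input + "\" -> " + actual);
        }
    }

    public static void main(String[] args) {
        check("34A312O5MNI444123A", "34", "A", "312", "O", "5", "MNI", "444123", "A");
        check("");                 // no chunks at all
        check("123", "123");       // digits only
        check("abc", "abc");       // letters only
        check("7x", "7", "x");
        System.out.println("all checks passed");
    }
}
```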

昵称有卵用 2024-07-31 09:40:17

Awww, someone beat me to code. I think the regex version is easier to read/maintain. Also, note the difference in output between the 2 implementations vs the expected output ...

Output:

digitsAsElements1("34A312O5MNI444123A") = [34, A, 312, O, 5, M, , N, , I, 444123, A]
digitsAsElements2("34A312O5MNI444123A") = [34, A, 312, O, 5, MNI, 444123, A]
Expected: [34, A, 312, O, 5, MNI, 444123, A]

Compare:

DigitsAsElements.java:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitsAsElements {

    // The code from the question, unchanged in behavior: every non-digit
    // flushes the (possibly empty) digit collector and is added on its own.
    static List<String> digitsAsElements1(String str){
        StringBuilder digitCollector = new StringBuilder();

        List<String> output = new ArrayList<String>();

        for (int i = 0; i < str.length(); i++){
            char cChar = str.charAt(i);

            if (Character.isDigit(cChar))
                digitCollector.append(cChar);
            else{
                output.add(digitCollector.toString());
                output.add(""+cChar);

                digitCollector = new StringBuilder();
            }
        }

        return output;
    }

    static List<String> digitsAsElements2(String str){
        // Match a consecutive series of digits or non-digits
        final Pattern pattern = Pattern.compile("(\\d+|\\D+)");
        final Matcher matcher = pattern.matcher(str);

        final List<String> output = new ArrayList<String>();
        while (matcher.find()) {
            output.add(matcher.group());
        }

        return output;
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        System.out.println("digitsAsElements1(\"34A312O5MNI444123A\") = " +
                digitsAsElements1("34A312O5MNI444123A"));
        System.out.println("digitsAsElements2(\"34A312O5MNI444123A\") = " +
                digitsAsElements2("34A312O5MNI444123A"));
        System.out.println("Expected: [" +
                "34, A, 312, O, 5, MNI, 444123, A"+"]");
    }

}

池木 2024-07-31 09:40:17

You could use this class to simplify your loop:

import java.util.Iterator;
import java.util.NoSuchElementException;

public class StringIterator implements Iterator<Character> {

    private final char[] chars;
    private int i;

    private StringIterator(char[] chars) {
        this.chars = chars;
    }

    public boolean hasNext() {
        return i < chars.length;
    }

    public Character next() {
        // Honor the Iterator contract instead of running off the end of the array
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return chars[i++];
    }

    public void remove() {
        throw new UnsupportedOperationException("Not supported.");
    }

    // Wraps a String so it can be used in an enhanced for loop
    public static Iterable<Character> of(String string) {
        final char[] chars = string.toCharArray();

        return new Iterable<Character>() {

            @Override
            public Iterator<Character> iterator() {
                return new StringIterator(chars);
            }
        };
    }
}

Now you can rewrite this:

for (int i = 0; i < str.length(); i++){
    char cChar = str.charAt(i);
    ...
}

with:

for (Character cChar : StringIterator.of(str)) {
    ...
}

my 2 cents

BTW, this class is also reusable in other contexts.
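If a dedicated helper class feels like too much, a similar loop shape is available from the standard library alone: String.toCharArray() returns an array that the enhanced for loop can walk directly. The sketch below only illustrates that loop; CharLoop and countDigits are hypothetical names, and counting digits is just a stand-in for whatever the loop body does.

```java
final class CharLoop {

    // Hypothetical example body: counts the digits while walking the string.
    static int countDigits(String str) {
        int digits = 0;
        // toCharArray() copies the characters, so the loop needs no index variable
        for (char cChar : str.toCharArray()) {
            if (Character.isDigit(cChar)) {
                digits++;
            }
        }
        return digits;
    }

    public static void main(String[] args) {
        // prints 12
        System.out.println(countDigits("34A312O5M444123A"));
    }
}
```

This works on primitive char values, so it also avoids boxing each character into a Character.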
