Java delimiters when reading a text file: regex or not?

Posted on 2024-12-13 12:33:41

I am trying to read a text file written in this form:

    AB523:[joe, pierre][charlie][dogs,cat]
    ZZ883:[ronald, zigomarre][pele]

I would like to build my own structure from it and retrieve each piece of information separately:

AB523 --- on its own
joe, pierre --- on its own
charlie --- on its own
dogs,cat --- on its own

I am not sure which technique is best here. I've tried StringTokenizer and played with regex, but I can't get it right.

Do you have any solution or suggestion?

What is the convention when writing to a text file? What are the best practices for delimiters?

EDIT: The text file is also generated by me, so I have control over the overall pattern. What would be the best output pattern to reduce the amount of work when re-reading it?


甜柠檬 2024-12-20 12:33:41

I would use regular expressions here, because it seems like less code to maintain, and your language is certainly regular. Pair them with a java.util.Scanner instance for more efficiency. Here's some code:

    import java.io.Reader;
    import java.io.StringReader;
    import java.util.Scanner;
    import java.util.regex.Pattern;

    public class ScannerTest {

        // Matches the key together with the ':' that ends it.
        private static final Pattern header = Pattern.compile("(.*):");
        // Matches one bracketed group, e.g. "[joe, pierre]".
        private static final Pattern names = Pattern.compile("\\[([^\\]]+)\\]");

        public static void main(String[] args) {

            Reader reader = new StringReader(
                    "AB523:[joe, pierre][charlie][dogs,cat]\n"
                            + "ZZ883:[ronald, zigomarre][pele]");

            Scanner scanner = new Scanner(reader);
            scanner.useDelimiter("\n");

            while (scanner.hasNext()) {
                String h = scanner.findInLine(header);
                // findInLine returns null when the line has no "key:" part.
                if (h != null) {
                    // Substring removes trailing ':'.
                    System.out.println(h.substring(0, h.length() - 1));
                }

                String n;
                while ((n = scanner.findInLine(names)) != null) {
                    // Substring removes '[' and ']'.
                    System.out.println(n.substring(1, n.length() - 1));
                }

                // Skip the rest of the line, including its terminator.
                if (scanner.hasNext()) {
                    scanner.nextLine();
                }
            }
        }
    }

Nevertheless, I couldn't manage to remove the substring invocations, and maybe that hides some inefficiency. My guess is that it doesn't matter much: strings are immutable, so older JVMs could share the backing array, and although substring has copied the characters since Java 7u6, the copies here are tiny.
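One possible way to drop the substring calls is to read the capturing groups through Scanner.match(), which exposes the MatchResult of the last successful findInLine. The following is only a sketch under that assumption; the class name ScannerGroups and the parse helper are made up for illustration, and the header pattern is tightened to `([^:]*):` so the key is captured without its colon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerGroups {

    // Group 1 is the key without the trailing ':'.
    private static final Pattern HEADER = Pattern.compile("([^:]*):");
    // Group 1 is the text between '[' and ']'.
    private static final Pattern NAMES = Pattern.compile("\\[([^\\]]*)\\]");

    /** Returns, for every input line, the key followed by each bracketed group. */
    public static List<String> parse(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNextLine()) {
                if (scanner.findInLine(HEADER) != null) {
                    // match() gives access to the groups of the last match.
                    out.add(scanner.match().group(1));
                    while (scanner.findInLine(NAMES) != null) {
                        out.add(scanner.match().group(1));
                    }
                }
                // Consume the rest of the line if anything is left;
                // calling nextLine() at end of input would throw.
                if (scanner.hasNextLine()) {
                    scanner.nextLine();
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "AB523:[joe, pierre][charlie][dogs,cat]\n"
                + "ZZ883:[ronald, zigomarre][pele]";
        parse(input).forEach(System.out::println);
    }
}
```

Whether this is measurably faster than the substring version is not something this sketch demonstrates; it mainly keeps the bracket and colon handling inside the patterns.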

EDIT: For better performance I would also consider a handcrafted recursive descent parser.
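As a rough idea of what a hand-written parser for this format could look like, here is a minimal sketch in the recursive-descent style. The grammar assumed is `line := key ':' ('[' text ']')*`; since nothing nests, no actual recursion is needed. The class name LineParser and its helpers are hypothetical, and no performance claim is made for it.

```java
import java.util.ArrayList;
import java.util.List;

public class LineParser {

    private final String s;
    private int pos = 0;

    private LineParser(String s) {
        this.s = s;
    }

    /** Parses one line into its key followed by the text of each group. */
    public static List<String> parse(String line) {
        LineParser p = new LineParser(line);
        List<String> out = new ArrayList<>();
        out.add(p.until(':'));            // key
        while (p.pos < p.s.length()) {    // zero or more groups
            p.expect('[');
            out.add(p.until(']'));
        }
        return out;
    }

    /** Returns the text up to the next 'end' character and consumes that character. */
    private String until(char end) {
        int i = s.indexOf(end, pos);
        if (i < 0) {
            throw new IllegalArgumentException(
                    "expected '" + end + "' after position " + pos);
        }
        String text = s.substring(pos, i);
        pos = i + 1;
        return text;
    }

    /** Consumes exactly the character 'c' or fails. */
    private void expect(char c) {
        if (pos >= s.length() || s.charAt(pos) != c) {
            throw new IllegalArgumentException(
                    "expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(parse("AB523:[joe, pierre][charlie][dogs,cat]"));
    }
}
```

Unlike the regex version, this one rejects malformed lines with an exception instead of silently skipping them, which may or may not be what you want.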


幽梦紫曦～ 2024-12-20 12:33:41

Use the String#split or Pattern#split method.
For example:

    String[] list = "AB523:[joe, pierre][charlie][dogs,cat]".split("[:\\[\\]]+");
    for (String s : list)
        System.out.println(s);
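To go from the flat array to the key and the individual names, the same split can be followed by a second split on commas. This is a sketch only; the class SplitLine and its two helpers are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitLine {

    /** Splits one line into its key followed by its bracketed groups. */
    public static List<String> fields(String line) {
        // Any run of ':', '[' or ']' counts as a single separator.
        return Arrays.asList(line.split("[:\\[\\]]+"));
    }

    /** Splits one group such as "joe, pierre" into trimmed names. */
    public static List<String> names(String group) {
        List<String> out = new ArrayList<>();
        for (String name : group.split(",")) {
            out.add(name.trim());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> f = fields("AB523:[joe, pierre][charlie][dogs,cat]");
        System.out.println("key = " + f.get(0));
        for (String group : f.subList(1, f.size())) {
            System.out.println(names(group));
        }
    }
}
```

One caveat of this approach: because a run of delimiters is treated as one separator, an empty group such as `[]` disappears from the result instead of showing up as an empty field.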


坐在坟头思考人生 2024-12-20 12:33:41

Single-character delimiters are easy to split on: String.split() takes a regular expression, so it will split on a single character or a literal string as long as any regex metacharacters are escaped. It does essentially what a StringTokenizer does, but with a cleaner syntax. That is, String[] items = myString.split(",") looks much cleaner than

StringTokenizer st = new StringTokenizer(myString, ",");
while (st.hasMoreTokens()) {
    myList.add(st.nextToken());
}

(What I'm saying is: use split in the future.)

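However, it looks like you're in a slightly more complicated situation, where you need to get the stuff bordered on the left by [ and on the right by ]. This calls for a regex with capturing groups, something like /\[([^\]]*)\]/. (A greedy /\[(.*)\]/ would match from the first [ to the last ] on the line, swallowing every group in one match.)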
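In Java, capturing groups are read through java.util.regex.Matcher. The sketch below assumes the key is everything before the first colon; the class name BracketGroups and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketGroups {

    // Group 1 is whatever sits between one '[' and the next ']'.
    private static final Pattern GROUP = Pattern.compile("\\[([^\\]]*)\\]");

    /** Returns the text of every bracketed group on the line, in order. */
    public static List<String> groups(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = GROUP.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    /** Returns the part of the line before the first ':' (the whole line if there is none). */
    public static String key(String line) {
        int colon = line.indexOf(':');
        return colon < 0 ? line : line.substring(0, colon);
    }

    public static void main(String[] args) {
        String line = "AB523:[joe, pierre][charlie][dogs,cat]";
        System.out.println(key(line));
        System.out.println(groups(line));
    }
}
```

Because each match is anchored on its own pair of brackets, an empty group `[]` is kept as an empty string rather than lost.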

CSV (comma-separated values) is common for simple tabular data, and the format is even standardized to a degree. If you want to represent more complicated objects, then you can use JSON or an XML-based format such as SOAP. If you're only using the storage for Java, take a look at Java's built-in serialization features.

Since you're using it locally, and probably you're saving some sort of Java object to represent it, one way would be to implement Serializable in whatever object is representing your data.
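A minimal sketch of that Serializable route, assuming a hypothetical Entry class that holds one key and its groups of names. It round-trips through a byte array to stay self-contained; for a real file, FileOutputStream and FileInputStream could be substituted for the byte-array streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SerializeDemo {

    /** Hypothetical record type: one key plus its groups of names. */
    public static class Entry implements Serializable {
        private static final long serialVersionUID = 1L;

        public final String key;
        public final ArrayList<ArrayList<String>> groups = new ArrayList<>();

        public Entry(String key, List<List<String>> groups) {
            this.key = key;
            // Copy into ArrayLists so every nested list is known to be serializable.
            for (List<String> g : groups) {
                this.groups.add(new ArrayList<>(g));
            }
        }
    }

    /** Serializes an entry with Java's built-in object serialization. */
    public static byte[] toBytes(Entry e) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(e);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    /** Reads back an entry written by toBytes. */
    public static Entry fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Entry) in.readObject();
        } catch (IOException | ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Entry e = new Entry("AB523",
                List.of(List.of("joe", "pierre"), List.of("charlie"), List.of("dogs", "cat")));
        Entry copy = fromBytes(toBytes(e));
        System.out.println(copy.key + " " + copy.groups);
    }
}
```

The trade-off is that the result is a binary format only Java can read, and it is tied to the class layout, so it suits local storage better than data exchange.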

If you don't like that, I'd go with JSON because it looks like you're doing some sort of tree structure.


雾里花 2024-12-20 12:33:41

Since you have control over the file format, I'd suggest tab-delimited. Lots of other programs (e.g. Excel) will read tab-delimited files. The file would look like the following (\t represents the tab character):

AB523\tjoe, pierre\tcharlie\tdogs,cat
ZZ883\tronald, zigomarre\tpele

Note: simply splitting on commas, as in the other common format (CSV), won't work here, because the comma is a legal character in your strings; you would need quoted fields. Likewise, tab-delimited will have issues if the tab character is a legal character in your strings.

Like the others suggest, String.split() is a good way to parse the file.
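A small sketch of writing and re-reading one tab-delimited line, under the assumption stated above that fields never contain a tab or a newline. The class TabDelimited and its methods are illustrative names only.

```java
import java.util.Arrays;
import java.util.List;

public class TabDelimited {

    /** Joins the fields with tabs, refusing fields that would corrupt the format. */
    public static String toLine(List<String> fields) {
        for (String field : fields) {
            if (field.indexOf('\t') >= 0 || field.indexOf('\n') >= 0) {
                throw new IllegalArgumentException("field contains a tab or newline: " + field);
            }
        }
        return String.join("\t", fields);
    }

    /** Splits a line back into fields; limit -1 keeps trailing empty fields. */
    public static List<String> fromLine(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        String line = toLine(List.of("AB523", "joe, pierre", "charlie", "dogs,cat"));
        for (String field : fromLine(line)) {
            System.out.println(field);
        }
    }
}
```

Passing -1 as the limit matters because the default split drops trailing empty strings, which would make a line ending in empty fields read back shorter than it was written.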
