Best way to parse text files that contain multiple delimiters?

Posted on 2024-08-10 18:24:54

I need to parse some text files that have different types of delimiters (tildes, spaces, commas, pipes, caret characters).

The order of the elements also differs depending on what the delimiter is, e.g.:

comma: A, B, C, D, E
caret: B, C, A, E, D
tilde: C, A, B, D, E 

The delimiter is the same within the file but different from one file to another. From what I can tell, there are no delimiters within the data elements.

What's a good approach to do this in plain ol' Java?

Comments (10)

她说她爱他 2024-08-17 18:24:54

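I like to read the first two lines of a file and then test the delimiters. If you split on a delimiter and both lines return the same number of pieces (more than one), then you've probably guessed the correct one. Here's an example program which checks the file names.txt.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class DelimiterCheck {
    private static final String[] DELIMS = new String[] { "\t", ",", " " };

    public static void main(String[] args) throws IOException {
        File file = new File("etc/names.txt");

        String delim = getDelimiter(file);
        System.out.println("Delim is " + delim + " (" + (int) delim.charAt(0) + ")");
    }

    private static String getDelimiter(File file) throws IOException {
        String first;
        String second;
        // Read the two sample lines once instead of reopening the file per candidate.
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            first = br.readLine();
            second = br.readLine();
        }
        if (first == null || second == null) {
            throw new IllegalStateException("Need at least two lines in file " + file);
        }
        for (String delim : DELIMS) {
            // String.split takes a regex; quoting keeps candidates such as | or ^ literal.
            String regex = Pattern.quote(delim);
            String[] line0 = first.split(regex);
            String[] line1 = second.split(regex);
            if (line0.length == line1.length && line0.length > 1) {
                return delim;
            }
        }
        throw new IllegalStateException("Failed to find delimiter for file " + file);
    }
}

吻安 2024-08-17 18:24:54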

I might start by playing with Java's StringTokenizer. This takes a string, and lets you find each token that is separated by a delimiter.

Here is one example from the net.

But you want to tokenize things from a file. In that case, you might want to play with Java's StreamTokenizer, which lets you parse input from a file stream.
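
For what it's worth, here is a minimal sketch of the StreamTokenizer idea, assuming the delimiter is already known and that fields are never empty or quoted (StreamTokenizer silently skips runs of separators). The class and method names (StreamTokenizerSketch, tokens) are made up for illustration; only the java.io.StreamTokenizer calls are real API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class StreamTokenizerSketch {
    // Reads every field from the stream, treating the given delimiter,
    // spaces and line breaks as separators.
    static List<String> tokens(Reader in, char delimiter) {
        StreamTokenizer st = new StreamTokenizer(in);
        st.resetSyntax();                          // forget the default number/comment/quote rules
        st.wordChars(0x21, 0xFF);                  // printable characters are part of a field
        st.whitespaceChars(0, ' ');                // control characters, line breaks and spaces separate fields
        st.whitespaceChars(delimiter, delimiter);  // so does the file's delimiter
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Calling tokens(new FileReader(file), '~') would hand back all fields of a tilde-delimited file as one flat list; if you need to know where each line ends, reading line by line and splitting is probably simpler.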

edit

If you don't know the delimiters in advance, you could do a few things:

  1. Delimit based on all possible delimiters. If your data itself doesn't contain any of the delimiter characters, then this would work (i.e., look for both "," and ";", provided that your data itself doesn't have either of those characters).
  2. If you have an idea of what your data is supposed to look like (supposed to be integers, or supposed to be single characters), then your code could try different delimiters (try "," first, then try ";", etc.) until it parses a line of text "correctly".
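
Both options can be sketched roughly as follows. This is only an illustration under the question's assumption that no delimiter character appears inside the data; the names DelimiterGuess, splitOnAny and firstDelimiterThatParses are hypothetical, and the "looks right" test is whatever check fits your data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class DelimiterGuess {
    // Candidates from the question. Space goes last, because a line such as
    // "A, B, C" also splits into several pieces on spaces.
    static final List<String> CANDIDATES = List.of("~", ",", "|", "^", " ");

    // Option 1: treat any run of candidate characters (or whitespace) as one separator.
    static String[] splitOnAny(String line) {
        return line.trim().split("[~,|^\\s]+");
    }

    // Option 2: try each candidate until the caller's check accepts the pieces.
    // Returns null when no candidate produces an acceptable split.
    static String firstDelimiterThatParses(String line, Predicate<String[]> looksRight) {
        for (String delim : CANDIDATES) {
            String[] parts = Arrays.stream(line.split(Pattern.quote(delim)))
                    .map(String::trim)
                    .toArray(String[]::new);
            if (looksRight.test(parts)) {
                return delim;
            }
        }
        return null;
    }
}
```

For the files in the question, a check such as parts -> parts.length == 5 might be enough. Option 1 is shorter, but it throws away the information about which delimiter was used, which matters here because the field order depends on it.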

牵你手 2024-08-17 18:24:54

If it's the same delimiter throughout the file, write a function for one delimiter, call it d, and when handling other files, replace their delimiter with d. Rinse. Repeat. :)
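
A rough sketch of that normalize-then-parse idea, assuming (as the question states) that no delimiter character occurs inside the data. The names DelimiterNormalizer, normalize and parse are made up, and picking the comma as d is arbitrary.

```java
final class DelimiterNormalizer {
    // The one delimiter the parser understands ("d" in the text above).
    static final char D = ',';

    // Rewrites a line that uses the file's own delimiter so that it uses D instead.
    static String normalize(String line, char fileDelimiter) {
        return line.replace(fileDelimiter, D);
    }

    // The single parsing function: splits on D and trims each field.
    static String[] parse(String normalizedLine) {
        String[] parts = normalizedLine.split(String.valueOf(D), -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }
}
```

Note that this only unifies the delimiter; the fields still arrive in the file-specific order, so that has to be handled separately.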

Another approach: have your parsing function accept a file name and a delimiter as parameters. This assumes the parsing logic is the same for all files.

If your files look completely different, then delimiters are the least of your problems.

南城追梦 2024-08-17 18:24:54

If it's the same delimiter throughout the file, then you can probably pass the delimiter in when you load the file to parse.

For example:

    void someFunction(char delimiter) {
        // Do whatever you want to do with the file here;
        // you can use StringTokenizer for this purpose.
    }

Each time you load a file, you can call this function with that file's delimiter as the argument.

Hope this helps. :-)

水水月牙 2024-08-17 18:24:54

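You could write a class that parses a file, something like this (shown as an abstract class, because a Java interface cannot declare a constructor):

import java.io.InputStream;
import java.util.List;
import java.util.Map;

abstract class MyParser {
  protected final char delimiter;
  protected final List<String> fields;
  public MyParser(char delimiter, List<String> fields) { this.delimiter = delimiter; this.fields = fields; }
  abstract Map<String,String> ParseFile(InputStream file);
}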

You'd pass the delimiter and an ordered list of fields to the constructor, then ask it to parse a file. You'd get back a map of field names (from the ordered list) to values.

The implementation of ParseFile would probably use split with the delimiter and then iterate through the array returned by split and the list of fields concurrently, creating the map as it went.
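
One way that implementation might look, as a sketch only: it works on lines that have already been read rather than on an InputStream, and the names FieldMappingParser, parseLine and parseLines are hypothetical. It assumes every line has exactly one value per field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class FieldMappingParser {
    private final String delimiterRegex;
    private final List<String> fields;

    FieldMappingParser(char delimiter, List<String> fields) {
        // String.split takes a regex, so quote the delimiter (| and ^ are regex syntax).
        this.delimiterRegex = Pattern.quote(String.valueOf(delimiter));
        this.fields = List.copyOf(fields);
    }

    // Walks the split values and the ordered field names together, building name -> value.
    Map<String, String> parseLine(String line) {
        String[] values = line.split(delimiterRegex, -1);
        if (values.length != fields.size()) {
            throw new IllegalArgumentException(
                    "Expected " + fields.size() + " fields but got " + values.length + ": " + line);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            record.put(fields.get(i), values[i].trim());
        }
        return record;
    }

    // One map per line, in file order.
    List<Map<String, String>> parseLines(List<String> lines) {
        List<Map<String, String>> records = new ArrayList<>();
        for (String line : lines) {
            records.add(parseLine(line));
        }
        return records;
    }
}
```

For the caret files from the question you would construct it as new FieldMappingParser('^', List.of("B", "C", "A", "E", "D")) and then look values up by name, regardless of the order they had in the file.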

欢烬 2024-08-17 18:24:54

One possible approach is to use the Java Compiler Compiler (https://javacc.dev.java.net/). With this you can write a set of rules for what you will accept and what delimiters might appear at any one time. The engine can be given rules to work around order issues depending on the delimiter in use. And the file could, if necessary, switch delimiters along the way.

玩套路吗 2024-08-17 18:24:54

If the exact order of the fields is known when a specific delimiter is used, I'd just create a parser that returns a Record object for each line, something like below.

This does include a lot of hard-coded values, but I'm not sure how flexible you need this to be. I would consider this more of a scripty/hacky solution rather than something you could extend. If you don't know the delimiters, you could test the first line of the file by using the String.split() method and see if the number of columns matches the expected count.

    import java.util.StringTokenizer;

    class MyParser
    {
        public static Record parseLine(String line, char delimiter)
        {
            // StringTokenizer takes its delimiters as a String, not a char
            StringTokenizer st1 = new StringTokenizer(line, String.valueOf(delimiter));
            // You could easily use an array instead of these dumb variables
            String temp1, temp2, temp3, temp4, temp5;

            temp1 = st1.nextToken();
            // .. etc ..

            Record ret = new Record();
            switch (delimiter)
            {
                case '^':
                    ret.A = temp2;
                    ret.B = temp3;
                    // ...etc...
                    break;
                case '~':
                    // ...etc...
                    break;
            }
            return ret;
        }
    }

    class Record
    {
        String A;
        String B;
        String C;
        String D;
        String E;
    }
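
A table-driven variant of the same idea, offered as a sketch rather than a drop-in replacement: the per-delimiter orders are copied from the question (it gives none for space or pipe, so those are left out), the column-count check mentioned above is folded in, and the names CanonicalOrder, ORDER and toCanonical are invented for the example.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class CanonicalOrder {
    // Field order per delimiter, as listed in the question.
    static final Map<Character, String> ORDER = Map.of(
            ',', "ABCDE",
            '^', "BCAED",
            '~', "CABDE");

    // Returns the values of one line rearranged into A, B, C, D, E order.
    static String[] toCanonical(String line, char delimiter) {
        String order = ORDER.get(delimiter);
        if (order == null) {
            throw new IllegalArgumentException("No field order known for delimiter '" + delimiter + "'");
        }
        String[] values = line.split(Pattern.quote(String.valueOf(delimiter)), -1);
        if (values.length != order.length()) {
            throw new IllegalArgumentException(
                    "Expected " + order.length() + " columns but got " + values.length);
        }
        String[] out = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            // order.charAt(i) names the field stored in column i; 'A' maps to slot 0.
            out[order.charAt(i) - 'A'] = values[i].trim();
        }
        return out;
    }
}
```

Adding another file format then means adding one entry to the table instead of another case to the switch.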

無心 2024-08-17 18:24:54

You can use the StringTokenizer as mentioned earlier. Yes, you will need to specify a string containing all the possible delimiters. Don't forget to pass true for the tokenizer's "returnDelims" constructor argument. That way you will know which delimiter is used in the file and can then parse the data accordingly.
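
Roughly, that could look like the sketch below. The three-argument StringTokenizer constructor is real API; the class name, the CANDIDATES string and the firstDelimiter helper are assumptions made for the example, and the candidate set is taken from the question.

```java
import java.util.StringTokenizer;

final class ReturnDelimsSketch {
    // Every character the files might use as a delimiter (from the question).
    static final String CANDIDATES = "~,|^ ";

    // Returns the first delimiter character that occurs in the line, or null if there is none.
    static String firstDelimiter(String line) {
        // With returnDelims = true, each delimiter comes back as its own one-character token.
        StringTokenizer st = new StringTokenizer(line.trim(), CANDIDATES, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() == 1 && CANDIDATES.contains(token)) {
                return token;
            }
        }
        return null;
    }
}
```

Once the first line has told you the delimiter, the rest of the file can be split on that one character and the matching field order applied.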

云淡月浅 2024-08-17 18:24:54

One way to find the delimiter in the file is to use some kind of regex. A simple case would be to find the first character that isn't a letter or a digit: [^A-Za-z0-9]

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDelimiter {
  static String getDelimiter(String str) {
    Pattern p = Pattern.compile("([^A-Za-z0-9])");
    Matcher m = p.matcher(str.trim()); // trim so leading whitespace is not taken for the delimiter
    if (m.find())
      return m.group(0);
    else
      return null;
  }

  public static void main(String[] args) {
    String[] str = {" A, B, C, D", "A B C D", "A;B;C;D"};
    for (String s : str) {
      String delim = getDelimiter(s);
      if (delim == null)
        continue; // no delimiter found, so the line holds a single field
      // split takes a regex; quoting keeps delimiters such as | or ^ literal
      String[] data = s.trim().split(Pattern.quote(delim));
      // do clever stuff with the array
    }
  }
}

In this case I've loaded the data from an array instead of reading from a file. When reading from a file, feed the first line to the getDelimiter method.

年华零落成诗 2024-08-17 18:24:54

Most of the open source CSV parsing libraries allow you to change the delimiter characters, and also have behavior built in to handle escaping. Opencsv seems to be the popular one nowadays, but I haven't used it yet. I was pretty happy with the Ostermiller csv library last time I had to do a lot of csv parsing.
