Best way to parse text files that use several different delimiters?

Posted on 2024-08-10 18:24:54

I need to parse some text files that have different types of delimiters (tildes, spaces, commas, pipes, caret characters).

The order of the elements also depends on the delimiter, e.g.:

comma: A, B, C, D, E
caret: B, C, A, E, D
tilde: C, A, B, D, E

The delimiter is the same within the file but different from one file to another. From what I can tell, there are no delimiters within the data elements.

What's a good approach to do this in plain ol' Java?

她说她爱他 2024-08-17 18:24:54

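I like to read the first two lines of a file and then test each candidate delimiter. If you split on a delimiter and both lines yield the same number of pieces, and that number is greater than one, you've probably guessed the right one. Here's an example program that checks the file names.txt.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class DelimiterSniffer {

    // Candidate delimiters, tried in order; add "~", "|" or "^" as needed.
    private static final String[] DELIMS = { "\t", ",", " " };

    public static void main(String[] args) throws IOException {
        File file = new File("etc/names.txt");

        String delim = getDelimiter(file);
        System.out.println("Delim is " + delim + " (" + (int) delim.charAt(0) + ")");
    }

    private static String getDelimiter(File file) throws IOException {
        String line0;
        String line1;
        // Read the first two lines once instead of reopening the file per delimiter.
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            line0 = br.readLine();
            line1 = br.readLine();
        }
        if (line0 == null || line1 == null) {
            throw new IllegalStateException("Need at least two lines in file " + file);
        }
        for (String delim : DELIMS) {
            // split() takes a regex, so quote the delimiter ("|" and "^" are metacharacters).
            String regex = Pattern.quote(delim);
            int count0 = line0.split(regex).length;
            int count1 = line1.split(regex).length;
            if (count0 == count1 && count0 > 1) {
                return delim;
            }
        }
        throw new IllegalStateException("Failed to find delimiter for file " + file);
    }
}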

吻安 2024-08-17 18:24:54

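I would probably start by playing with Java's StringTokenizer. It takes a string and lets you find each token separated by a delimiter.

Here is an example from the web.

But you want to tokenize the contents of a file. In that case you might want to use Java's StreamTokenizer, which lets you parse input from a file stream.

Edit

If you don't know the delimiter in advance, you can do one of the following:

Split on all possible delimiters. This works as long as your data doesn't contain any of the delimiters itself. (That is, look for both "," and ";", provided your data contains neither of those characters.)
If you know what your data should look like (it should be integers, say, or single characters), your code can try different delimiters (first ",", then ";", and so on) until it parses a line of text "correctly".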
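
The second idea could be sketched roughly as follows. This is only an illustration: the class name `DelimiterByValidation`, the candidate list, and the rule that a "correct" line consists of at least two integer fields are all assumptions, so swap in whatever check fits your real data.

```java
import java.util.regex.Pattern;

class DelimiterByValidation {

    // Assumed candidate delimiters, tried in this order (space last, since
    // comma-separated data often has a space after each comma).
    static final char[] CANDIDATES = { ',', ';', '~', '|', '^', ' ' };

    // Hypothetical notion of "correct": at least two fields, each one an integer.
    static boolean looksValid(String[] fields) {
        if (fields.length < 2) {
            return false;
        }
        for (String field : fields) {
            if (!field.trim().matches("-?\\d+")) {
                return false;
            }
        }
        return true;
    }

    // Returns the first candidate that splits the line into valid fields, or null.
    static Character detect(String line) {
        for (char candidate : CANDIDATES) {
            // Quote the candidate because String.split() expects a regular expression.
            String[] fields = line.split(Pattern.quote(String.valueOf(candidate)));
            if (looksValid(fields)) {
                return candidate;
            }
        }
        return null;
    }
}
```

Calling `DelimiterByValidation.detect(firstLine)` once per file is enough here, because the delimiter is the same throughout a file.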

牵你手 2024-08-17 18:24:54

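If the delimiter is the same throughout a file, write your function for a single delimiter, call it d, and when processing any other file, replace its delimiter with d. Rinse. Repeat. :)

Another approach: have your parsing function take the file name and the delimiter as parameters.
This assumes the parsing logic is the same for all files.

If your files look completely different, then the delimiter is the least of your problems.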
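
A minimal sketch of the "file name plus delimiter as parameters" approach, assuming UTF-8 files with one record per line. The class and method names (`DelimitedFileParser`, `parseFile`) are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DelimitedFileParser {

    // Hypothetical helper: reads every non-empty line and splits it on the given delimiter.
    static List<String[]> parseFile(Path file, char delimiter) throws IOException {
        // Quote the delimiter because the split is regex-based ("|" and "^" are metacharacters).
        Pattern splitter = Pattern.compile(Pattern.quote(String.valueOf(delimiter)));
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                String[] fields = splitter.split(line, -1); // -1 keeps trailing empty fields
                for (int i = 0; i < fields.length; i++) {
                    fields[i] = fields[i].trim();
                }
                rows.add(fields);
            }
        }
        return rows;
    }
}
```

The same parsing logic then serves every file; only the second argument changes, e.g. `parseFile(path, '^')` for a caret-delimited file.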

南城追梦 2024-08-17 18:24:54

If the delimiter is the same throughout the file, then you can probably pass the delimiter in when you load the file for parsing.

For example:

    void someFunction(char delimiter) {
        // Do whatever you need to do with the file here;
        // you can use StringTokenizer for this purpose.
    }

Each time you load a file, call this function with that file's delimiter as the argument.

Hope this helps. :-)

水水月牙 2024-08-17 18:24:54

You could write a class that parses a file, something like this:

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

interface MyParser {
  // An implementing class takes (char delimiter, List<String> fields) in its constructor.
  Map<String, String> parseFile(InputStream file) throws IOException;
}

You'd pass the delimiter and an ordered list of fields to the constructor of the implementing class (a Java interface cannot declare a constructor itself), then ask it to parse a file. You'd get back a map of field names (from the ordered list) to values.

The implementation of parseFile would probably use split with the delimiter and then iterate through the array returned by split and the list of fields in parallel, building the map as it goes.
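
One way such an implementation might look is sketched below. The class name `FieldOrderParser` and its `parse` method are hypothetical, and returning one map per line (rather than a single map per file) is an assumption made so that multi-line files work.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class FieldOrderParser {
    private final Pattern splitter;
    private final List<String> fields;

    FieldOrderParser(char delimiter, List<String> fields) {
        // Quote the delimiter so regex metacharacters such as '|' or '^' are taken literally.
        this.splitter = Pattern.compile(Pattern.quote(String.valueOf(delimiter)));
        this.fields = fields;
    }

    // Assumption: each non-blank line is one record, returned as field name -> value.
    List<Map<String, String>> parse(InputStream in) throws IOException {
        List<Map<String, String>> records = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                String[] values = splitter.split(line, -1);
                if (values.length != fields.size()) {
                    throw new IOException("Expected " + fields.size()
                        + " fields but found " + values.length + ": " + line);
                }
                // Walk the values and the field names in parallel.
                Map<String, String> record = new LinkedHashMap<>();
                for (int i = 0; i < values.length; i++) {
                    record.put(fields.get(i), values[i].trim());
                }
                records.add(record);
            }
        }
        return records;
    }
}
```

With the field orders from the question, a caret-delimited file would be handled by `new FieldOrderParser('^', Arrays.asList("B", "C", "A", "E", "D"))` and a tilde-delimited one by `new FieldOrderParser('~', Arrays.asList("C", "A", "B", "D", "E"))`.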

欢烬 2024-08-17 18:24:54

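One possible approach is to use the Java Compiler Compiler, JavaCC (https://javacc.dev.java.net/). With it, you can write a set of rules describing what you will accept and which delimiters may appear at any given point. The rules given to the engine can also resolve the ordering problem, depending on which delimiter is in use. If necessary, a file could even switch delimiters along the way.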

玩套路吗 2024-08-17 18:24:54

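If the exact order of the fields is known for each delimiter, I'd just create a parser that returns a Record object for each line, something like the code below.

This does include a lot of hard-coded values, but I'm not sure how flexible you need this to be. I would consider it more of a scripty/hacky solution than something you could extend. If you don't know the delimiter, you could test the first line of the file with the String.split() method and see whether the number of columns matches the expected count.

import java.util.StringTokenizer;

class MyParser {
    public static Record parseLine(String line, char delimiter) {
        // StringTokenizer expects its delimiters as a String, not a char.
        StringTokenizer st = new StringTokenizer(line, String.valueOf(delimiter));
        // An array replaces the five temp variables.
        // Note: StringTokenizer skips empty fields, so every line must have five non-empty values.
        String[] t = new String[5];
        for (int i = 0; i < t.length; i++) {
            t[i] = st.nextToken().trim();
        }

        // Field orders below are taken from the example in the question.
        Record ret = new Record();
        switch (delimiter) {
            case ',': // file order: A, B, C, D, E
                ret.A = t[0]; ret.B = t[1]; ret.C = t[2]; ret.D = t[3]; ret.E = t[4];
                break;
            case '^': // file order: B, C, A, E, D
                ret.B = t[0]; ret.C = t[1]; ret.A = t[2]; ret.E = t[3]; ret.D = t[4];
                break;
            case '~': // file order: C, A, B, D, E
                ret.C = t[0]; ret.A = t[1]; ret.B = t[2]; ret.D = t[3]; ret.E = t[4];
                break;
            default:
                throw new IllegalArgumentException("Unsupported delimiter: " + delimiter);
        }
        return ret;
    }
}

class Record {
    String A;
    String B;
    String C;
    String D;
    String E;
}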

無心 2024-08-17 18:24:54

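You can use StringTokenizer, as mentioned earlier. Yes, you need to pass it a single string containing all the possible delimiters. Don't forget to set the tokenizer's returnDelims flag (the third constructor argument). That way the delimiters come back as tokens too, so you will know which one the file uses and can then parse the data accordingly.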
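
A rough sketch of that idea, using the three-argument `StringTokenizer(String, String, boolean)` constructor. The class name `TokenizerSniffer`, the method `findDelimiter`, and the exact candidate string are assumptions for illustration.

```java
import java.util.StringTokenizer;

class TokenizerSniffer {

    // Candidate delimiters from the question: tilde, space, comma, pipe, caret.
    static final String DELIMS = "~ ,|^";

    // Returns the first delimiter character found in the line, or null if there is none.
    static Character findDelimiter(String line) {
        // returnDelims = true makes the tokenizer hand back delimiters as one-character tokens.
        StringTokenizer st = new StringTokenizer(line.trim(), DELIMS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() == 1 && DELIMS.indexOf(token.charAt(0)) >= 0) {
                return token.charAt(0);
            }
        }
        return null;
    }
}
```

This relies on the question's premise that data elements never contain a delimiter. If values in a comma-delimited file can contain spaces, the first delimiter seen could be a space, so a stricter check would be needed.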

云淡月浅 2024-08-17 18:24:54

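One way to find the delimiter in the file is to use some kind of regex. A simple case would be to find the first character that isn't a letter or a digit: [^A-Za-z0-9]

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDelimiter {

  // Matches any character that is neither a letter nor a digit.
  private static final Pattern NON_ALNUM = Pattern.compile("[^A-Za-z0-9]");

  static String getDelimiter(String str) {
    // Trim first so leading whitespace is not mistaken for the delimiter.
    Matcher m = NON_ALNUM.matcher(str.trim());
    return m.find() ? m.group() : null;
  }

  public static void main(String[] args) {
    String[] str = {" A, B, C, D", "A B C D", "A;B;C;D"};
    for (String s : str) {
      String delim = getDelimiter(s);
      if (delim == null) continue; // no delimiter found in this line
      // Quote the delimiter: split() takes a regex, and "|" or "^" are metacharacters.
      String[] data = s.split(Pattern.quote(delim));
      // do clever stuff with the array
    }
  }
}

In this case I've loaded the data from an array instead of reading from a file. When reading from a file, feed the first line to the getDelimiter method.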
