How should I parse this simple text file in Java?

Posted 2024-08-27 15:18:52 · 323 words · 13 views · 0 comments

I have a text file that looks like this:

grn129          agri-
ac-214          ahss
hud114          ahss
lov1150         ahss
lov1160         ahss
lov1170         ahss
lov1210         ahss

What is the best way to parse this file using Java if I want to create a HashMap with the first column as the key and the second column as the value?

Should I use the Scanner class? Try to read in the whole file as a string and split it?

What is the best way?

Comments (7)

可可 2024-09-03 15:18:52

Here's how I would do it! I've been almost exclusively a Java programmer since 2000, so it might be a little old-fashioned. There is one line in particular I'm a little proud of:

new InputStreamReader(fin, "UTF-8");

http://www.joelonsoftware.com/articles/Unicode.html

Enjoy!

import java.io.*;
import java.util.*;

public class StackOverflow2565230 {

  public static void main(String[] args) throws Exception {
    Map<String, String> m = new LinkedHashMap<String, String>();
    FileInputStream fin = null;
    InputStreamReader isr = null;
    BufferedReader br = null;
    try {
      fin = new FileInputStream(args[0]);
      isr = new InputStreamReader(fin, "UTF-8");
      br = new BufferedReader(isr);
      String line = br.readLine();
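      while (line != null) {
        // Split on runs of whitespace; trim first so leading spaces don't produce an empty key
        String[] toks = line.trim().split("\\s+");
        // Skip blank or one-column lines rather than throwing ArrayIndexOutOfBoundsException
        if (toks.length >= 2) {
          m.put(toks[0], toks[1]);
        }
        line = br.readLine();
      }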
    } finally {
      if (br != null)  { br.close();  }
      if (isr != null) { isr.close(); }
      if (fin != null) { fin.close(); }
    }

    System.out.println(m);
  }

}

And here's the output:

julius@flower:~$ javac StackOverflow2565230.java 
julius@flower:~$ java -cp .  StackOverflow2565230  file.txt 
{grn129=agri-, ac-214=ahss, hud114=ahss, lov1150=ahss, lov1160=ahss, lov1170=ahss, lov1210=ahss}

Yes, my computer's name is Flower. Named after the skunk from Bambi.

One final note: because close() can throw an IOException, this is how I would really close the streams:

} finally {
  try {
    if (br != null) br.close();
  } finally {
    try {
      if (isr != null) isr.close();
    } finally {
      if (fin != null) fin.close();
    }
  }
}
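
For readers on Java 7 or later, the nested finally blocks above can probably be replaced by a try-with-resources statement, which closes the reader even when readLine() throws. This is only a minimal sketch and not part of the original answer: the class name TryWithResourcesSketch and the parse(Reader) helper are made up for illustration, and the sample input is fed from a StringReader instead of a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class name, used only for this illustration.
class TryWithResourcesSketch {

  // Reads "key value" lines from any Reader. The BufferedReader (and the
  // Reader it wraps) is closed automatically when the try block exits.
  static Map<String, String> parse(Reader source) throws IOException {
    Map<String, String> m = new LinkedHashMap<String, String>();
    try (BufferedReader br = new BufferedReader(source)) {
      String line;
      while ((line = br.readLine()) != null) {
        String[] toks = line.trim().split("\\s+");
        if (toks.length >= 2) { // ignore blank or one-column lines
          m.put(toks[0], toks[1]);
        }
      }
    }
    return m;
  }

  public static void main(String[] args) throws IOException {
    // Sample data stands in for the file; for a real file, pass
    // new InputStreamReader(new FileInputStream(args[0]), "UTF-8") instead.
    String sample = "grn129          agri-\nac-214          ahss\n";
    System.out.println(parse(new StringReader(sample)));
  }
}
```

Closing the BufferedReader also closes the reader it wraps, so a single resource in the try header should be enough here.
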
痴骨ら 2024-09-03 15:18:52

Based on @Julius Davies's answer, here is a shorter version.

import java.io.*; 
import java.util.*; 

public class StackOverflow2565230b { 
  public static void main(String... args) throws IOException { 
    Map<String, String> m = new LinkedHashMap<String, String>(); 
    BufferedReader br = null; 
    try { 
      br = new BufferedReader(new FileReader(args[0])); 
      String line;
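      while ((line = br.readLine()) != null) {
        // Split on runs of whitespace; trim first so leading spaces don't produce an empty key
        String[] toks = line.trim().split("\\s+");
        // Skip blank or one-column lines rather than throwing ArrayIndexOutOfBoundsException
        if (toks.length >= 2) {
          m.put(toks[0], toks[1]);
        }
      }
    } finally {
      if (br != null) br.close(); // br is still null if the file could not be opened; avoid an NPE
    }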

    System.out.println(m); 
  } 
}
白日梦 2024-09-03 15:18:52

I don't know about the best way, but I suspect that the most efficient way would be to read one line at a time (using BufferedReader), and then split each line by finding the first whitespace character, splitting there, and then trimming both sides. However, whatever you like best is fine unless it needs to be super fast.

I am personally biased against loading an entire file all at once... aside from the fact that it assumes there is enough memory to hold the entire file, it doesn't allow for any parallel computation (for example, if input is coming in from a pipe). It makes sense to be able to process the input while it is still being generated.
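
A rough sketch of the "find the first whitespace character, split there, trim both sides" idea might look like the following. The class and method names are hypothetical, and returning null for lines without a second column is an assumption about how malformed input should be handled.

```java
// Hypothetical class name, used only for this illustration.
class FirstWhitespaceSplit {

  // Returns {key, value}, or null when the line has no second column.
  static String[] splitAtFirstWhitespace(String line) {
    String s = line.trim();
    int i = 0;
    // Advance to the first whitespace character; no regex involved.
    while (i < s.length() && !Character.isWhitespace(s.charAt(i))) {
      i++;
    }
    if (i == 0 || i == s.length()) {
      return null; // blank line, or a key with no value
    }
    return new String[] { s.substring(0, i), s.substring(i).trim() };
  }

  public static void main(String[] args) {
    String[] kv = splitAtFirstWhitespace("lov1150         ahss");
    System.out.println(kv[0] + " -> " + kv[1]);
  }
}
```

Each line from BufferedReader.readLine() would be passed through this helper, so the file is still processed one line at a time.
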

独木成林 2024-09-03 15:18:52

Using a Scanner or a normal FileReader + String.split() should both work fine. I think the speed differences are minimal, and unless you plan to read a very large file over and over again, it doesn't matter.

EDIT: Actually, for the second method, use a BufferedReader. It has a readLine() method, which makes things slightly easier.
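
For comparison, a Scanner-based version could be sketched roughly as follows. The class name ScannerSketch and the parse helper are invented for this example, and the input comes from a string instead of a file; for a real file one would presumably construct the outer Scanner with new Scanner(new File(path), "UTF-8").

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical class name, used only for this illustration.
class ScannerSketch {

  // Reads line by line, then tokenizes each line with a second Scanner so
  // that blank or one-column lines cannot shift keys and values out of step.
  static Map<String, String> parse(Scanner in) {
    Map<String, String> m = new LinkedHashMap<String, String>();
    while (in.hasNextLine()) {
      Scanner line = new Scanner(in.nextLine());
      if (line.hasNext()) {
        String key = line.next();
        if (line.hasNext()) {
          m.put(key, line.next());
        }
      }
    }
    return m;
  }

  public static void main(String[] args) {
    String sample = "grn129          agri-\nac-214          ahss\n";
    System.out.println(parse(new Scanner(sample)));
  }
}
```
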

ペ泪落弦音 2024-09-03 15:18:52

If you wish to follow the textbook solution, use StringTokenizer. It's straightforward, easy to learn and quite simple. It can cope with simple deviations in structure (a variable number of whitespace characters, unevenly formatted lines, etc.).

But if your text is known to be 100% well-formatted and predictable, then just read a bunch of lines into a buffer, take them one at a time, and extract parts of each string as your HashMap key and value. It's faster than StringTokenizer, but lacks the flexibility.
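
As a rough illustration of the StringTokenizer approach, one line could be split like this. The class and method names are hypothetical, and returning null for short lines is an assumption. Note that the class's own Javadoc describes StringTokenizer as a legacy class whose use is discouraged in new code, so this is shown mainly for completeness.

```java
import java.util.StringTokenizer;

// Hypothetical class name, used only for this illustration.
class TokenizerSketch {

  // Returns {key, value}, or null when the line has fewer than two tokens.
  static String[] splitLine(String line) {
    // The default delimiters are space, tab, newline, carriage return and
    // form feed, so runs of spaces between the columns are skipped.
    StringTokenizer st = new StringTokenizer(line);
    if (st.countTokens() < 2) {
      return null;
    }
    return new String[] { st.nextToken(), st.nextToken() };
  }

  public static void main(String[] args) {
    String[] kv = splitLine("hud114          ahss");
    System.out.println(kv[0] + " -> " + kv[1]);
  }
}
```
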

无声情话 2024-09-03 15:18:52

How about caching the regular expression? (With a pattern such as "\\s+", String.split() compiles the regular expression on every call.)

I'd be curious to see how the methods compare if you performance-tested each of them on several large files (100, 1k, 100k, 1m, 10m entries).

import java.io.*;
import java.util.*;
import java.util.regex.*;

public class So2565230 {

    private static final Pattern rgx = Pattern.compile("^([^ ]+)[ ]+(.*)$");

    private static InputStream getTestData(String charEncoding) throws UnsupportedEncodingException {
        String nl = System.getProperty("line.separator");
        StringBuilder data = new StringBuilder();
        data.append(" bad data " + nl);
        data.append("grn129          agri-" + nl);
        data.append("grn129          agri-" + nl);
        data.append("ac-214          ahss" + nl);
        data.append("hud114          ahss" + nl);
        data.append("lov1150         ahss" + nl);
        data.append("lov1160         ahss" + nl);
        data.append("lov1170         ahss" + nl);
        data.append("lov1210         ahss" + nl);
        byte[] dataBytes = data.toString().getBytes(charEncoding);
        return new ByteArrayInputStream(dataBytes);
    }

    public static void main(final String[] args) throws IOException {
        String encoding = "UTF-8";

        Map<String, String> valuesMap = new LinkedHashMap<String, String>();

        InputStream is = getTestData(encoding);
        new So2565230().fill(valuesMap, is, encoding);

        for (Map.Entry<String, String> entry : valuesMap.entrySet()) {
            System.out.format("K=[%s] V=[%s]%n", entry.getKey(), entry.getValue());
        }
    }

    private void fill(Map<String, String> map, InputStream is, String charEncoding) throws IOException {
        BufferedReader bufReader = new BufferedReader(new InputStreamReader(is, charEncoding));
        for (String line = bufReader.readLine(); line != null; line = bufReader.readLine()) {
            Matcher m = rgx.matcher(line);
            if (!m.matches()) {
                System.err.println("Line has improper format (" + line + ")");
                continue;
            }
            String key = m.group(1);
            String value = m.group(2);
            if (map.put(key, value) != null) {
                System.err.println("Duplicate key detected: (" + line + ")");
            }
        }
    }
}
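
If the split-based style from the other answers is preferred, the same caching idea could probably be applied by compiling the whitespace pattern once and calling Pattern.split on it. This is just a sketch: the class name, the WHITESPACE constant and the null return for malformed lines are all assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class CachedSplitSketch {

  // Compiled once and reused for every line.
  private static final Pattern WHITESPACE = Pattern.compile("\\s+");

  // Returns {key, value}, or null when the line has no second column.
  // The limit of 2 keeps any further whitespace inside the value.
  static String[] splitLine(String line) {
    String[] toks = WHITESPACE.split(line.trim(), 2);
    return toks.length == 2 ? toks : null;
  }

  public static void main(String[] args) {
    String[] kv = splitLine("lov1210         ahss");
    System.out.println(kv[0] + " -> " + kv[1]);
  }
}
```
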
唔猫 2024-09-03 15:18:52

Julius Davies's answer is fine.

However, I am afraid you will have to define the format of the text file to be parsed. For example, what is the separator character between the first and second columns? If it is not fixed, parsing will be more difficult.
