Parsing a huge plain text file

Posted on 12-29 18:31

I have a huge text file (207 MB, 4 million lines) and I need to read it sequentially line by line.
Every line has this format:
20227993821NAME AND SURNAME NINIC NN08
I was using (for regular files) the Java library's FileReader and BufferedReader like this:

FileReader dataFile = new FileReader(directory);  // "directory" holds the path to the data file
BufferedReader data = new BufferedReader(dataFile);
String s;
while((s = data.readLine()) != null){
    //do stuff
}

with no problems, but with huge files it takes too much time to process.
I wonder what the best practice would be in such cases (another library, different methods, etc.); anything would be helpful.
The file is issued periodically by a government agency and it must be loaded into my software for data comparison.

Edit:

This code:

BufferedReader data = new BufferedReader(new FileReader(file));
String s;
int count = 0;
while ((s = data.readLine()) != null) {
   System.out.println (count + " - " + s);
   count++;
}
data.close();

executed in 19 minutes 30 seconds. I don't know why it took so long.
I have a 64-bit operating system and an i5 processor.

Comments (1)

≈。彩虹2025-01-05 18:31:19

If I run

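import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class ReadTiming {
    public static void main(String[] args) throws IOException {
        File file = new File("/tmp/deleteme.txt");
        file.deleteOnExit();

        // Write 4 million lines of 50 characters each.
        long start = System.nanoTime();
        try (PrintWriter pw = new PrintWriter(file)) {
            for (int i = 0; i < 4 * 1000 * 1000; i++)
                pw.println("01234567890123456789012345678901234567890123456789");
        }

        // Read the file back line by line without processing the lines.
        long mid = System.nanoTime();
        try (BufferedReader data = new BufferedReader(new FileReader(file))) {
            String s;
            while ((s = data.readLine()) != null) {
                // do stuff
            }
        }
        long end = System.nanoTime();

        System.out.printf("Took %.3f seconds to write and %.3f seconds to read a %.2f MB file.%n",
                (mid - start) / 1e9, (end - mid) / 1e9, file.length() / 1e6);
    }
}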

it prints

Took 0.465 seconds to write and 0.522 seconds to read a 204.00 MB file.

EDIT: If I print out each line, it slows down dramatically because writing to the screen takes a long time. I have found the MS-DOS window to be especially slow.

Took 0.467 seconds to write and 10.254 seconds to read a 204.00 MB file.
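
If the lines really do have to be echoed, one option is to send them through a buffered writer so the console is written in large chunks rather than once per line. The following is only a minimal sketch under that assumption: the class name BufferedEcho, the method echoNumbered and the 64 KB buffer size are illustrative choices, not anything taken from the question, and how much it helps depends on the terminal.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.io.Writer;

class BufferedEcho {
    /**
     * Copies every line from 'in' to 'out', prefixed with its zero-based index,
     * and returns the number of lines copied. Output is collected in a 64 KB
     * buffer (an arbitrary size) and flushed once at the end.
     */
    static int echoNumbered(Reader in, Writer out) {
        BufferedReader data = new BufferedReader(in);
        // PrintWriter created without autoflush, so println does not flush per line.
        PrintWriter pw = new PrintWriter(new BufferedWriter(out, 1 << 16));
        int count = 0;
        try {
            String s;
            while ((s = data.readLine()) != null) {
                pw.println(count + " - " + s);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        pw.flush(); // push whatever is still buffered; 'out' is left open for the caller
        return count;
    }

    public static void main(String[] args) {
        // Tiny in-memory sample standing in for the real 207 MB file.
        Reader sample = new StringReader("first\nsecond\nthird\n");
        int n = echoNumbered(sample, new OutputStreamWriter(System.out));
        System.err.println("lines: " + n);
    }
}
```

For the real file, the StringReader would be swapped for a FileReader on the data file. Dropping the per-line output entirely, or logging only every N-th line, is likely to help more than any buffering.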

I don't believe it's the reading of the file that is taking too long; it is what you are doing with it that is taking a long time.
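
One way to check this on your own data is to time a pass that only reads the lines and a second pass that also runs the per-line work, then compare the two. The sketch below is an illustration, not a benchmark harness: ReadVsProcess, timePass and writeSample are hypothetical names, the sample file merely repeats the example line from the question, and UTF-8 is an assumed encoding (the real file's charset is not stated).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.function.Consumer;

class ReadVsProcess {
    /** Reads every line of 'path', passes it to 'perLine', and returns the elapsed seconds. */
    static double timePass(Path path, Consumer<String> perLine) {
        long start = System.nanoTime();
        // UTF-8 is an assumption; use the charset the real file is written in.
        try (BufferedReader data = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String s;
            while ((s = data.readLine()) != null) {
                perLine.accept(s);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return (System.nanoTime() - start) / 1e9;
    }

    /** Writes a temporary file with 'lines' copies of the example line from the question. */
    static Path writeSample(int lines) {
        try {
            Path path = Files.createTempFile("sample", ".txt");
            path.toFile().deleteOnExit();
            Files.write(path, Collections.nCopies(lines, "20227993821NAME AND SURNAME NINIC NN08"));
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path path = writeSample(100_000);

        // Pass 1: reading only, the lines are discarded.
        double readOnly = timePass(path, line -> { });

        // Pass 2: reading plus a stand-in for the real per-line work.
        long[] totalChars = {0};
        double withWork = timePass(path, line -> totalChars[0] += line.length());

        System.out.printf("read only: %.3f s, read + work: %.3f s (%d chars)%n",
                readOnly, withWork, totalChars[0]);
    }
}
```

If the read-only pass is fast and the second pass is slow, the time is going into the per-line work (printing, parsing, database calls and so on) rather than into the file I/O. A single run like this is affected by JIT warm-up and the operating system's file cache, so treat the numbers as rough.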
