File processing in Java

Posted 2024-09-15 09:02:40

I have a 2 GB file of student records. I need to find students based on certain attributes in each record and create a new file with the results. The filtered students should appear in the same order as in the original file. What is the most efficient and fastest way to do this using the Java IO API and threads without running into memory problems? The maximum heap size for the JVM is set to 512 MB.


Comments (4)

百善笑为先 2024-09-22 09:03:12

I think you should use memory-mapped files. They let you map a large file into a smaller amount of memory, working much like virtual memory, and as far as performance is concerned, mapped files tend to be faster than stream reads and writes.
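A minimal sketch of the memory-mapped approach, assuming the records are newline-delimited UTF-8 text (the question does not say). The class name `MappedFilter`, the `windowSize` parameter and the predicate are hypothetical names for illustration. Because a single `FileChannel.map` call cannot map more than `Integer.MAX_VALUE` bytes, the file is mapped one window at a time, and each window restarts at the first line that did not fit in the previous one.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Predicate;

public class MappedFilter {
    /** Copies accepted lines from in to out in their original order; returns how many were written. */
    public static long filter(Path in, Path out, int windowSize, Predicate<String> keep)
            throws IOException {
        long written = 0;
        try (FileChannel src = FileChannel.open(in, StandardOpenOption.READ);
             OutputStream dst = new BufferedOutputStream(Files.newOutputStream(out))) {
            long fileSize = src.size();
            long pos = 0;
            while (pos < fileSize) {
                int len = (int) Math.min((long) windowSize, fileSize - pos);
                // The mapping lives outside the Java heap
                MappedByteBuffer window = src.map(FileChannel.MapMode.READ_ONLY, pos, len);
                boolean lastWindow = pos + len == fileSize;
                int lineStart = 0;
                for (int i = 0; i < len; i++) {
                    if (window.get(i) == '\n') {
                        if (emit(window, lineStart, i + 1, keep, dst)) written++;
                        lineStart = i + 1;
                    }
                }
                if (lastWindow && lineStart < len) {
                    // Final line of the file has no trailing newline
                    if (emit(window, lineStart, len, keep, dst)) written++;
                    lineStart = len;
                }
                if (lineStart == 0) {
                    throw new IOException("line longer than window of " + windowSize + " bytes");
                }
                pos += lineStart; // next window starts at the first incomplete line
            }
        }
        return written;
    }

    // Copies bytes [from, to) of the window; writes them unchanged if the line is accepted.
    private static boolean emit(MappedByteBuffer window, int from, int to,
                                Predicate<String> keep, OutputStream dst) throws IOException {
        byte[] bytes = new byte[to - from];
        ByteBuffer view = window.duplicate();
        view.position(from);
        view.get(bytes);
        String line = new String(bytes, StandardCharsets.UTF_8);
        int end = line.length();
        while (end > 0 && (line.charAt(end - 1) == '\n' || line.charAt(end - 1) == '\r')) end--;
        if (!keep.test(line.substring(0, end))) return false;
        dst.write(bytes);
        return true;
    }
}
```

A window of a few tens of megabytes is probably a sensible starting point, since each window is a separate mapping that is only released when the buffer is garbage collected. Whether this actually beats a plain buffered stream on a given machine is something to measure rather than assume.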

旧时光的容颜 2024-09-22 09:03:09
  1. 2GB for a single file is huge; you should go for a database.
  2. If you really want to use the Java I/O API, then try these articles: Handling large data files efficiently with Java, and Tuning Java I/O Performance.
〗斷ホ乔殘χμё〖 2024-09-22 09:03:06

I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:

  • open an input stream on the 2GB file, remembering to buffer it (e.g. by wrapping it in a BufferedInputStream)
  • open an output stream to the filtered file you're going to create
  • read the first record from the input stream and look at whichever attribute decides whether you "need" it; if you do, write it to the output file
  • repeat for the remaining records
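The steps above could be sketched roughly as follows. The record layout is not given in the question, so this assumes fixed-length binary records of `recordSize` bytes; `StreamFilter`, `recordSize` and the predicate are hypothetical names for illustration only.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.function.Predicate;

public class StreamFilter {
    /** Copies accepted records from in to out in their original order; returns how many were written. */
    public static long filter(File in, File out, int recordSize, Predicate<byte[]> keep)
            throws IOException {
        long written = 0;
        byte[] record = new byte[recordSize]; // the only per-record memory; reused for every record
        try (InputStream is = new BufferedInputStream(new FileInputStream(in));
             OutputStream os = new BufferedOutputStream(new FileOutputStream(out))) {
            while (readRecord(is, record)) {
                if (keep.test(record)) {
                    os.write(record);
                    written++;
                }
            }
        }
        return written;
    }

    // Fills the array completely; returns false on a clean end of file.
    private static boolean readRecord(InputStream is, byte[] record) throws IOException {
        int filled = 0;
        while (filled < record.length) {
            int n = is.read(record, filled, record.length - filled);
            if (n < 0) {
                if (filled == 0) return false;
                throw new EOFException("truncated record at end of file");
            }
            filled += n;
        }
        return true;
    }
}
```

Since records are read and written strictly in sequence by a single thread, the output order matches the input order without any extra bookkeeping, and heap use stays at one record plus the two stream buffers.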

On one of my test systems with extremely modest hardware, a BufferedInputStream around a FileInputStream, out of the box, read about 500 MB in 25 seconds (roughly 20 MB/s), which works out to probably under 2 minutes for your 2GB file. The default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine that with state-of-the-art hardware the time could quite possibly be halved.

Whether you need to put a lot of effort into cutting those 2 to 3 minutes, or can just take a short break while it runs, is a decision you'll have to make based on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions for that which don't automatically mean a database).

独自唱情﹋歌 2024-09-22 09:02:59

What kind of file? Text-based, like CSV?

The easiest way would be to do what grep does: read the file line by line, parse each line, check your filter criterion, write the line to the output if it matches, then move on to the next line until the file is done. This is very memory efficient, as only the current line (or a slightly larger buffer) is held in memory at any one time. Your process needs to read through the whole file just once.
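A minimal sketch of that grep-style loop, assuming a UTF-8 text file with one record per line. `LineFilter` and the CSV criterion in `main` (third column equals "CS") are hypothetical and only there to make the example concrete.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Predicate;

public class LineFilter {
    /** Copies accepted lines from in to out in their original order; returns how many were written. */
    public static long filter(Path in, Path out, Predicate<String> keep) throws IOException {
        long written = 0;
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            // Only the current line is held in memory at any time
            while ((line = reader.readLine()) != null) {
                if (keep.test(line)) {
                    writer.write(line);
                    writer.newLine();
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical criterion: the third comma-separated field equals "CS"
        Predicate<String> isCs = line -> {
            String[] fields = line.split(",", -1);
            return fields.length > 2 && fields[2].equals("CS");
        };
        long n = filter(Paths.get(args[0]), Paths.get(args[1]), isCs);
        System.out.println(n + " records written");
    }
}
```

A naive `split(",")` does not handle quoted fields that contain commas, so a real CSV file may need a proper parser in the predicate.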

I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.

If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
