What is the fastest way to read integers from a file in Java?
I have a file of integers arranged like this:
1 2 3 55 22 11 (and so on)
And I want to read in these numbers as fast as possible to lessen the total execution time of my program. So far, I am using a scanner with good results. However, I get the feeling that there exists a faster IO utility I can use. Can anyone please point me in the right direction?
EDIT:
So yes, I verified that it is the IO in my program that's taking the most time, by setting up timers around the Java code and comparing results.
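For reference, the Scanner approach described presumably looks something like the sketch below (numbers.txt is a placeholder file name, not from the original post):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class CurrentApproach {
    public static void main(String[] args) throws FileNotFoundException {
        List<Integer> numbers = new ArrayList<>();
        // Scanner handles the whitespace-delimited format directly
        try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
            while (scanner.hasNextInt()) {
                numbers.add(scanner.nextInt());
            }
        }
        System.out.println("read " + numbers.size() + " ints");
    }
}
```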
3 Answers
Current file format
If the numbers are represented as Strings, there is no faster way to read them in and parse them; disk I/O is going to be orders of magnitude slower than anything the CPU is doing. The only thing you can do is use a BufferedReader with a huge buffer size and try to get as much, if not all, of the file into memory before using a Scanner, as in the sketch below.
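A minimal sketch of that setup, assuming a placeholder file numbers.txt and an arbitrarily chosen 1 MB buffer:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class BufferedTextRead {
    public static void main(String[] args) throws IOException {
        List<Integer> numbers = new ArrayList<>();
        // 1 MB buffer instead of BufferedReader's 8 KB default
        try (Scanner scanner = new Scanner(
                new BufferedReader(new FileReader("numbers.txt"), 1 << 20))) {
            while (scanner.hasNextInt()) {
                numbers.add(scanner.nextInt());
            }
        }
        System.out.println("read " + numbers.size() + " ints");
    }
}
```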
Alternate file format
If you can represent them as binary in the file and read the numbers in using the DataInputStream class, then you might get a small decrease in I/O time and a marginal CPU decrease, because you don't need to parse the String representation into an int. That probably would not be measurable unless your input file is in the hundreds of megabytes or larger. Buffering the input stream will still have more effect than anything else; use a BufferedInputStream in this case.
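A hedged sketch of what that could look like; numbers.bin and the sample values are placeholders, with DataOutputStream.writeInt producing the binary file in the first place:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BinaryInts {
    public static void main(String[] args) throws IOException {
        // Write a few ints in binary form (numbers.bin is a placeholder name).
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("numbers.bin")))) {
            for (int n : new int[] {1, 2, 3, 55, 22, 11}) {
                out.writeInt(n);
            }
        }
        // Read them back: 4 bytes per int, no String parsing involved.
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream("numbers.bin"), 1 << 20))) {
            while (true) {
                try {
                    System.out.println(in.readInt());
                } catch (EOFException endOfFile) {
                    break; // readInt signals end of file this way
                }
            }
        }
    }
}
```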
How to optimize

You need robust profiling to even detect whether any changes you make impact performance positively or negatively.
Things like OS disk caching will skew your benchmarks: if you read the same file in over and over, the OS will cache it and distort the results. Learn what good enough is sooner rather than later.
The premature part of Knuth's quote is the important part; it means:
Don't optimize without profiling and benchmarks to verify that what you are changing is actually a bottleneck and that you can measure the positive or negative impact of your changes.
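As one illustration (not the benchmark used below), a bare-bones timing harness might look like this sketch; readAllInts is a hypothetical hook for whichever reading strategy is being measured, and a serious measurement would use a harness like JMH instead:

```java
public class Timing {
    // Hypothetical hook: plug in whichever reading strategy is being measured.
    static void readAllInts() {
        // ... read the file here ...
    }

    public static void main(String[] args) {
        // Warm up so the JIT has compiled the hot paths before measuring.
        // Note the OS disk cache caveat above: repeated reads of the same
        // file will mostly measure memory, not disk.
        for (int i = 0; i < 5; i++) {
            readAllInts();
        }
        long start = System.nanoTime();
        readAllInts();
        System.out.printf("read took %.2f ms%n", (System.nanoTime() - start) / 1e6);
    }
}
```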
Here is a quick benchmark comparing a BufferedInputStream reading a set of binary numbers versus a Scanner backed by a BufferedReader reading the same set of numbers as text representations with a SPACE delimiter. The results are pretty consistent:
For 1,000 numbers on my Core i3 laptop with 8GB of RAM
For 1,000,000 numbers on my Core i3 laptop with 8GB of RAM
For 50,000,000 numbers on my Core i3 laptop with 8GB of RAM
File sizes for the 50,000,000 numbers were as follows:
Reading the binary is much faster until the set of numbers grows very large. I/O on binary-encoded ints is lower (by about 10 times), there is no String parsing logic, and none of the other overhead of object creation and whatever else Scanner does. I went ahead and used the Buffered versions of the InputStream and Reader classes, because those are best practices and should be used whenever possible.

For extra credit, compression would reduce the I/O wait even more on large files, with almost no measurable effect on CPU time.
Generally you can read the data as fast as the disk allows. The best way to read it faster is to make it more compact or get a faster disk.
For the format you are using, I would GZip the files and read the compressed data. This is a simple way to increase the rate you can read the underlying data.
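A sketch of what that could look like, assuming the space-separated file has been gzipped to numbers.txt.gz (a placeholder name); the stream is decompressed on the fly, trading a little CPU for less disk I/O:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;

public class GzipRead {
    public static void main(String[] args) throws IOException {
        // GZIPInputStream decompresses as bytes are read, so the Scanner
        // sees the same space-separated text as before.
        try (Scanner scanner = new Scanner(new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("numbers.txt.gz")),
                StandardCharsets.US_ASCII)))) {
            long sum = 0;
            while (scanner.hasNextInt()) {
                sum += scanner.nextInt();
            }
            System.out.println("sum = " + sum);
        }
    }
}
```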
Escalation possibilities:
There is always a tradeoff in gaining more performance/speed. The above methods cost money and have to be applied on every host, so if this is a program sold to multiple customers, it could be a better option to tune the algorithm instead, which saves money on every host the program runs on.
If you compress the file or store binary data, the read speed increases, but it becomes harder to inspect the data with independent tools. Of course, we cannot tell how often that might be needed.
In most circumstances I would suggest keeping the data human readable and living with a slower program, but of course it depends on how much time you lose, how often you lose it, and so on.
And maybe this is just an exercise to find out how fast you can get. But I would warn against the habit of always reaching for the highest performance without considering the tradeoffs and the costs.