What is the fastest way to read integers from a file in Java?
I have a file of integers arranged like this:
1 2 3 55 22 11 (and so on)
And I want to read in these numbers as fast as possible to lessen the total execution time of my program. So far, I am using a scanner with good results. However, I get the feeling that there exists a faster IO utility I can use. Can anyone please point me in the right direction?
EDIT:
So yes, I verified that it is the IO in my program that's taking the most time, by setting up timers around the Java code and comparing results.
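For reference, the Scanner approach described presumably looks something like the sketch below (numbers.txt is a placeholder file name, not from the original post):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class CurrentApproach {
    public static void main(String[] args) throws FileNotFoundException {
        List<Integer> numbers = new ArrayList<>();
        // Scanner handles the whitespace-delimited format directly
        try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
            while (scanner.hasNextInt()) {
                numbers.add(scanner.nextInt());
            }
        }
        System.out.println("read " + numbers.size() + " ints");
    }
}
```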
3 Answers
Current file format
If the numbers are represented as Strings, there is no faster way to read them in and parse them; disk I/O is going to be orders of magnitude slower than anything the CPU is doing. The only thing you can do is use a BufferedReader with a huge buffer size and try to get as much, if not all, of the file into memory before using a Scanner, as in the sketch below.
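A minimal sketch of that setup, assuming a placeholder file numbers.txt and an arbitrarily chosen 1 MB buffer:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class BufferedTextRead {
    public static void main(String[] args) throws IOException {
        List<Integer> numbers = new ArrayList<>();
        // 1 MB buffer instead of BufferedReader's 8 KB default
        try (Scanner scanner = new Scanner(
                new BufferedReader(new FileReader("numbers.txt"), 1 << 20))) {
            while (scanner.hasNextInt()) {
                numbers.add(scanner.nextInt());
            }
        }
        System.out.println("read " + numbers.size() + " ints");
    }
}
```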
Alternate file format
If you can represent them as binary in the file and read the numbers in using the DataInputStream class, then you might get a small decrease in I/O time and a marginal CPU decrease, because you don't need to parse the String representation into an int. That probably would not be measurable unless your input file is in the hundreds of megabytes or larger. Buffering the input stream will still have more effect than anything else; use a BufferedInputStream in this case.
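A hedged sketch of what that could look like; numbers.bin and the sample values are placeholders, with DataOutputStream.writeInt producing the binary file in the first place:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BinaryInts {
    public static void main(String[] args) throws IOException {
        // Write a few ints in binary form (numbers.bin is a placeholder name).
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("numbers.bin")))) {
            for (int n : new int[] {1, 2, 3, 55, 22, 11}) {
                out.writeInt(n);
            }
        }
        // Read them back: 4 bytes per int, no String parsing involved.
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream("numbers.bin"), 1 << 20))) {
            while (true) {
                try {
                    System.out.println(in.readInt());
                } catch (EOFException endOfFile) {
                    break; // readInt signals end of file this way
                }
            }
        }
    }
}
```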
How to optimize

You need robust profiling to even detect whether any changes you make impact performance positively or negatively.
Things like OS disk caching will skew your benchmarks: if you read the same file in over and over, the OS will cache it and distort the results. Learn what good enough is sooner rather than later.
The premature part of Knuth's quote is the important part; it means:
Don't optimize without profiling and benchmarks to verify that what you are changing is actually a bottleneck and that you can measure the positive or negative impact of your changes.
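As one illustration (not the benchmark used below), a bare-bones timing harness might look like this sketch; readAllInts is a hypothetical hook for whichever reading strategy is being measured, and a serious measurement would use a harness like JMH instead:

```java
public class Timing {
    // Hypothetical hook: plug in whichever reading strategy is being measured.
    static void readAllInts() {
        // ... read the file here ...
    }

    public static void main(String[] args) {
        // Warm up so the JIT has compiled the hot paths before measuring.
        // Note the OS disk cache caveat above: repeated reads of the same
        // file will mostly measure memory, not disk.
        for (int i = 0; i < 5; i++) {
            readAllInts();
        }
        long start = System.nanoTime();
        readAllInts();
        System.out.printf("read took %.2f ms%n", (System.nanoTime() - start) / 1e6);
    }
}
```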
Here is a quick benchmark comparing a BufferedInputStream reading a set of binary numbers versus a Scanner backed by a BufferedReader reading the same set of numbers as text representations with a SPACE delimiter. The results are pretty consistent:
For 1,000 numbers on my Core i3 laptop with 8GB of RAM
For 1,000,000 numbers on my Core i3 laptop with 8GB of RAM
For 50,000,000 numbers on my Core i3 laptop with 8GB of RAM
File sizes for the 50,000,000 numbers were as follows:
Reading the binary is much faster until the set of numbers grows very large. I/O on binary-encoded ints is lower (by about 10 times), there is no String parsing logic, and none of the other overhead of object creation and whatever else Scanner does. I went ahead and used the Buffered versions of the InputStream and Reader classes, because those are best practices and should be used whenever possible.

For extra credit, compression would reduce the I/O wait even more on large files, with almost no measurable effect on CPU time.
Generally you can read the data as fast as the disk allows. The best way to read it faster is to make it more compact or get a faster disk.
For the format you are using, I would GZip the files and read the compressed data. This is a simple way to increase the rate you can read the underlying data.
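A sketch of what that could look like, assuming the space-separated file has been gzipped to numbers.txt.gz (a placeholder name); the stream is decompressed on the fly, trading a little CPU for less disk I/O:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;

public class GzipRead {
    public static void main(String[] args) throws IOException {
        // GZIPInputStream decompresses as bytes are read, so the Scanner
        // sees the same space-separated text as before.
        try (Scanner scanner = new Scanner(new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("numbers.txt.gz")),
                StandardCharsets.US_ASCII)))) {
            long sum = 0;
            while (scanner.hasNextInt()) {
                sum += scanner.nextInt();
            }
            System.out.println("sum = " + sum);
        }
    }
}
```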
Escalation possibilities:
There is always a tradeoff in gaining more performance/speed. The above methods cost money and have to be applied on every host, so if this is a program sold to multiple customers, it could be a better option to tune the algorithm instead, which saves money on every host the program runs on.
If you compress the file or store binary data, the read speed increases, but it becomes harder to inspect the data with independent tools. Of course, we cannot tell how often that might be needed.
In most circumstances I would suggest keeping the data human readable and living with a slower program, but of course it depends on how much time you lose, how often you lose it, and so on.
And maybe this is just an exercise to find out how fast you can get. But I would warn against the habit of always reaching for the highest performance without considering the tradeoffs and the costs.