What is the fastest way to read integers from a file in Java?

Posted 2024-11-09 04:02:04

I have a file of integers arranged like this:

1 2 3 55 22 11 (and so on)

And I want to read in these numbers as fast as possible to lessen the total execution time of my program. So far, I am using a Scanner with good results. However, I get the feeling that there exists a faster I/O utility I can use. Can anyone please point me in the right direction?

EDIT:

So yes, I verified that it is the I/O in my program that's taking the most time, by setting up different timers around the Java code and comparing the results.

Comments (3)

苄①跕圉湢 2024-11-16 04:02:04

Current file format

If the numbers are represented as Strings, there is no faster way to read them in and parse them; disk I/O is going to be orders of magnitude slower than anything the CPU is doing. The only thing you can do is use a BufferedReader with a huge buffer size and try to get as much of the file (if not all of it) into memory before using Scanner.
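As a rough sketch of that approach (the file name input.txt and the 1 MB buffer size are assumptions for illustration):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Scanner;

    public class BufferedScan {
        public static void main(String[] args) throws IOException {
            // A 1 MB buffer pulls large chunks of the file into memory
            // before Scanner parses them; input.txt is a placeholder name.
            try (Scanner scanner = new Scanner(
                    new BufferedReader(new FileReader("input.txt"), 1 << 20))) {
                long sum = 0;
                while (scanner.hasNextInt()) {
                    sum += scanner.nextInt();
                }
                System.out.println("Sum: " + sum);
            }
        }
    }

Closing the Scanner also closes the underlying BufferedReader, since it implements Closeable.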

Alternate file format

If you can represent the numbers as binary in the file and read them in using the DataInputStream class, then you might get a small decrease in I/O time and a marginal decrease in CPU, because you don't need to parse the String representation into an int. That difference probably would not be measurable unless your input file is in the hundreds of megabytes or larger. Buffering the input stream will still have more effect than anything else; use a BufferedInputStream in this case.
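A minimal sketch of such a binary reader, assuming the file contains raw 4-byte ints written with DataOutputStream.writeInt (the input.dat name matches the benchmark below):

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class BinaryRead {
        public static void main(String[] args) throws IOException {
            // input.dat is assumed to hold raw ints, 4 bytes each,
            // written with DataOutputStream.writeInt.
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream("input.dat"), 1 << 20))) {
                long sum = 0;
                try {
                    while (true) {
                        sum += in.readInt(); // no String parsing needed
                    }
                } catch (EOFException end) {
                    // readInt signals end of stream with EOFException
                }
                System.out.println("Sum: " + sum);
            }
        }
    }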

How to optimize

You need robust profiling to even detect if any changes you make are impacting performance positively or negatively.

Things like OS disk caching will skew benchmarks: if you read the same file in over and over, the OS will cache it and distort your results. Learn what good enough is sooner rather than later.

"We should forget about small
efficiencies, say about 97% of the
time: premature optimization is the
root of all evil" - Donald Knuth

The premature part of Knuth's quote is the important part; it means:

Don't optimize without profiling and benchmarks to verify that what you are changing is actually a bottleneck, and that you can measure the positive or negative impact of your changes.

Here is a quick benchmark comparing a BufferedInputStream reading the same set of binary numbers versus a Scanner backed by a BufferedReader reading the same set of numbers as text representations with a SPACE delimiter.
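The benchmark source didn't survive in this copy of the answer, so here is a minimal sketch of what such a comparison could look like. The file names input.dat and input.txt match the sizes quoted below, but the timing harness is an assumption, and a serious benchmark would add warm-up runs because of the caching effects described above:

    import java.io.BufferedInputStream;
    import java.io.BufferedReader;
    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Scanner;

    public class ReadBenchmark {

        public static void main(String[] args) throws IOException {
            long start = System.nanoTime();
            long binarySum = readBinary("input.dat");
            System.out.printf("Read binary file in %04d ms%n",
                    (System.nanoTime() - start) / 1_000_000);

            start = System.nanoTime();
            long textSum = readText("input.txt");
            System.out.printf("Read text file in   %04d ms%n",
                    (System.nanoTime() - start) / 1_000_000);

            // Sanity check that both paths saw the same data.
            System.out.println("sums equal: " + (binarySum == textSum));
        }

        // Reads raw 4-byte ints until EOFException signals end of stream.
        static long readBinary(String name) throws IOException {
            long sum = 0;
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(name)))) {
                while (true) {
                    sum += in.readInt();
                }
            } catch (EOFException end) {
                return sum;
            }
        }

        // Reads SPACE-delimited text ints through a buffered Scanner.
        static long readText(String name) throws IOException {
            long sum = 0;
            try (Scanner sc = new Scanner(new BufferedReader(new FileReader(name)))) {
                while (sc.hasNextInt()) {
                    sum += sc.nextInt();
                }
            }
            return sum;
        }
    }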

The results are pretty consistent:

For 1,000 numbers on my Core i3 laptop with 8GB of RAM

Read binary file in 0001 ms
Read text file in   0041 ms

For 1,000,000 numbers on my Core i3 laptop with 8GB of RAM

Read binary file in 0603 ms
Read text file in   1509 ms

For 50,000,000 numbers on my Core i3 laptop with 8GB of RAM

Read binary file in 29020 ms
Read text file in   70346 ms

File sizes for the 50,000,000 numbers were as follows:

 48M input.dat
419M input.txt

Reading the binary is much faster, though the gap narrows as the set of numbers grows very large. There is less I/O on binary-encoded ints (by about a factor of 10), no String parsing logic, and none of the object-creation overhead and whatever else Scanner does. I went ahead and used the Buffered versions of the InputStream and Reader classes because those are best practices and should be used whenever possible.

For extra credit, compression would reduce the I/O wait even more on the large files with almost no measurable effect on the CPU time.

慢慢从新开始 2024-11-16 04:02:04

Generally you can read the data as fast as the disk allows. The best way to read it faster is to make it more compact or get a faster disk.

For the format you are using, I would GZip the files and read the compressed data. This is a simple way to increase the rate at which you can read the underlying data.
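A sketch of that idea, assuming the text file has been compressed to input.txt.gz (the name is an assumption): wrap a GZIPInputStream under the reader so decompression happens on the fly, and the disk reads far fewer bytes.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.Scanner;
    import java.util.zip.GZIPInputStream;

    public class GzipScan {
        public static void main(String[] args) throws IOException {
            // GZIPInputStream decompresses as the Scanner pulls data,
            // trading a little CPU for much less disk I/O.
            try (Scanner scanner = new Scanner(new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream("input.txt.gz")))))) {
                long sum = 0;
                while (scanner.hasNextInt()) {
                    sum += scanner.nextInt();
                }
                System.out.println("Sum: " + sum);
            }
        }
    }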

辞旧 2024-11-16 04:02:04

Upgrade possibilities:

  • Buy a faster disk.
  • Buy an SSD drive.
  • Store the file in a ramdisk.

There is always a tradeoff in gaining more performance/speed. The above methods cost money and have to be applied on every host, so if this is a program sold to multiple customers, it could be a better option to tune the algorithm instead, which saves money on every host the program runs on.

If you compress the file or store binary data, read speed increases, but it becomes harder to inspect the data with independent tools. Of course, we cannot tell how often you might need to do that.

In most circumstances I would suggest keeping the data human-readable and living with a slower program, though of course it depends on how much time you lose, how often you lose it, and so on.

And maybe this is just an exercise to find out how fast you can get. Even so, I would warn against the habit of always reaching for the highest performance without considering the tradeoffs and the costs.
