Finding the optimal buffer size for BufferedInputStream in Java
I was profiling my code that was loading a binary file. The load time was something around 15 seconds.
The majority of my load time was coming from the methods that were loading binary data.
I had the following code to create my DataInputStream:
    is = new DataInputStream(
        new GZIPInputStream(
            new FileInputStream("file.bin")));
And I changed it to this:
    is = new DataInputStream(
        new BufferedInputStream(
            new GZIPInputStream(
                new FileInputStream("file.bin"))));
So after I did this small modification the loading code went from 15 seconds to 4.
But then I found that BufferedInputStream has two constructors. The other constructor lets you explicitly define the buffer size.
I've got two questions:
- What size is chosen in BufferedInputStream and is it ideal? If not, how can I find the optimum size for the buffer? Should I write a quick bit of code that does a binary search?
- Is this the best way I can use BufferedInputStream? I originally had it inside the GZIPInputStream, but the benefit was negligible. I'm assuming that what the code does now is this: every time the buffer needs to be filled, the GZIP input stream decodes x bytes (where x is the size of the buffer). Would it be worth omitting the GZIPInputStream entirely? It's definitely not needed, but my file size decreases dramatically when I use it.
Comments (2)
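Both the GZIPInputStream and the BufferedInputStream use an internal buffer. That is why using a BufferedInputStream inside the GZIPInputStream doesn't provide any benefit. The problem with the GZIPInputStream is that it buffers only its compressed input, not the decompressed output it generates, so every small read from the DataInputStream goes straight to the decompressor. Wrapping it in a BufferedInputStream fixes that, which is why your current version is much faster.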
The default buffer size for the BufferedInputStream is 8 KB (8192 bytes), so you can try increasing or decreasing that to see if it helps. I doubt that the exact number matters much, so you can simply multiply or divide by two.
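As a rough illustration of the "multiply or divide by two" approach, the sketch below times the same read loop with a range of buffer sizes. The class name `BufferSizeTiming`, the helper methods, and the generated sample file are all assumptions made for this example; they stand in for the real file and loading code. Timings from a single run like this are noisy, so treat them as a hint rather than a benchmark.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BufferSizeTiming {

    // Writes `count` ints (0, 1, 2, ...) to a gzip-compressed file.
    // Hypothetical test data that stands in for the real "file.bin".
    static void writeSample(File file, int count) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(
                        new GZIPOutputStream(new FileOutputStream(file))))) {
            for (int i = 0; i < count; i++) {
                out.writeInt(i);
            }
        }
    }

    // Reads `count` ints through a BufferedInputStream of the given size
    // and returns their sum, so the result of the reads is actually used.
    static long sumInts(File file, int count, int bufferSize) throws IOException {
        long sum = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(
                        new GZIPInputStream(new FileInputStream(file)), bufferSize))) {
            for (int i = 0; i < count; i++) {
                sum += in.readInt();
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("sample", ".bin.gz");
        file.deleteOnExit();
        int count = 2_000_000;
        writeSample(file, count);

        // Double the buffer size at each step, from 1 KB to 128 KB.
        for (int size = 1 << 10; size <= 1 << 17; size <<= 1) {
            long start = System.nanoTime();
            long sum = sumInts(file, count, size);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("buffer %7d bytes: %5d ms (sum %d)%n", size, millis, sum);
        }
    }
}
```

Running the loop a few times and ignoring the first pass (JIT warm-up, cold file cache) should give a fairer comparison.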
If the file is small, you can also try to buffer it completely. This should give you the best performance in theory. You could also try to increase the buffer size of the GZIPInputStream (by default 512 bytes), since this might speed up reading from disk.
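Both suggestions can be sketched as follows. `GZIPInputStream` has a two-argument constructor that takes the input buffer size; the class name `GzipBuffering`, the method names, and the 64 KB values are assumptions chosen for illustration, not recommendations from the answer.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBuffering {

    // Same stream chain as in the question, but with an explicit input
    // buffer size for GZIPInputStream instead of the 512-byte default.
    static DataInputStream openWithGzipBuffer(File file, int gzipBufferSize) throws IOException {
        return new DataInputStream(
                new BufferedInputStream(
                        new GZIPInputStream(new FileInputStream(file), gzipBufferSize)));
    }

    // Decompresses the whole file into memory first, then reads from the
    // in-memory copy. Only sensible when the decompressed data fits in the heap.
    static DataInputStream openFullyBuffered(File file) throws IOException {
        ByteArrayOutputStream decompressed = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new FileInputStream(file), 64 * 1024)) {
            byte[] chunk = new byte[64 * 1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                decompressed.write(chunk, 0, n);
            }
        }
        return new DataInputStream(new ByteArrayInputStream(decompressed.toByteArray()));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file standing in for "file.bin".
        File file = File.createTempFile("sample", ".bin.gz");
        file.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new FileOutputStream(file)))) {
            out.writeInt(42);
        }

        try (DataInputStream in = openWithGzipBuffer(file, 64 * 1024)) {
            System.out.println("larger gzip buffer: " + in.readInt());
        }
        try (DataInputStream in = openFullyBuffered(file)) {
            System.out.println("fully buffered: " + in.readInt());
        }
    }
}
```

Whether either variant is faster than the default chain depends on the file and the machine, so it is worth timing them against the real data.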
Don't bother with a coded binary search. Just try some values by hand and compare the timings (you can do a manual binary search if you like). You'll most likely find that a very wide range of buffer sizes gives close-to-optimal performance, so pick the smallest that does the trick.
What you have is the correct order:
There is little point in putting a BufferedInputStream inside the GZIPInputStream, since the latter already buffers its input (but not its output).
Removing GZIPInputStream might be a win, but will most likely be detrimental to performance if the data has to be read from disk and is not resident in the filesystem cache. The reason is that reading from disk is very slow and decompressing gzip is very fast. Therefore it is generally cheaper to read less data from disk and decompress it in memory than it is to read more data from disk.
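One way to see the "buffers its input but not the output" point is to count how many read calls actually reach the GZIPInputStream with and without a BufferedInputStream in front of it. The sketch below is only an illustration: `UnderlyingReadCount` and `CountingInputStream` are hypothetical helpers written for this example, and the data lives in memory rather than on disk. The exact counts depend on the JDK version, since DataInputStream has not always implemented readInt the same way, but the unbuffered count should be far higher.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class UnderlyingReadCount {

    // Counts every read call that reaches the wrapped stream.
    static class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Builds gzip-compressed test data containing `count` ints.
    static byte[] gzipInts(int count) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new GZIPOutputStream(bytes)))) {
            for (int i = 0; i < count; i++) {
                out.writeInt(i);
            }
        }
        return bytes.toByteArray();
    }

    // Reads `count` ints and reports how many read calls hit the gzip layer.
    static long countCalls(byte[] gzipped, int count, boolean buffered) throws IOException {
        CountingInputStream counting = new CountingInputStream(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)));
        InputStream source = buffered ? new BufferedInputStream(counting) : counting;
        try (DataInputStream in = new DataInputStream(source)) {
            for (int i = 0; i < count; i++) {
                in.readInt();
            }
        }
        return counting.calls;
    }

    public static void main(String[] args) throws IOException {
        int count = 100_000;
        byte[] data = gzipInts(count);
        System.out.println("reads reaching gzip, unbuffered: " + countCalls(data, count, false));
        System.out.println("reads reaching gzip, buffered:   " + countCalls(data, count, true));
    }
}
```

Each call that reaches the gzip layer invokes the decompressor for only a few bytes, which is a plausible explanation for the drop from 15 seconds to 4 described in the question.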