What buffer size should I use when creating a .zip archive in Java?
I use this code to create a .zip with a list of files:
ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile));
for (int i = 0; i < srcFiles.length; i++) {
    String fileName = srcFiles[i].getName();
    ZipEntry zipEntry = new ZipEntry(fileName);
    zos.putNextEntry(zipEntry);
    InputStream fis = new FileInputStream(srcFiles[i]);
    byte[] buffer = new byte[1024]; // the buffer size in question
    int read;
    while ((read = fis.read(buffer)) > 0) {
        zos.write(buffer, 0, read);
    }
    fis.close();
    zos.closeEntry();
}
zos.close();
I don't know how the zip algorithm and ZipOutputStream work. If the stream writes something out before I have read and sent all of the data to 'zos', the resulting file could differ in size depending on which buffer size I choose.
In other words, I don't know whether the algorithm works like:
READ DATA-->PROCESS DATA-->CREATE .ZIP
or
READ CHUNK OF DATA --> PROCESS CHUNK OF DATA --> WRITE CHUNK TO .ZIP --> (loop back to READ)
If that is the case, what buffer size is best?
Update:
I have tested this code, changing the buffer size from 1024 to 64 bytes and zipping the same files: with the 1024-byte buffer, the 80 KB result file was 3 bytes smaller than with the 64-byte buffer. Which buffer size produces the smallest .zip in the fastest time?
2 Answers
Short answer: I would pick something like 16k.
Long answer:
ZIP uses the DEFLATE algorithm for compression (http://en.wikipedia.org/wiki/DEFLATE). DEFLATE belongs to the Lempel-Ziv family of dictionary coders: it combines LZ77 with Huffman coding.
This is dictionary compression, and as far as I know, from the algorithm's standpoint the buffer size used when feeding data into the deflater should have almost no impact. What matters most for LZ77 is the dictionary size and the sliding window, and neither is controlled by the buffer size in your example.
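For instance, the knobs that actually change the compressed output live on ZipOutputStream itself, not on your copy loop. A minimal sketch (the openZip name is just illustrative; zipFile is the same variable as in the question):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.ZipOutputStream;

// Opens a zip stream with maximum compression. The size of the copy buffer
// used later to feed data in has no effect on this setting.
static ZipOutputStream openZip(File zipFile) throws IOException {
    ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile));
    zos.setLevel(Deflater.BEST_COMPRESSION); // 0-9: trades CPU time for ratio
    return zos;
}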
I think you can experiment with different buffer sizes and plot a graph if you want, but I am sure you will not see any significant change in compression ratio (your measured difference is 3/80000 = 0.00375%).
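If you do want to measure it, a small harness along these lines would do. This is only a sketch: the test file name input.bin and the list of sizes are placeholders.

import java.io.*;
import java.util.zip.*;

public class BufferSizeBenchmark {
    public static void main(String[] args) throws IOException {
        int[] sizes = {64, 1024, 4096, 16384, 65536};
        File src = new File("input.bin"); // hypothetical test file
        for (int size : sizes) {
            long start = System.nanoTime();
            File zip = File.createTempFile("bench", ".zip");
            try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zip));
                 InputStream fis = new FileInputStream(src)) {
                zos.putNextEntry(new ZipEntry(src.getName()));
                byte[] buffer = new byte[size]; // the variable under test
                int read;
                while ((read = fis.read(buffer)) != -1) {
                    zos.write(buffer, 0, read);
                }
                zos.closeEntry();
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("buffer=%6d B  time=%8.2f ms  zip=%d B%n",
                    size, elapsed / 1e6, zip.length());
            zip.delete();
        }
    }
}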
The biggest impact the buffer size has is on speed, because of the overhead code executed each time you call FileInputStream.read and zos.write. From this point of view you should weigh what you gain against what you spend.
When increasing from 1 byte to 1024 bytes, you spend 1023 bytes (in theory) and gain a roughly 1024-fold reduction in the overhead time spent in the .read and .write methods. However, when increasing from 1 KB to 64 KB, you spend 63 KB more while reducing the overhead only 64-fold.
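To make that concrete with the 80 KB file from the question: the number of read/write call pairs is roughly file size / buffer size, so a 1-byte buffer costs about 80,000 calls, a 1 KB buffer about 80, and a 64 KB buffer about 2. The first jump eliminates ~79,920 calls; the second eliminates only ~78.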
So this comes with diminishing returns, thus I would choose somewhere in the middle (let's say 16k) and stick with that.
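Putting that together, the question's loop with a 16 KB buffer might look like the sketch below (try-with-resources added so the streams are closed even when an exception is thrown; zipFiles is just an illustrative name):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Same logic as the question's code, but with a 16 KB buffer allocated once
// and reused for every file.
static void zipFiles(File[] srcFiles, File zipFile) throws IOException {
    byte[] buffer = new byte[16 * 1024];
    try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile))) {
        for (File src : srcFiles) {
            zos.putNextEntry(new ZipEntry(src.getName()));
            try (InputStream fis = new FileInputStream(src)) {
                int read;
                while ((read = fis.read(buffer)) != -1) {
                    zos.write(buffer, 0, read);
                }
            }
            zos.closeEntry();
        }
    }
}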
It depends on the hardware you have (disk speed and seek time). I would say that if you are not interested in squeezing out the last drop of performance, pick any size between 4 KB and 64 KB. Since the buffer is a short-lived object, it will be collected quickly anyway.