What buffer size should I use when creating a .zip archive in Java?

Asked 2024-07-08 07:55:52


I use this code to create a .zip with a list of files:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile));

for (int i = 0; i < srcFiles.length; i++) {
    String fileName = srcFiles[i].getName();
    ZipEntry zipEntry = new ZipEntry(fileName);
    zos.putNextEntry(zipEntry);
    InputStream fis = new FileInputStream(srcFiles[i]);
    int read;
    // copy the file into the current zip entry, one buffer-sized chunk at a time
    for (byte[] buffer = new byte[1024]; (read = fis.read(buffer)) > 0; ) {
        zos.write(buffer, 0, read);
    }
    fis.close();
    zos.closeEntry();
}
zos.close();

I don't know how the zip algorithm and ZipOutputStream work. If the stream writes some output before I have read and sent all of the data to 'zos', the size of the result file could differ from what I would get with another buffer size.

In other words, I don't know whether the algorithm works like:

READ DATA --> PROCESS DATA --> CREATE .ZIP

or

READ CHUNK OF DATA --> PROCESS CHUNK OF DATA --> WRITE CHUNK TO .ZIP
        ^                                                 |
        +-------------------------------------------------+

If this is the case, what buffer size is the best?

Update:

I have tested this code, changing the buffer size from 1024 to 64 bytes while zipping the same files: with a 1024-byte buffer the 80 KB result file was 3 bytes smaller than with a 64-byte buffer. What is the best buffer size for producing the smallest .zip in the fastest time?
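One way to settle this empirically is to time the same zip job across several buffer sizes. Below is a minimal sketch of such a harness; the class name, the "inputDir" directory, and the list of sizes are all placeholders, not from the original code:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class BufferSizeBenchmark {

    // Zips srcFiles into zipFile using the given copy-buffer size; returns the archive size.
    static long zipWith(int bufferSize, File[] srcFiles, File zipFile) throws IOException {
        byte[] buffer = new byte[bufferSize]; // allocated once, reused for every file
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile))) {
            for (File src : srcFiles) {
                zos.putNextEntry(new ZipEntry(src.getName()));
                try (InputStream fis = new FileInputStream(src)) {
                    int read;
                    while ((read = fis.read(buffer)) > 0) {
                        zos.write(buffer, 0, read);
                    }
                }
                zos.closeEntry();
            }
        }
        return zipFile.length();
    }

    public static void main(String[] args) throws IOException {
        File[] srcFiles = new File("inputDir").listFiles(); // placeholder input directory
        for (int size : new int[] {64, 1024, 4096, 16384, 65536}) {
            File out = new File("test-" + size + ".zip");
            long start = System.nanoTime();
            long bytes = zipWith(size, srcFiles, out);
            System.out.printf("buffer=%6d -> %d bytes in %d ms%n",
                    size, bytes, (System.nanoTime() - start) / 1_000_000);
        }
    }
}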

Comments (2)

美人骨 2024-07-15 07:55:52


Short answer: I would pick something like 16k.


Long answer:

ZIP uses the DEFLATE algorithm for compression (http://en.wikipedia.org/wiki/DEFLATE). DEFLATE belongs to the Lempel-Ziv family of dictionary coders: it combines LZ77 with Huffman coding. (It is not LZW, which is a different Lempel-Ziv variant.)

This is dictionary-based compression, and as far as I know, from the algorithm's standpoint the buffer size used when feeding data into the deflater should have almost no impact. What matters most for LZ77 is the size of the dictionary and sliding window (fixed at 32 KB in DEFLATE), and neither is controlled by the buffer size in your example.
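By contrast, the knob that does change the output size meaningfully is the compression level handed to the deflater, which ZipOutputStream exposes via setLevel. A minimal sketch of my own (the file and entry names are placeholders):

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class CompressionLevelDemo {
    public static void main(String[] args) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("out.zip"))) {
            // Level 9 is the slowest but smallest; the default is Deflater.DEFAULT_COMPRESSION.
            zos.setLevel(Deflater.BEST_COMPRESSION);
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("hello, deflate".getBytes());
            zos.closeEntry();
        }
    }
}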

I think you can experiment with different buffer sizes if you want and plot a graph, but I am sure you will not see any significant changes in compression ratio (3/80000 = 0.00375%).

The biggest impact the buffer size has is on speed, because of the overhead code executed on every call to FileInputStream.read and zos.write. From that point of view you should weigh what you gain against what you spend.

When increasing from 1 byte to 1024 bytes, you spend 1023 extra bytes (in theory) and cut the per-call overhead of .read and .write by a factor of roughly 1024.
However, when increasing from 1 KB to 64 KB, you spend an extra 63 KB while cutting the overhead only 64-fold.

So the returns diminish, which is why I would choose somewhere in the middle (say, 16k) and stick with that.
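If you would rather not pick a number at all, a sketch of an alternative (not from the original answer): Files.copy streams a file into any OutputStream using the JDK's own internal buffer (8 KB in the OpenJDK sources I have seen), which already falls in that middle range:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipWithFilesCopy {
    static void zip(Path[] sources, Path zipFile) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            for (Path src : sources) {
                zos.putNextEntry(new ZipEntry(src.getFileName().toString()));
                Files.copy(src, zos); // streams the whole file into the current entry
                zos.closeEntry();
            }
        }
    }
}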

自在安然 2024-07-15 07:55:52


It depends on the hardware you have (disk speed and seek time). I would say that if you are not interested in squeezing out the last drop of performance, pick any size between 4k and 64k. Since the buffer is a short-lived object, it will be collected quickly anyway.
