java.util.zip - ZipInputStream vs. ZipFile
I have some general questions regarding the java.util.zip library.
What we basically do is an import and an export of many small components. Previously these components were imported and exported using a single big file, e.g.:
<component-type-a id="1"/>
<component-type-a id="2"/>
<component-type-a id="N"/>
<component-type-b id="1"/>
<component-type-b id="2"/>
<component-type-b id="N"/>
Please note that the order of the components during import is relevant.
Now every component should occupy its own file, which should be externally versioned, QA-ed, and so on. We decided that the output of our export should be a zip file (with all these files in it) and the input of our import should be a similar zip file. We do not want to explode the zip in our system. We do not want to open separate streams for each of the small files. My current questions:
Q1. Does ZipInputStream guarantee that the zip entries (the little files) will be read in the same order in which they were inserted by our export, which uses ZipOutputStream? I assume reading is something like:
ZipInputStream zis = new ZipInputStream(new BufferedInputStream(fis));
ZipEntry entry;
while((entry = zis.getNextEntry()) != null)
{
//read from zis until available
}
I know that the central zip directory is put at the end of the zip file but nevertheless the file entries inside have sequential order. I also know that relying on the order is an ugly idea but I just want to have all the facts in mind.
Q2. If I use ZipFile (which I prefer), what is the performance impact of calling getInputStream() hundreds of times? Will it be much slower than the ZipInputStream solution? The zip is opened only once and ZipFile is backed by RandomAccessFile - is this correct?
I assume reading is something like:
ZipFile zipfile = new ZipFile(argv[0]);
Enumeration<? extends ZipEntry> e = zipfile.entries(); // TODO: assure the order of the entries
while (e.hasMoreElements()) {
    ZipEntry entry = e.nextElement();
    InputStream is = zipfile.getInputStream(entry);
}
Q3. Are the input streams retrieved from the same ZipFile thread safe (e.g. may I read different entries in different threads simultaneously)? Any performance penalties?
Thanks for your answers!
Comments (4)
Q1: Yes, the order will be the same as the order in which the entries were added.
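One way to check this claim on your own JDK is a small round trip. The sketch below is only an illustration: the class name ZipOrderDemo, its helper methods, and the entry names are made up for this example. It writes a few entries with ZipOutputStream in a deliberately non-alphabetical order and reads the names back with ZipInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipOrderDemo {

    // Writes one entry per name, in the given order, into an in-memory zip.
    static byte[] writeZip(List<String> names) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(("<component file=\"" + name + "\"/>").getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the central directory has been written.
        return buffer.toByteArray();
    }

    // Returns the entry names in the order ZipInputStream encounters them.
    static List<String> readNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical entry names, intentionally not sorted.
        List<String> written = Arrays.asList("type-b/2.xml", "type-a/1.xml", "type-b/1.xml", "type-a/2.xml");
        List<String> read = readNames(writeZip(written));
        System.out.println(read.equals(written) ? "order preserved" : "order changed: " + read);
    }
}
```

Note that this only covers archives produced by ZipOutputStream. A zip that has been rewritten by another tool may have its entries in a different physical order.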
Q2: Note that due to the structure of zip archive files, and compression, neither solution is exactly streaming; they both do some level of buffering. And if you check out the JDK sources, the implementations share most of their code. There is no real random access within the content, although the index does allow finding the chunks that correspond to entries. So I think there should not be meaningful performance differences, especially as the OS will cache disk blocks anyway. You may want to just test performance with a simple test case to verify this.
Q3: I would not count on this; most likely they aren't. If you really think concurrent access would help (mostly because decompression is CPU-bound, so it might), I'd try reading the whole file into memory, exposing it via ByteArrayInputStream, and constructing multiple independent readers.
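The in-memory approach for Q3 could look roughly like the sketch below. Everything named here (InMemoryZipReaders, buildZip, readEntry, readAll, the entry names) is hypothetical and only for illustration; in a real import the byte array would come from reading the zip file once, for example with Files.readAllBytes. Each task opens its own ZipInputStream over the shared, read-only byte array, so no stream object is shared between threads.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class InMemoryZipReaders {

    // Hypothetical helper: builds a small in-memory zip so the sketch is self-contained.
    static byte[] buildZip(List<String> names) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Scans a private ZipInputStream over the shared bytes until the named entry is found.
    // Returns null if no entry has that name.
    static byte[] readEntry(byte[] zipBytes, String name) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = zis.read(chunk)) != -1) {
                        out.write(chunk, 0, n);
                    }
                    return out.toByteArray();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the named entries on a thread pool; results are returned in request order.
    static List<byte[]> readAll(byte[] zipBytes, List<String> names, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (String name : names) {
                Callable<byte[]> task = () -> readEntry(zipBytes, name);
                futures.add(pool.submit(task));
            }
            List<byte[]> results = new ArrayList<>();
            for (Future<byte[]> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.xml", "b.xml", "c.xml");
        List<byte[]> contents = readAll(buildZip(names), names, 3);
        for (int i = 0; i < names.size(); i++) {
            System.out.println(names.get(i) + " -> " + new String(contents.get(i), StandardCharsets.UTF_8));
        }
    }
}
```

The trade-off in this sketch is that every reader scans from the start of the archive to reach its entry, so it favors simplicity over speed. Whether it helps at all should be measured, and the zlib caveat raised in another comment may still apply.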
I measured that just listing the files with ZipInputStream is 8 times slower than with ZipFile.
(Don't run the two measurements in the same class. Make two different classes and run them separately.)
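The code used for that measurement is not shown here, so the following is only a sketch of how such a comparison might be set up, not the original benchmark. The class name ZipListing, its methods, and the generated sample archive are assumptions made for illustration. To stay close to the advice above, main runs only one variant per JVM launch, selected by the first argument ("stream" or "file"); an optional second argument is the path to a zip.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipListing {

    // Lists entry names by walking the archive sequentially.
    static List<String> listWithZipInputStream(File zip) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new BufferedInputStream(new FileInputStream(zip)))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Lists entry names through ZipFile, which reads the central directory.
    static List<String> listWithZipFile(File zip) {
        List<String> names = new ArrayList<>();
        try (ZipFile zipFile = new ZipFile(zip)) {
            Enumeration<? extends ZipEntry> entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Hypothetical helper: creates a temporary zip with many small entries to list.
    static File createSampleZip(int entryCount) {
        try {
            File zip = File.createTempFile("listing-demo", ".zip");
            zip.deleteOnExit();
            try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zip))) {
                for (int i = 0; i < entryCount; i++) {
                    zos.putNextEntry(new ZipEntry("component-" + i + ".xml"));
                    zos.write(("<component id=\"" + i + "\"/>").getBytes(StandardCharsets.UTF_8));
                    zos.closeEntry();
                }
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String mode = args.length > 0 ? args[0] : "file";
        File zip = args.length > 1 ? new File(args[1]) : createSampleZip(1000);

        long start = System.nanoTime();
        List<String> names = mode.equals("stream") ? listWithZipInputStream(zip) : listWithZipFile(zip);
        long micros = (System.nanoTime() - start) / 1_000;

        System.out.println(mode + ": " + names.size() + " entries in " + micros + " us");
    }
}
```

A single timed run like this is noisy (JIT warm-up, disk cache), so treat the numbers as a rough indication only and repeat the run several times with an archive that resembles your real data.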
Regarding Q3, experience in JENKINS-14362 suggests that zlib is not thread-safe even when operating on unrelated streams, i.e. that it has some improperly shared static state. Not proven, just a warning.
Using ZipFile.getInputStream() is significantly faster than using new ZipInputStream(). Just try it yourself.