Why does my java.util.zip code behave inconsistently?
I have a Java application that uses the java.util.zip library to compress and decompress files. A zip file (created by my application) sits on the server, and the client zips some of its files and uploads the result. If the underlying files have not changed, I don't want to waste time uploading. I figured I could calculate MD5 hashes of the client-side and server-side zip files and check whether they match. However, when I use my application to decompress a zip file and then re-compress it without changing any of the underlying files, the old and new zip files have different MD5 hashes. Does anybody know why this happens, and is there a better way to compare two zip files? Thanks.
Comments (4)
It's even worse, I think:
Doing the same zip operation twice can result in two different zip archives:
Looking at the binaries, one can see a difference in just one place:
I think (depending on the zip application itself) the current system time can be, or will be, involved. Thus any zip operation, even on exactly the same sources, can(!) be unique, and therefore the checksums can't be assumed to be equal.
Time-independent tools I found: tar, 7z (both command-line).
That is, tar and 7z reproduce archives with equal checksums (MD5).
(Tested on OS X 10.6.8 with the command-line zip utility.)
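If the timestamp really is the culprit, one way to get repeatable archives out of java.util.zip might be to pin each entry's modification time yourself and write the entries in a fixed order. The following is only a minimal sketch under those assumptions: the class name `ReproducibleZip`, the `zip` helper, and the constant `FIXED_TIME_MILLIS` are made up for illustration, and the files are held in memory to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of java.util.zip.
class ReproducibleZip {
    // Arbitrary constant (assumption): 2000-01-01T00:00:00Z in epoch milliseconds.
    static final long FIXED_TIME_MILLIS = 946_684_800_000L;

    // Zips the given name -> content map with sorted entry names and a fixed timestamp.
    static byte[] zip(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            // TreeMap gives a stable entry order regardless of the input map type.
            for (Map.Entry<String, byte[]> file : new TreeMap<>(files).entrySet()) {
                ZipEntry entry = new ZipEntry(file.getKey());
                // Without this call, ZipOutputStream stamps the entry with the current time.
                entry.setTime(FIXED_TIME_MILLIS);
                out.putNextEntry(entry);
                out.write(file.getValue());
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new TreeMap<>();
        files.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "world".getBytes(StandardCharsets.UTF_8));
        byte[] first = zip(files);
        byte[] second = zip(files);
        System.out.println("identical bytes: " + Arrays.equals(first, second));
    }
}
```

One caveat: `ZipEntry.setTime` converts the value using the JVM's default time zone, so client and server would need the same zone for the bytes to match (on Java 9 and later, `setTimeLocal` avoids that conversion). The same Java version and compression level on both sides are also assumed.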
1) Check the timestamps on the files. The files created by unzipping might have a different last-modified date and/or creation date. That file metadata might end up in the data being hashed.
2) Are you using the same OS on both systems? If the OSes are different, they might be using different character encodings.
3) Can you diff the zip files? Different MD5 hashes should mean different data. It will be messy, but you might get some clues by comparing the raw files.
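Instead of diffing or hashing the raw archives, it may be easier to compare what is inside them. The sketch below is only an illustration: the names `ZipContentCompare`, `entryChecksums`, `sameContents`, and `zipOf` are invented here, and CRC-32 is used just because it ships with java.util.zip (a stronger digest such as MD5 could be swapped in). It reads every entry, checksums the decompressed bytes, and compares the resulting name-to-checksum maps, so timestamps and compression settings no longer matter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of java.util.zip.
class ZipContentCompare {
    // Maps each entry name to the CRC-32 of its decompressed content.
    static Map<String, Long> entryChecksums(byte[] zipBytes) {
        Map<String, Long> checksums = new TreeMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] chunk = new byte[8192];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                CRC32 crc = new CRC32();
                int n;
                while ((n = in.read(chunk)) != -1) {
                    crc.update(chunk, 0, n);
                }
                checksums.put(entry.getName(), crc.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return checksums;
    }

    // True when both archives hold the same entry names with the same content checksums.
    static boolean sameContents(byte[] zipA, byte[] zipB) {
        return entryChecksums(zipA).equals(entryChecksums(zipB));
    }

    // Demo helper: a one-entry archive with a caller-chosen timestamp.
    static byte[] zipOf(String name, String text, long timeMillis) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            ZipEntry entry = new ZipEntry(name);
            entry.setTime(timeMillis);
            out.putNextEntry(entry);
            out.write(text.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] monday = zipOf("a.txt", "hello", 946_684_800_000L);
        byte[] tuesday = zipOf("a.txt", "hello", 946_771_200_000L);
        System.out.println("same contents: " + sameContents(monday, tuesday));
    }
}
```

For the upload scenario in the question, the client could presumably send just the name-to-checksum map and let the server decide whether an upload is needed at all.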
Just a wild shot in the dark: do the two file systems you are calculating your hash values on differ in case sensitivity?
That is, is one of them Windows, which treats the file names ABC.CLASS and abc.class as identical, and the other a Unix variant, which treats ABC.CLASS and abc.class as different?
Just a wild guess...
EDIT: You might also look at the embedded directory separator characters / \ . or : inside the zip file.
You cannot compare the resulting zip files from differing zip programs and expect them to be exactly the same, even if the exact same files were used before compression.
Zipping a file is not guaranteed to be deterministic across two different implementations of the zip encoding. Zip works by replacing repeated sections of data with what amounts to a lookup key. Two different algorithms can determine the dictionary (the set of repeated data) differently in an effort to optimize the compression level. Yet both implementations can create valid zip files that, when unzipped, result in the same files.
The only reliable way to do this would be to guarantee that the exact same zip algorithm is being used in both cases.
EDIT: This is why you see different compression level settings in the Java implementation of the Deflate algorithm: http://download.oracle.com/javase/1.5.0/docs/api/java/util/zip/Deflater.html
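To make the point about compression levels concrete, here is a rough sketch that deflates the same input at two levels and then inflates both results. The class `DeflateLevels` and its methods are illustrative names only, and the sketch uses the zlib-wrapped stream that `Deflater` produces by default rather than a full zip archive. The compressed bytes are free to differ between levels, while the decompressed output should come back the same.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class, not part of java.util.zip.
class DeflateLevels {
    // Compresses the input at the given Deflater level (0 to 9).
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses a stream produced by deflate() above.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            text.append("the quick brown fox jumps over the lazy dog ").append(i % 13).append('\n');
        }
        byte[] input = text.toString().getBytes(StandardCharsets.UTF_8);
        byte[] fast = deflate(input, Deflater.BEST_SPEED);
        byte[] best = deflate(input, Deflater.BEST_COMPRESSION);
        System.out.println("compressed sizes: " + fast.length + " vs " + best.length);
        System.out.println("compressed bytes equal: " + Arrays.equals(fast, best));
        System.out.println("decompressed equal: " + Arrays.equals(inflate(fast), inflate(best)));
    }
}
```

So a hash of the compressed bytes only tells you something if both sides are known to use the same implementation and settings; a hash of the decompressed content does not have that dependency.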