I want to tell whether two tarball files contain identical files, in terms of file names and file contents, ignoring metadata such as date, user, and group.
However, there are some restrictions:
First, I have no control over whether metadata is included when the tar file is made. In fact, a tar file always contains metadata, so directly diffing the two tar files doesn't work.
Second, some of the tar files are so large that I cannot afford to untar them into a temp directory and diff the contained files one by one. (I know that if I could untar file1.tar into file1/, I could compare them by invoking 'tar -dvf file2.tar' in file1/. But usually I cannot afford to untar even one of them.)
Any idea how I can compare the two tar files? It would be better if it could be done within shell scripts. Alternatively, is there any way to get each sub-file's checksum without actually untarring a tarball?
Thanks,
Here is my variant, which also checks the Unix permissions:
It works only if the filenames are shorter than 200 characters.
EDIT: See the comment by @StéphaneGourichon.
I realise that this is a late reply, but I came across this thread while attempting to achieve the same thing. The solution that I implemented outputs the tar to stdout and pipes it to whichever hash you choose:
Note that the order of the arguments is important, particularly O, which signals to use stdout.
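A minimal sketch of this idea, assuming GNU tar and coreutils sha256sum (any other hash tool could be swapped in). The function name tar_content_hash is made up for illustration and is not taken from the answer.

```shell
#!/bin/sh
# Hash the concatenated data of all archive members, skipping the tar headers
# (and with them the dates, owners and modes).
#   -x  extract
#   -O  write member data to stdout instead of to disk
#   -f  takes the archive name, so it must be the last letter in the bundle
tar_content_hash() {
    tar -xOf "$1" | sha256sum | cut -d' ' -f1
}
```

Usage would look like `[ "$(tar_content_hash a.tar)" = "$(tar_content_hash b.tar)" ]`. Only file data reaches the hash, so file names are not compared, and the same files stored in a different order can produce a different hash.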
Is tardiff what you're looking for? It's "a simple perl script" that "compares the contents of two tarballs and reports on any differences found between them."
There is also diffoscope, which is more generic and can compare things recursively (including various formats).
I propose gtarsum, which I wrote in Go, so it is a standalone executable (no Python or other execution environment needed).
It will read a tar file, and:
The result is a "global hash" for a tar file, based on the list of files and their content.
It can compare multiple tar files, returning 0 if they are identical and 1 if they are not.
Just throwing this out there, since none of the above solutions worked for what I needed.
This function gets the md5 hash of the md5 hashes of all the file paths matching a given path. If the hashes are the same, the file hierarchy and file lists are the same.
I know it's not as performant as the others, but it provides the certainty I needed.
*Note: an invalid path simply returns nothing.
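One possible reading of this "hash of hashes" idea, sketched under assumptions: the path is taken to be a directory on disk (so the archive has already been extracted somewhere), md5sum and find are available, and the name tree_hash is hypothetical.

```shell
#!/bin/sh
# Print one md5 that summarises every regular file below a directory:
# md5sum each file, sort the "hash  path" lines by path so the order is stable,
# then md5 the whole list. Paths are relative to the directory, so two trees
# in different locations can be compared.
tree_hash() {
    [ -d "$1" ] || return 0   # invalid path: print nothing
    ( cd "$1" && find . -type f -exec md5sum {} + \
        | LC_ALL=C sort -k2 | md5sum | cut -d' ' -f1 )
}
```

Two trees with the same relative paths and the same file contents give the same value; a renamed, added, removed, or modified file changes it.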
If you need neither to extract the archives nor to see the differences, try diff's -q option:
diff -q 1.tar 2.tar
The quiet result will be "Files 1.tar and 2.tar differ", or nothing if there are no differences.
There is a tool called archdiff. It is basically a Perl script that can look into the archives.
I had a similar question and solved it with Python; here is the code.
P.S. Although this code compares the contents of two zipballs, the approach is similar for tarballs. I hope it helps.
Also try pkgdiff to visualize differences between packages (it detects added/removed/renamed files and changed content, and exits with code zero if nothing changed):
Are you controlling the creation of these tar files?
If so, the best trick would be to create an MD5 checksum and store it in a file within the archive itself. Then, when you want to compare two archives, you just extract these checksum files and compare them.
If you can afford to extract just one tar file, you can use the --diff option of tar to look for differences with the contents of the other tar file.
One more crude trick, if you are fine with just a comparison of the filenames and their sizes.
Remember, this does not guarantee that the files are otherwise the same!
Execute tar tvf to list the contents of each archive and store the outputs in two different files. Then slice out everything besides the filename and size columns. Preferably sort the two files too. Then just do a file diff between the two lists.
Just remember that this last scheme does not really do a checksum.
Sample tar and output (all files are zero size in this example).
Command to generate sorted name/size list
You can take two such sorted lists and diff them.
You can also use the date and time columns if that works for you.
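The name/size listing could be produced along these lines. This is a sketch that assumes the GNU tar verbose layout (permissions, owner/group, size, date, time, name) and owner or group names without spaces; the function name tar_name_sizes is hypothetical.

```shell
#!/bin/sh
# Print "<size><TAB><name>" for every archive member, sorted by name.
# In GNU tar's -tv output the size is field 3 and the name starts at field 6,
# so fields 1-5 are blanked and the remainder is kept as the name.
tar_name_sizes() {
    tar -tvf "$1" \
        | awk '{ size = $3; $1 = $2 = $3 = $4 = $5 = ""; sub(/^ +/, ""); print size "\t" $0 }' \
        | LC_ALL=C sort -t "$(printf '\t')" -k2
}
```

Saving `tar_name_sizes a.tar` and `tar_name_sizes b.tar` to two files and running diff on them gives the comparison described above. Runs of spaces inside a file name are collapsed by awk, and no file contents are read, so equal lists do not prove equal data.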
tarsum is almost what you need. Take its output, run it through sort to get the ordering identical on each, and then compare the two with diff. That should get you a basic implementation going, and it would be easy enough to pull those steps into the main program by modifying the Python code to do the whole job.
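The same checksum, sort, diff pipeline can be sketched in plain shell without tarsum. This assumes GNU tar, whose --to-command option pipes each member to a command and exports TAR_FILENAME to it, plus coreutils sha256sum. The names tar_sums and tar_same are made up for illustration, and to my knowledge only regular files are passed to the command, so directories and symlinks are not compared.

```shell
#!/bin/sh
# Print "<sha256>  <path>" for each regular file in the archive, sorted by path.
# Nothing is written to disk: tar streams every member into the command.
tar_sums() {
    tar -xf "$1" --to-command='printf "%s  %s\n" "$(sha256sum | cut -d" " -f1)" "$TAR_FILENAME"' \
        | LC_ALL=C sort -k2
}

# Return 0 if both archives hold the same file names with the same contents;
# otherwise print the diff of the two checksum lists and return non-zero.
tar_same() {
    list_a=$(mktemp) && list_b=$(mktemp) || return 2
    tar_sums "$1" > "$list_a"
    tar_sums "$2" > "$list_b"
    diff "$list_a" "$list_b"
    status=$?
    rm -f "$list_a" "$list_b"
    return "$status"
}
```

A call such as `tar_same file1.tar file2.tar && echo identical` needs only enough temporary space for the two checksum lists. GNU tar also exports TAR_MODE to the command, so adding it to the printf would make the comparison sensitive to permissions as well.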