How to compare the contents of two tarballs

Posted on 2024-07-24 14:22:47

I want to tell whether two tarballs contain identical files, in terms of file names and file contents, ignoring metadata such as date, user, and group.

However, there are some restrictions:

First, I have no control over whether metadata is included when the tar file is made. In fact, a tar file always contains metadata, so directly diffing the two tar files doesn't work.

Second, some of the tar files are so large that I cannot afford to untar them into a temporary directory and diff the contained files one by one. (I know that if I could untar file1.tar into file1/, I could compare them by invoking 'tar -dvf file2.tar' in file1/. But usually I cannot afford to untar even one of them.)

Any idea how I can compare the two tar files? It would be better if it could be done in a shell script. Alternatively, is there any way to get each sub-file's checksum without actually untarring the tarball?

Thanks,


Comments (12)

咽泪装欢 2024-07-31 14:22:48

Here is my variant; it also checks the Unix permissions.

It only works if the file names are shorter than 200 characters.

diff <(tar -tvf 1.tar | awk '{printf "%10s %200s %10s\n",$3,$6,$1}' | sort -k2) <(tar -tvf 2.tar | awk '{printf "%10s %200s %10s\n",$3,$6,$1}' | sort -k2)

花想c 2024-07-31 14:22:48

EDIT: See the comment by @StéphaneGourichon.

I realise that this is a late reply, but I came across this thread whilst trying to achieve the same thing. The solution I implemented extracts the archive contents to stdout and pipes them to whichever hash you choose:

tar -xOzf archive.tar.gz | sort | sha1sum

Note that the order of the options matters: O tells tar to write the extracted files to stdout, and f must come last because it is followed by the archive name.

心奴独伤 2024-07-31 14:22:48

Is tardiff what you're looking for? It's "a simple perl script" that "compares the contents of two tarballs and reports on any differences found between them."

噩梦成真你也成魔 2024-07-31 14:22:48

There is also diffoscope, which is more generic and can compare things recursively (including various formats).

pip install diffoscope

一杆小烟枪 2024-07-31 14:22:48

I propose gtarsum, which I wrote in Go, so it is a standalone executable (no Python or other runtime needed).

go get github.com/VonC/gtarsum

It reads a tar file and:

  • sorts the list of files alphabetically,
  • computes a SHA256 of each file's content,
  • concatenates those hashes into one giant string,
  • computes the SHA256 of that string.

The result is a "global hash" for the tar file, based on the list of files and their content.

It can compare multiple tar files, returning 0 if they are identical and 1 if they are not.
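
The steps listed above can be sketched in Python with the standard tarfile module, reading the archive as a stream so nothing is extracted to disk. This is only an illustration of the idea, not gtarsum's actual code: the function name tar_global_hash is made up here, mixing the file name into the final hash is an assumption, and the digest it produces is not expected to match gtarsum's output.

```python
import hashlib
import tarfile


def tar_global_hash(path=None, fileobj=None):
    """Return one SHA-256 hex digest covering member names and contents.

    Hypothetical helper: illustrates the "hash of sorted hashes" idea only.
    """
    per_file = {}
    # Mode "r|*" reads the archive sequentially (any compression),
    # so no member is ever written to disk.
    with tarfile.open(name=path, fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and special files
            digest = hashlib.sha256()
            stream = tar.extractfile(member)
            for chunk in iter(lambda: stream.read(1 << 20), b""):
                digest.update(chunk)
            per_file[member.name] = digest.hexdigest()

    overall = hashlib.sha256()
    # Sorting by name makes the result independent of member order.
    for name in sorted(per_file):
        line = f"{name}\0{per_file[name]}\n"
        overall.update(line.encode("utf-8", "surrogateescape"))
    return overall.hexdigest()
```

Two archives can then be compared with tar_global_hash("file1.tar") == tar_global_hash("file2.tar"). Dates, owners, and permissions never enter the hash, so they do not affect the result.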

懒的傷心 2024-07-31 14:22:48

Just throwing this out there, since none of the above solutions worked for what I needed.

This snippet computes the MD5 hash of the MD5 hashes of all the files whose paths match a given path. If the hashes are the same, the file hierarchy and file lists are the same.

I know it's not as performant as the others, but it provides the certainty I needed.

PATH_TO_CHECK="some/path"
# Assumes the archive paths under build/ contain no whitespace.
# tar -v prints each member name; --to-command pipes each member's contents to md5sum.
# The first grep keeps matching names plus the hash line after each;
# the second grep drops the name lines, leaving only the hashes.
for template in $(find build/ -name '*.tar'); do
    tar -xvf "$template" --to-command=md5sum |
        grep -A 1 -- "$PATH_TO_CHECK" |
        grep -v -- "$PATH_TO_CHECK" |
        awk '{print $1}' |
        md5sum |
        awk -v name="$template" '{print name, $1}'
done

*Note: an invalid path simply returns nothing.

安人多梦 2024-07-31 14:22:48

If you neither want to extract the archives nor need to see the actual differences, try diff's -q option:

diff -q 1.tar 2.tar

The quiet result is "Files 1.tar and 2.tar differ", or nothing if there are no differences. Note that this compares the archives byte for byte, so differences in metadata count as well.

画骨成沙 2024-07-31 14:22:48

There is a tool called archdiff. It is basically a Perl script that can look inside archives. From its description:

Takes two archives, or an archive and a directory and shows a summary of the
differences between them.

来日方长 2024-07-31 14:22:48

I had a similar problem and solved it with Python; here is the code.
P.S. Although this code compares the contents of two zip files, the approach is similar for tarballs. Note that it extracts both archives into temporary directories. I hope it helps.

import hashlib
import os
import shutil
import zipfile


def decompressZip(zipName, dirName):
    """Extract every member of zipName into dirName and return the member names."""
    with zipfile.ZipFile(zipName, "r") as zipFile:
        fileNames = zipFile.namelist()
        zipFile.extractall(dirName)
    return fileNames


def md5sum(filename):
    """Return the upper-case MD5 hex digest of a file."""
    md5obj = hashlib.md5()
    with open(filename, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5obj.update(chunk)
    return md5obj.hexdigest().upper()


if __name__ == "__main__":
    oldFileList = decompressZip("./old.zip", "./oldDir")
    newFileList = decompressZip("./new.zip", "./newDir")

    oldDict = dict()
    newDict = dict()

    # Hash every extracted regular file, skipping directory entries.
    for oldFile in oldFileList:
        tmpOldFile = "./oldDir/" + oldFile
        if not os.path.isdir(tmpOldFile):
            oldDict[oldFile] = md5sum(tmpOldFile)

    for newFile in newFileList:
        tmpNewFile = "./newDir/" + newFile
        if not os.path.isdir(tmpNewFile):
            newDict[newFile] = md5sum(tmpNewFile)

    additionList = list()
    modifyList = list()

    # A file is new if only the new archive has it, modified if its hash changed.
    for key in newDict:
        if key not in oldDict:
            additionList.append(key)
        elif newDict[key] != oldDict[key]:
            modifyList.append(key)

    print("new file list: %s" % additionList)
    print("modified file list: %s" % modifyList)

    # Clean up the temporary extraction directories.
    shutil.rmtree("./oldDir")
    shutil.rmtree("./newDir")

若沐 2024-07-31 14:22:47

Also try pkgdiff to visualize the differences between packages. It detects added, removed, and renamed files as well as changed content, and exits with code zero if nothing changed:

pkgdiff PKG-0.tgz PKG-1.tgz

呆橘 2024-07-31 14:22:47

Do you control the creation of these tar files?
If so, the best trick would be to generate MD5 checksums and store them in a file within the archive itself. Then, when you want to compare two archives, you just extract the checksum files and compare them.

If you can afford to extract just one of the tar files, you can use tar's --diff option to look for differences against the contents of the other tar file.

One more crude trick, if you are fine with comparing just the file names and their sizes.
Remember, this does not guarantee that the file contents are the same!

Run tar tvf to list the contents of each archive and store the outputs in two different files. Then slice out everything except the file name and size columns. Preferably sort the two lists as well. Then just diff the two lists.

Just remember that this last scheme does not actually compute checksums.

Sample tar listing and output (all files are zero size in this example):

$ tar tvfj pack1.tar.bz2
drwxr-xr-x user/group 0 2009-06-23 10:29:51 dir1/
-rw-r--r-- user/group 0 2009-06-23 10:29:50 dir1/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:51 dir1/file2
drwxr-xr-x user/group 0 2009-06-23 10:29:59 dir2/
-rw-r--r-- user/group 0 2009-06-23 10:29:57 dir2/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:59 dir2/file3
drwxr-xr-x user/group 0 2009-06-23 10:29:45 dir3/

Command to generate a sorted name/size list:

$ tar tvfj pack1.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2
         0 dir1/
         0 dir1/file1
         0 dir1/file2
         0 dir2/
         0 dir2/file1
         0 dir2/file3
         0 dir3/

You can take two such sorted lists and diff them.
You can also include the date and time columns if that works for you.
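
For comparison, the same name/size listing can be produced without parsing tar's text output. A minimal Python sketch, assuming the standard tarfile module; name_size_list is a hypothetical helper name, and like the shell version it never looks at file contents:

```python
import tarfile


def name_size_list(path=None, fileobj=None):
    """Return sorted (name, size) pairs, ignoring dates, owners and modes."""
    # "r|*" streams through the archive (any compression) without extracting.
    with tarfile.open(name=path, fileobj=fileobj, mode="r|*") as tar:
        return sorted((member.name, member.size) for member in tar)
```

Comparing name_size_list("pack1.tar.bz2") with name_size_list("pack2.tar.bz2") then plays the role of diffing the two sorted lists. Two files with the same name and size but different contents still compare equal, so the caveat above applies here too.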

抱着落日 2024-07-31 14:22:47

tarsum is almost what you need. Take its output, run it through sort so that the ordering is identical for both archives, and then compare the two with diff. That should give you a basic implementation, and it would be easy enough to pull those steps into the main program by modifying the Python code to do the whole job.
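
A rough Python sketch of the same idea (per-file checksums computed while streaming the archive, then a comparison), using only the standard library. It does not reproduce tarsum's output format; member_digests and compare_digests are hypothetical names, not part of tarsum:

```python
import hashlib
import tarfile


def member_digests(path=None, fileobj=None):
    """Map each regular file in a tar archive to the SHA-256 of its contents."""
    digests = {}
    # "r|*" reads the archive sequentially, so nothing is extracted to disk.
    with tarfile.open(name=path, fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # directories and links carry no file data
            h = hashlib.sha256()
            stream = tar.extractfile(member)
            for chunk in iter(lambda: stream.read(1 << 20), b""):
                h.update(chunk)
            digests[member.name] = h.hexdigest()
    return digests


def compare_digests(old, new):
    """Return sorted lists of (removed, added, changed) member names."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(n for n in set(old) & set(new) if old[n] != new[n])
    return removed, added, changed
```

For example, removed, added, changed = compare_digests(member_digests("file1.tar"), member_digests("file2.tar")); the two archives hold the same files exactly when all three lists are empty. Keying the result by name takes the place of the sort step, since member order no longer matters.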
