Determine whether any files in a directory have been added, removed, or modified

Posted on 2024-12-03 03:42:20


I'm trying to write a Python script that will get the md5sum of all files in a directory (on Linux), which I believe I have done in the code below.

I want to be able to run it to make sure no files within the directory have changed, and no files have been added or deleted.

The problem is that if I make a change to a file in the directory and then change it back, I get a different result from running the function below (even though I changed the modified file back).

Can anyone explain this, and let me know if you can think of a workaround?

import hashlib
import os
import tarfile

def get_dir_md5(dir_path):
    """Build a tar file of the directory and return its md5 sum."""
    temp_tar_path = 'tests.tar'
    t = tarfile.TarFile(temp_tar_path, mode='w')
    t.add(dir_path)
    t.close()

    with open(temp_tar_path, 'rb') as f:
        ret_str = hashlib.md5(f.read()).hexdigest()

    # delete the temporary tar file
    os.remove(temp_tar_path)
    return ret_str

Edit:
As these fine folks have answered, it looks like tar includes header information such as the modification date. Would using zip or another format work any differently?

Any other ideas for workarounds?


Comments (4)

淡紫姑娘! 2024-12-10 03:42:20


As the other answers mentioned, two tar files can be different even if the contents are the same either due to tar metadata changes or to file order changes. You should run the checksum on the file data directly, sorting the directory lists to ensure they are always in the same order. If you want to include some metadata in the checksum, include it manually.

Untested example using os.walk:

import hashlib
import os
import os.path
import struct

def get_dir_md5(dir_root):
    """Hash the contents of every file under dir_root and return the md5 hex digest."""

    md5 = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(dir_root, topdown=True):

        dirnames.sort(key=os.path.normcase)
        filenames.sort(key=os.path.normcase)

        for filename in filenames:
            filepath = os.path.join(dirpath, filename)

            # If some metadata is required, add it to the checksum

            # 1) filename (good idea)
            # md5.update(os.path.normcase(os.path.relpath(filepath, dir_root)).encode())

            # 2) mtime (possibly a bad idea)
            # st = os.stat(filepath)
            # md5.update(struct.pack('d', st.st_mtime))

            # 3) size (good idea perhaps)
            # md5.update(struct.pack('q', st.st_size))

            with open(filepath, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    md5.update(chunk)

    return md5.hexdigest()
明天过后 2024-12-10 03:42:20


TAR file headers include a field for the modified time of the file; the act of changing a file, even if that change is later changed back, will mean the TAR file headers will be different, leading to different hashes.
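A quick way to see this (a hypothetical sketch of my own, working in a temporary directory; the helper `tar_md5` is not from the original post): archive the same file before and after touching only its timestamps, and the tar bytes, and therefore the digests, come out different even though the content never changed.

```python
import hashlib
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
demo_file = os.path.join(workdir, 'demo.txt')
with open(demo_file, 'w') as f:
    f.write('hello')

def tar_md5(path):
    """Archive one file and return the md5 of the raw tar bytes."""
    tar_path = os.path.join(workdir, 'demo.tar')
    with tarfile.open(tar_path, mode='w') as t:
        t.add(path, arcname='demo.txt')  # fixed arcname so only headers can vary
    with open(tar_path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    os.remove(tar_path)
    return digest

before = tar_md5(demo_file)
os.utime(demo_file, (0, 0))  # touch only the timestamps; the content is unchanged
after = tar_md5(demo_file)
print(before != after)
```

The only thing that changed between the two archives is the mtime stored in the tar header, yet the digests no longer match.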

会傲 2024-12-10 03:42:20


You do not need to make the TAR file to do what you propose.

Here is your workaround algorithm:

  1. Walk the directory tree;
  2. Take the md5 signature of each file;
  3. Sort the signatures;
  4. Take the md5 signature of the text string of all the signatures of the individual files.

The single resulting signature will be what you are looking for.
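The four steps above might be sketched like this (an untested illustration; the function names are my own, not from the answer):

```python
import hashlib
import os

def file_md5(path):
    """Step 2: md5 of one file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

def dir_signature(dir_root):
    """Steps 1, 3, and 4: walk the tree, sort the per-file hashes, hash the joined text."""
    signatures = []
    for dirpath, dirnames, filenames in os.walk(dir_root):
        for filename in filenames:
            signatures.append(file_md5(os.path.join(dirpath, filename)))
    signatures.sort()
    return hashlib.md5(''.join(signatures).encode('ascii')).hexdigest()
```

Because only file contents feed the hash, editing a file and reverting it yields the same signature again; note that this variant ignores file names, so a pure rename would go undetected unless you also hash the names.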

Heck, you don't even need Python. You can do this:

find /path/to/dir/ -type f -name '*.py' -exec md5sum {} + | awk '{print $1}' \
| sort | md5sum
心清如水 2024-12-10 03:42:20


tar files contain metadata beyond the actual file contents, such as file access times, modification times, etc. Even if the file contents don't change, the tar file will in fact be different.
