How do I generate an MD5 checksum of a file?

Posted 2024-09-13 07:33:41

Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I'm working on, and I'd like to confirm the checksums of the files).

Comments (9)

尛丟丟 2024-09-20 07:33:41

You can use hashlib.md5().

Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read it sequentially in 4096-byte chunks and feed them to the md5 object's update() method:

import hashlib
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # Read in 4096-byte chunks so the whole file never has to fit in memory.
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Note: hash_md5.hexdigest() returns the hex string representation of the digest. If you just need the packed bytes, use return hash_md5.digest() instead, so you don't have to convert back.
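
Since the question also asks about checking a list of files, here's a minimal sketch that builds on the md5() helper above; the file names and expected sums are hypothetical placeholders:

# Hypothetical file names and expected checksums, e.g. loaded from a
# manifest file; replace with your own.
expected = {
    "a.bin": "0cc175b9c0f1b6a831c399e269772661",
    "b.bin": "92eb5ffee6ae2fec3ad71c777531578f",
}

for fname, want in expected.items():
    status = "OK" if md5(fname) == want else "MISMATCH"
    print(f"{fname}: {status}")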

凉宸 2024-09-20 07:33:41

There is a way that's pretty memory inefficient.

Single file:

import hashlib
def file_as_bytes(file):
    # Read the whole file into memory, then close it.
    with file:
        return file.read()

print(hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest())

List of files:

[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

Recall, though, that MD5 is known to be broken and should not be used for any purpose, since vulnerability analysis can be really tricky and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat-out removed from the library so everybody who uses it is forced to update. So, here's what you should do instead:

[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

If you only want 128 bits worth of digest you can do .digest()[:16].

This will give you a list of tuples, each tuple containing the name of its file and its hash.

Again I strongly question your use of MD5. You should at least be using SHA1, and given the recent flaws discovered in SHA1, probably not even that. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters is all. It's not that hard.

Here is a way that is more complex, but memory efficient:

import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    # Feed every block from the iterator into the hasher.
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    # Yield the file in blocksize-byte blocks, closing it when exhausted.
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)


[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
    for fname in fnamelst]

And, again, since MD5 is broken and should not really ever be used anymore:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
    for fname in fnamelst]

Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.
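
For reference, a minimal usage sketch of the two helpers above that prints hex digests instead of packed bytes (it assumes the same fnamelst as the snippets above):

for fname in fnamelst:
    print(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')),
                                   hashlib.sha256(), ashexstr=True))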

不忘初心 2024-09-20 07:33:41

I'm clearly not adding anything fundamentally new, but added this answer before I was up to commenting status, plus the code regions make things more clear -- anyway, specifically to answer @Nemo's question from Omnifarious's answer:

I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you'd expect. Taking the fastest (but pretty typical) timeit.timeit or /usr/bin/time result from each of several methods of checksumming a file of approx. 11MB:

$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f  /tmp/test.data.300k

real    0m0.043s
user    0m0.032s
sys     0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400

So, looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum function (md5sum_read in the above listing) is pretty similar to Omnifarious's:

import hashlib
def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()

Granted, these are from single runs (the mmap ones are always a smidge faster when at least a few dozen runs are made), and mine's usually got an extra f.read(blocksize) after the buffer is exhausted, but it's reasonably repeatable and shows that md5sum on the command line is not necessarily faster than a Python implementation...
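
The mmap variants from the timing listing aren't shown in this answer; here is a sketch of what md5sum_mmap might look like (my reconstruction, not the author's actual benchmark code):

import hashlib
import mmap

def md5sum_mmap(filename):
    # Map the (non-empty) file into memory and hash the whole mapping at once;
    # mmap-ing a zero-length file raises ValueError.
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return hashlib.md5(mm).hexdigest()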

EDIT: Sorry for the long delay, haven't looked at this in some time, but to answer @EdRandall's question, I'll write down an Adler32 implementation. However, I haven't run the benchmarks for it. It's basically the same as the CRC32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32() call:

import zlib
def adler32sum(filename, blocksize=65536):
    # Seed with the Adler-32 of the empty byte string, which is 1.
    checksum = zlib.adler32(b"")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff

Note that this must start off with the empty byte string, as Adler sums do indeed differ when starting from zero versus their sum for b"" (which is 1); CRC can start with 0 instead. The AND-ing is needed to make it a 32-bit unsigned integer, which ensures it returns the same value across Python versions.
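
For comparison, a sketch of the CRC32 analogue described above (my reconstruction under the same assumptions, not benchmarked):

import zlib

def crc32sum(filename, blocksize=65536):
    checksum = 0  # unlike Adler-32, CRC-32 can be seeded with plain 0
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.crc32(block, checksum)
    return checksum & 0xffffffff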

尬尬 2024-09-20 07:33:41

In Python 3.8+, you can use the assignment operator := (along with hashlib) like this:

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippet). It's cryptographically secure and faster than MD5.
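
A minimal sketch of that swap; the digest_size=16 argument is an optional extra (not part of the snippet above) that trims BLAKE2b's output to the same 128-bit length as MD5:

import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.blake2b(digest_size=16)  # 16-byte digest, same length as MD5
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.hexdigest())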

因为看清所以看轻 2024-09-20 07:33:41
import hashlib
import pathlib

# One-liner: read the whole file into memory and hash it.
hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()
聚集的泪 2024-09-20 07:33:41

In Python 3.11+, there's a new readable and memory-efficient method:

import hashlib
with open(path, "rb") as f:
    digest = hashlib.file_digest(f, "md5")
print(digest.hexdigest())
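
Applied to the original question's list of files, a short sketch (the paths list is a hypothetical placeholder):

import hashlib

paths = ["a.bin", "b.bin"]  # hypothetical file list
for path in paths:
    with open(path, "rb") as f:
        print(path, hashlib.file_digest(f, "md5").hexdigest())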
浴红衣 2024-09-20 07:33:41

You could use simple-file-checksum[1], which just uses subprocess to call openssl for macOS/Linux and CertUtil for Windows, and extracts only the digest from the output:

Installation:

pip install simple-file-checksum

Usage:

>>> from simple_file_checksum import get_checksum
>>> get_checksum("path/to/file.txt")
'9e107d9d372bb6826bd81d3542a419d6'
>>> get_checksum("path/to/file.txt", algorithm="MD5")
'9e107d9d372bb6826bd81d3542a419d6'

The SHA1, SHA256, SHA384, and SHA512 algorithms are also supported.


[1] Disclosure: I am the author of simple-file-checksum.

落在眉间の轻吻 2024-09-20 07:33:41

Change file_path to the path of your file:

import hashlib
def getMd5(file_path):
    m = hashlib.md5()
    with open(file_path, 'rb') as f:
        # Reads the entire file into memory in one go.
        data = f.read()
        m.update(data)
    return m.hexdigest()

You can make use of the shell here.

from subprocess import check_output

# For Windows & Linux:
hash = check_output(args='md5sum imp_file.txt', shell=True).decode().split(' ')[0]

# For macOS (strip the whitespace around the digest in "MD5 (file) = <hash>"):
hash = check_output(args='md5 imp_file.txt', shell=True).decode().split('=')[1].strip()
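
As a side note, passing the arguments as a list avoids shell=True entirely; a sketch of the same md5sum call:

from subprocess import check_output

# Argument list instead of a shell string; no shell parsing involved.
hash = check_output(['md5sum', 'imp_file.txt']).decode().split(' ')[0]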