Get the MD5 hash of big files in Python
I have used hashlib (which replaces md5 in Python 2.6/3.0), and it worked fine if I opened a file and put its content into the hashlib.md5() function.

The problem is with very big files whose sizes could exceed the available RAM.
How can I get the MD5 hash of a file without loading the whole file into memory?
You need to read the file in chunks of suitable size:
Note: Make sure you open your file in binary mode ('rb') - otherwise you will get the wrong result.
So to do the whole lot in one method - use something like:
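(A sketch of such an all-in-one method; the file name and the 4096-byte chunk size are just placeholders.)

    import hashlib

    def md5_for_file(path, chunk_size=4096):
        """Compute the MD5 of a file by reading it in fixed-size chunks."""
        md5 = hashlib.md5()
        with open(path, 'rb') as f:          # 'rb' matters, see the note above
            while True:
                chunk = f.read(chunk_size)
                if not chunk:                # b'' signals end of file
                    break
                md5.update(chunk)
        return md5.hexdigest()

    print(md5_for_file('some_big_file.bin'))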
The update above was based on the comments provided by Frerich Raabe. I tested this and found it to be correct on my Python 2.7.2 Windows installation, and I cross-checked the results using the jacksum tool.
Break the file into 8192-byte chunks (or some other multiple of MD5's 64-byte block size) and feed them to MD5 consecutively using update(). This takes advantage of the fact that MD5 processes its input in 64-byte blocks (8192 is 64×128). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
In Python 3.8+ you can do this with an assignment expression (the := walrus operator).
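(A sketch of that pattern; the file name is a placeholder.)

    import hashlib

    md5 = hashlib.md5()
    with open('some_big_file.bin', 'rb') as f:
        # := assigns and tests the chunk in one expression (Python 3.8+)
        while chunk := f.read(8192):
            md5.update(chunk)
    print(md5.hexdigest())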
Python < 3.7
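(A sketch of the pre-3.8 loop, where the chunk has to be read before each test; the file name and chunk size are placeholders.)

    import hashlib

    md5 = hashlib.md5()
    with open('some_big_file.bin', 'rb') as f:
        chunk = f.read(8192)
        while chunk:                 # read() returns b'' at end of file
            md5.update(chunk)
            chunk = f.read(8192)
    print(md5.hexdigest())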
Python 3.8 and above
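(A sketch using the walrus operator, wrapped in a function; the chunk size is a placeholder.)

    import hashlib

    def md5_of(path, chunk_size=8192):
        md5 = hashlib.md5()
        with open(path, 'rb') as f:
            while chunk := f.read(chunk_size):   # requires Python 3.8+
                md5.update(chunk)
        return md5.hexdigest()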
Original post
If you want a more Pythonic (no while True) way of reading the file, check this code:
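(A sketch of the iter()-with-sentinel pattern described here; names and chunk size are illustrative.)

    import hashlib
    from functools import partial

    def checksum_md5(path, chunk_size=8192):
        md5 = hashlib.md5()
        with open(path, 'rb') as f:
            # iter() keeps calling f.read(chunk_size) until it returns the sentinel b''
            for chunk in iter(partial(f.read, chunk_size), b''):
                md5.update(chunk)
        return md5.hexdigest()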
Note that the iter() function needs an empty byte string for the returned iterator to halt at EOF, since read() returns b'' (not just '').
Here's my version of Piotr Czapla's method:
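(A sketch of such a variant; that it takes an already opened binary file object and uses 1 MB blocks are assumptions, not the author's exact code.)

    import hashlib

    def md5_for_fileobj(f, block_size=2 ** 20):
        """Hash an open binary file object in 1 MB blocks."""
        md5 = hashlib.md5()
        while True:
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
        return md5.hexdigest()

    # usage:
    # with open('some_big_file.bin', 'rb') as f:
    #     print(md5_for_fileobj(f))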
Using multiple comment/answers for this question, here is my solution:
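(A sketch that pulls the pieces together: binary mode, chunked reads via iter(), and a hex digest; the 65536-byte chunk size is a placeholder.)

    import hashlib

    def file_md5(path, chunk_size=65536):
        md5 = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                md5.update(chunk)
        return md5.hexdigest()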
A Python 2/3 portable solution
To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you'll be hashing byte values.
To be Python 2.7 and Python 3 portable, you ought to use the io package. If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:
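(A sketch of such a portable, chunked reader; io.open behaves the same on Python 2.7 and Python 3, and the chunk size is a placeholder.)

    import hashlib
    import io

    def md5sum(path, chunk_size=65536):
        md5 = hashlib.md5()
        with io.open(path, mode='rb') as fd:
            # iter() with the b'' sentinel stops the loop at end of file
            for chunk in iter(lambda: fd.read(chunk_size), b''):
                md5.update(chunk)
        return md5.hexdigest()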
The trick here is to use the iter() function with a sentinel (the empty byte string b'').

If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the amount of calculated bytes:
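(A sketch with a hypothetical callback parameter; the callback receives the number of bytes hashed so far and the total file size.)

    import hashlib
    import io
    import os

    def md5sum(path, chunk_size=65536, callback=None):
        total = os.path.getsize(path)
        done = 0
        md5 = hashlib.md5()
        with io.open(path, mode='rb') as fd:
            for chunk in iter(lambda: fd.read(chunk_size), b''):
                md5.update(chunk)
                done += len(chunk)
                if callback is not None:
                    callback(done, total)       # print or log progress here
        return md5.hexdigest()

    # usage:
    # md5sum('some_big_file.bin', callback=lambda done, total: print(done, '/', total))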
A remix of Bastien Semene's code that takes Hawkwing's comment about a generic hashing function into consideration...
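(A sketch of what such a generalisation might look like: the hashing is split from the reading so the caller can pass any hashlib constructor; all names are illustrative.)

    import hashlib

    def hash_bytes_iter(chunks, hasher, as_hex=True):
        """Feed an iterable of byte chunks to any hashlib hasher."""
        for chunk in chunks:
            hasher.update(chunk)
        return hasher.hexdigest() if as_hex else hasher.digest()

    def file_as_chunks(f, chunk_size=65536):
        """Yield a binary file object in fixed-size chunks, closing it afterwards."""
        with f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                yield chunk

    # usage: the same helpers work for md5, sha1, sha256, ...
    # hash_bytes_iter(file_as_chunks(open('some_big_file.bin', 'rb')), hashlib.sha256())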
You can't get its md5 without reading the full content, but you can read the file's content block by block and feed each block to the update function.
m.update(a); m.update(b) is equivalent to m.update(a+b).
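(A quick demonstration of that equivalence.)

    import hashlib

    a, b = b'hello ', b'world'

    m1 = hashlib.md5()
    m1.update(a)
    m1.update(b)

    m2 = hashlib.md5(a + b)

    assert m1.hexdigest() == m2.hexdigest()   # both hash b'hello world'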
I think the following code is more Pythonic:
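(A sketch of one such compact version; note that iterating a binary file yields pieces split on newline bytes, so a file containing no b'\n' at all would still arrive as one big chunk.)

    import hashlib

    def get_md5(path):
        m = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in f:       # pieces are delimited by b'\n'
                m.update(chunk)
        return m.hexdigest()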
I don't like loops. Based on Nathan Feger's answer:
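(A loop-free sketch using functools.reduce over the same chunk iterator; names and chunk size are placeholders.)

    import hashlib
    from functools import reduce

    def md5_without_loops(path, chunk_size=8192):
        with open(path, 'rb') as f:
            return reduce(
                # update the hasher with each chunk, then pass the hasher along
                lambda h, chunk: (h.update(chunk), h)[1],
                iter(lambda: f.read(chunk_size), b''),
                hashlib.md5(),
            ).hexdigest()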
Implementation of Yuval Adam's answer for Django:
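(A sketch of a Django-flavoured version; it relies on File.chunks(), which Django file objects such as the values in request.FILES provide. The function name is illustrative.)

    import hashlib

    def md5_for_django_file(django_file):
        """Hash a django.core.files.File chunk by chunk (e.g. an uploaded file)."""
        md5 = hashlib.md5()
        for chunk in django_file.chunks():    # Django yields the file in manageable chunks
            md5.update(chunk)
        return md5.hexdigest()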
As mentioned in @pseyfert's comment, in Python 3.11 and above hashlib.file_digest() can be used. While not explicitly documented, internally the function uses a chunking approach similar to the one in the accepted answer, as can be seen from its source code (lines 230–236).

The function also provides a keyword-only argument _bufsize with a default value of 2^18 = 262,144 bytes that controls the buffer size for chunking; however, given its leading underscore and missing documentation, it should probably rather be considered an implementation detail.

In any case, the following code equivalently reproduces the accepted answer in Python 3.11+ (apart from the different chunk size):
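(A sketch of that usage; the file name is a placeholder, and Python 3.11+ is required.)

    import hashlib

    with open('some_big_file.bin', 'rb') as f:
        digest = hashlib.file_digest(f, 'md5')   # also accepts a constructor such as hashlib.md5

    print(digest.hexdigest())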
I'm not sure that there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs in MySQL, so I experimented with various file sizes and the straightforward Python approach, viz:
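(That is, something along these lines; the file name is a placeholder, and the whole file is read into memory at once.)

    import hashlib

    with open('some_file.bin', 'rb') as f:
        print(hashlib.md5(f.read()).hexdigest())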
I couldn’t detect any noticeable performance difference with file sizes ranging from 2 KB to 20 MB, and therefore saw no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer's ability to keep it from doing so. As it happened, the problem had nothing to do with md5. If you're using MySQL, don't forget the md5() and sha1() functions already there.