Maximum byte limit in the update method of Python's hashlib module
I am trying to compute the MD5 hash of a file with the hashlib.md5() function from the hashlib module.
So I wrote this piece of code:
import hashlib

Buffer = 128
f = open("c:\\file.tct", "rb")
m = hashlib.md5()
while True:
    p = f.read(Buffer)
    if len(p) != 0:
        m.update(p)
    else:
        break
print(m.hexdigest())
f.close()
I noticed that update runs faster if I increase the Buffer value to 64, 128, 256 and so on.
Is there an upper limit I cannot exceed? I suppose it might just be a matter of available RAM, but I don't know.
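As a point of comparison, the same loop can be written more idiomatically with the two-argument form of iter(), which stops when read() returns an empty bytes object at EOF. This is a minimal sketch; it creates its own temporary file instead of the asker's "c:\\file.tct" path so it is self-contained:

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=2**15):
    """Compute the MD5 of a file by reading it in fixed-size chunks."""
    m = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a b"" sentinel stops the loop at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            m.update(chunk)
    return m.hexdigest()

# Demo with a temporary file so the example is runnable anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world" * 1000)
    path = tmp.name

digest = md5_of_file(path)
print(digest)
os.remove(path)
```

The chunk_size default of 2**15 here is an arbitrary choice for illustration, not a recommendation.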
3 Answers
Big (≈ 2**40) chunk sizes lead to MemoryError, i.e., there is no limit other than available RAM. On the other hand, bufsize is limited by 2**31-1 on my machine.
A big chunksize can be as slow as a very small one. Measure it. I find that for ≈ 10 MB files a 2**15 chunksize is the fastest of those I've tested.
To be able to handle arbitrarily large files you need to read them in blocks. The size of such blocks should preferably be a power of 2, and in the case of md5 the minimum possible block consists of 64 bytes (512 bits) as 512-bit blocks are the units on which the algorithm operates.
But if we go beyond that and try to establish an exact criterion for whether, say, a 2048-byte block is better than a 4096-byte block... we will likely fail. This needs to be very carefully tested and measured, and the value is almost always chosen arbitrarily, based on experience.
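The point about 512-bit blocks being the algorithm's unit of operation can be checked directly: feeding data to update() in any chunk granularity, including the minimal 64 bytes, always produces the same digest as hashing everything at once. A minimal sketch with arbitrary sample data:

```python
import hashlib

data = b"x" * 10_000  # arbitrary sample data for the demonstration

# One-shot hash of the whole buffer.
whole = hashlib.md5(data).hexdigest()

# Feed the same data 64 bytes (512 bits) at a time, the MD5 block size.
m = hashlib.md5()
for i in range(0, len(data), 64):
    m.update(data[i:i + 64])
chunked = m.hexdigest()

# The chunking granularity never changes the digest, only the speed.
print(whole == chunked)
```

Chunk size is therefore purely a performance knob; correctness is unaffected.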
The buffer value is the number of bytes that is read and stored in memory at once, so yes, the only limit is your available memory.
However, bigger values are not automatically faster. At some point, you might run into memory paging issues or other slowdowns with memory allocation if the buffer is too large. You should experiment with larger and larger values until you hit diminishing returns in speed.
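The "experiment until diminishing returns" advice can be sketched as a small sweep over power-of-2 chunk sizes. This example uses in-memory random data rather than a real file, and the range of sizes tried (64 bytes up to 1 MiB) is an assumption for illustration:

```python
import hashlib
import os
import time

data = os.urandom(2**20)  # 1 MiB of random sample data

def time_md5(chunk_size):
    """Hash `data` in chunks of `chunk_size`; return (elapsed, digest)."""
    start = time.perf_counter()
    m = hashlib.md5()
    for i in range(0, len(data), chunk_size):
        m.update(data[i:i + chunk_size])
    return time.perf_counter() - start, m.hexdigest()

results = {}
digests = set()
for exp in range(6, 21):  # 2**6 = 64 bytes up to 2**20 = 1 MiB
    elapsed, digest = time_md5(2**exp)
    results[2**exp] = elapsed
    digests.add(digest)

# Every chunk size yields the same digest; only the timing differs.
fastest = min(results, key=results.get)
print(f"fastest chunk size: {fastest}")
```

On a real workload you would time reads from disk as well, since I/O buffering, not hashing, often dominates.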