Computing an md5 hash of a data structure
I want to compute an md5 hash not of a string, but of an entire data structure. I understand the mechanics of a way to do this (dispatch on the type of the value, canonicalize dictionary key order and other randomness, recurse into sub-values, etc). But it seems like the kind of operation that would be generally useful, so I'm surprised I need to roll this myself.
Is there some simpler way in Python to achieve this?
UPDATE: pickle has been suggested, and it's a good idea, but pickling doesn't canonicalize dictionary key order:
>>> import cPickle as pickle
>>> import hashlib, random
>>> for i in range(10):
...     k = [i*i for i in range(1000)]
...     random.shuffle(k)
...     d = dict.fromkeys(k, 1)
...     p = pickle.dumps(d)
...     print hashlib.md5(p).hexdigest()
...
51b5855799f6d574c722ef9e50c2622b
43d6b52b885f4ecb4b4be7ecdcfbb04e
e7be0e6d923fe1b30c6fbd5dcd3c20b9
aebb2298be19908e523e86a3f3712207
7db3fe10dcdb70652f845b02b6557061
43945441efe82483ba65fda471d79254
8e4196468769333d170b6bb179b4aee0
951446fa44dba9a1a26e7df9083dcadf
06b09465917d3881707a4909f67451ae
386e3f08a3c1156edd1bd0f3862df481
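The same effect can be reproduced on Python 3, where dicts remember insertion order and pickle writes entries in that order. A minimal sketch, with illustrative variable names that are not from the original post:

```python
import hashlib
import pickle

# Two dicts with identical contents but different insertion order.
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}

digest_a = hashlib.md5(pickle.dumps(a)).hexdigest()
digest_b = hashlib.md5(pickle.dumps(b)).hexdigest()

print(a == b)                # the dicts compare equal
print(digest_a == digest_b)  # the pickled bytes differ, so the digests do too
```

So a hash over pickled bytes identifies one particular serialization, not the logical value.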
Comments (8)
json.dumps() can sort dictionaries by key, so you don't need any other dependencies.
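A minimal sketch of this approach, assuming the data is JSON-serializable. The helper name `json_md5` and the choice of `separators` are illustrative, not taken from the answer:

```python
import hashlib
import json


def json_md5(data):
    """Return the md5 hex digest of `data` serialized as key-sorted JSON."""
    # sort_keys=True makes the text independent of dict insertion order;
    # explicit separators keep the whitespace fixed.
    text = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(text.encode("utf-8")).hexdigest()


print(json_md5({"b": [1, 2, 3], "a": {"y": None, "x": True}}))
```

Note that `sort_keys=True` raises `TypeError` when a dict mixes key types that cannot be compared, and that tuples and lists serialize to the same JSON array.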
bencode sorts dictionaries, so its encoded output can be hashed directly.
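To show why a bencoded value hashes the same regardless of key order, here is a small hand-written encoder for the bencode format. It is only a sketch and is not the `bencode` package the answer refers to; the function name and the choice to encode `str` as UTF-8 are assumptions.

```python
import hashlib


def bencode(value):
    """Encode ints, str/bytes, lists and dicts in bencode form (illustrative)."""
    if isinstance(value, bool):
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, (list, tuple)):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = []
        for key, item in value.items():
            if isinstance(key, str):
                key = key.encode("utf-8")
            if not isinstance(key, bytes):
                raise TypeError("bencode dict keys must be strings")
            items.append((key, item))
        # Emitting keys in sorted order is what makes the encoding canonical.
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %s" % type(value).__name__)


data = {"b": 1, "a": [1, "x"]}
print(bencode(data))
print(hashlib.md5(bencode(data)).hexdigest())
```

The format only covers integers, strings, lists and dictionaries, so floats, `None` and sets need to be mapped onto those types first.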
I ended up writing it myself, as I thought I would have to.
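One way to hand-roll this, following the mechanics described in the question (dispatch on type, canonicalize dict and set order, recurse into sub-values), might look like the sketch below. It is not the answerer's code; the function name `structure_md5` and the set of supported types are assumptions.

```python
import hashlib


def structure_md5(value):
    """Return an md5 hex digest of a nested structure of built-in types."""
    hasher = hashlib.md5()
    if isinstance(value, dict):
        hasher.update(b"dict")
        # Hash each (key, value) pair on its own, then sort the digests so
        # that insertion order cannot affect the result.
        for digest in sorted(structure_md5((k, v)) for k, v in value.items()):
            hasher.update(digest.encode("ascii"))
    elif isinstance(value, (set, frozenset)):
        hasher.update(b"set")
        for digest in sorted(structure_md5(item) for item in value):
            hasher.update(digest.encode("ascii"))
    elif isinstance(value, (list, tuple)):
        # Sequences keep their order; the type name separates list from tuple.
        hasher.update(type(value).__name__.encode("ascii"))
        for item in value:
            hasher.update(structure_md5(item).encode("ascii"))
    elif isinstance(value, (str, bytes, int, float, type(None))):
        # The repr() of these built-in scalars is stable across runs.
        hasher.update(type(value).__name__.encode("ascii"))
        hasher.update(repr(value).encode("utf-8"))
    else:
        raise TypeError("unsupported type: %s" % type(value).__name__)
    return hasher.hexdigest()


print(structure_md5({"a": [1, 2], "b": {3, 4}}))
```

Sorting digests instead of the keys themselves avoids having to compare keys of different types. Other types can be supported by adding further branches.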
You could use the builtin pprint, which will cover some more cases than the proposed json.dumps() solution. For example, datetime objects will be handled correctly. The example can be rewritten to use pprint instead of json.
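A minimal sketch of the pprint variant, assuming the default `pprint.pformat()` settings (which sort dict keys). The helper name `pprint_md5` and the sample data are illustrative:

```python
import hashlib
import pprint
from datetime import datetime


def pprint_md5(data):
    """Return the md5 hex digest of the pretty-printed form of `data`."""
    # pformat() sorts dict keys by default, so insertion order does not matter.
    text = pprint.pformat(data)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


first = {"when": datetime(2020, 1, 1), "n": 1}
second = {"n": 1, "when": datetime(2020, 1, 1)}
print(pprint_md5(first) == pprint_md5(second))
```

This relies on `repr()` being stable for every value involved. Objects whose default `repr()` embeds a memory address will hash differently from run to run.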
UPDATE: this won't work for dictionaries due to key order randomness. Sorry, I hadn't thought of that.
This should work for any Python data structure, and for objects as well.
While it does require a dependency on joblib, I've found that joblib.hashing.hash(object) works very well and is designed for use with joblib's disk caching mechanism. Empirically it seems to produce consistent results from run to run, even on data that pickle mixes up on different runs.
Alternatively, you might be interested in artemis-ml's compute_fixed_hash function, which theoretically hashes objects in a way that is consistent across runs. However, I've not tested it myself.
Sorry for posting millions of years after the original question.
ROCKY way: Put all your struct items in one parent entity (if not already), recurse and sort/canonicalize/etc. them, then calculate the md5 of its repr.
Calculating a checksum over a JSON serialization is a good idea, as it's easy to implement and easy to extend to Python data structures that are not natively JSON serializable.
This is my revised version of @webwurst's answer, which generates the JSON string in chunks that are immediately consumed to calculate the final checksum, to prevent excessive memory consumption for a large object:
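A sketch of the chunked idea using the standard library's `json.JSONEncoder.iterencode()`. This is not the answer's original code, and the function name `json_md5_chunked` is hypothetical:

```python
import hashlib
import json


def json_md5_chunked(data):
    """Hash key-sorted JSON incrementally instead of building one big string."""
    hasher = hashlib.md5()
    encoder = json.JSONEncoder(sort_keys=True)
    # iterencode() yields the JSON text piece by piece, so the complete
    # serialization never has to be held in memory at once.
    for chunk in encoder.iterencode(data):
        hasher.update(chunk.encode("utf-8"))
    return hasher.hexdigest()


print(json_md5_chunked({"b": [1, 2, 3], "a": {"y": None, "x": "text"}}))
```

Types that JSON cannot represent natively can be handled by passing a `default` callable to `json.JSONEncoder`, which converts such objects into serializable ones.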