Computing an MD5 hash of a data structure

Posted 2024-10-26 06:59:52

I want to compute an md5 hash not of a string, but of an entire data structure. I understand the mechanics of a way to do this (dispatch on the type of the value, canonicalize dictionary key order and other randomness, recurse into sub-values, etc). But it seems like the kind of operation that would be generally useful, so I'm surprised I need to roll this myself.

Is there some simpler way in Python to achieve this?

UPDATE: pickle has been suggested, and it's a good idea, but pickling doesn't canonicalize dictionary key order:

>>> import pickle
>>> import hashlib, random
>>> for i in range(10):
...     k = [i*i for i in range(1000)]
...     random.shuffle(k)
...     d = dict.fromkeys(k, 1)
...     p = pickle.dumps(d)
...     print(hashlib.md5(p).hexdigest())
...
51b5855799f6d574c722ef9e50c2622b
43d6b52b885f4ecb4b4be7ecdcfbb04e
e7be0e6d923fe1b30c6fbd5dcd3c20b9
aebb2298be19908e523e86a3f3712207
7db3fe10dcdb70652f845b02b6557061
43945441efe82483ba65fda471d79254
8e4196468769333d170b6bb179b4aee0
951446fa44dba9a1a26e7df9083dcadf
06b09465917d3881707a4909f67451ae
386e3f08a3c1156edd1bd0f3862df481


Comments (8)

巡山小妖精 2024-11-02 06:59:52

json.dumps() can sort dictionaries by key, so you don't need any extra dependencies:

import hashlib
import json

data = ['only', 'lists', [1,2,3], 'dictionaries', {'a':0,'b':1}, 'numbers', 47, 'strings']
data_md5 = hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()

print(data_md5)

Prints:

87e83d90fc0d03f2c05631e2cd68ea02
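As a sanity check (my own sketch, not part of the original answer), re-running the question's shuffled-dict experiment with `sort_keys=True` produces a single identical hash across all ten shuffles:

```python
import hashlib
import json
import random

hashes = set()
for _ in range(10):
    k = [i * i for i in range(1000)]
    random.shuffle(k)               # vary insertion order, same mapping
    d = dict.fromkeys(k, 1)
    payload = json.dumps(d, sort_keys=True).encode('utf-8')
    hashes.add(hashlib.md5(payload).hexdigest())

print(len(hashes))  # → 1: every shuffle hashes identically
```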
忘年祭陌 2024-11-02 06:59:52

bencode sorts dictionaries, so:

import hashlib
import bencode  # third-party package (e.g. "bencode.py" on PyPI)

data = ['only', 'lists', [1,2,3],
        'dictionaries', {'a':0,'b':1}, 'numbers', 47, 'strings']
data_md5 = hashlib.md5(bencode.bencode(data)).hexdigest()
print(data_md5)

prints:

af1b88ca9fd8a3e828b40ed1b9a2cb20
往日 2024-11-02 06:59:52

I ended up writing it myself as I thought I would have to:

import hashlib
import inspect

class Hasher(object):
    """Hashes Python data into md5."""
    def __init__(self):
        self.md5 = hashlib.md5()

    def update(self, v):
        """Add `v` to the hash, recursively if needed."""
        self.md5.update(str(type(v)).encode('utf-8'))
        if isinstance(v, bytes):
            self.md5.update(v)
        elif isinstance(v, str):
            self.md5.update(v.encode('utf-8'))
        elif isinstance(v, (int, float)):
            self.update(str(v))
        elif isinstance(v, (tuple, list)):
            for e in v:
                self.update(e)
        elif isinstance(v, dict):
            for k in sorted(v.keys()):
                self.update(k)
                self.update(v[k])
        else:
            # Fall back to hashing the object's non-callable attributes.
            for k in dir(v):
                if k.startswith('__'):
                    continue
                a = getattr(v, k)
                if inspect.isroutine(a):
                    continue
                self.update(k)
                self.update(a)

    def digest(self):
        """Retrieve the digest of the hash."""
        return self.md5.digest()
梦年海沫深 2024-11-02 06:59:52

You could use the builtin pprint module, which covers more cases than the proposed json.dumps() solution. For example, datetime objects are handled correctly.

Your example rewritten to use pprint instead of json:

>>> import hashlib, random, pprint
>>> for i in range(10):
...     k = [i*i for i in range(1000)]
...     random.shuffle(k)
...     d = dict.fromkeys(k, 1)
...     print(hashlib.md5(pprint.pformat(d).encode('utf-8')).hexdigest())
... 
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
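A small sketch of the datetime point (my own addition, not from the answer): json.dumps raises TypeError on a datetime value, while pprint.pformat falls back to repr(), so any object with a stable repr can be hashed this way:

```python
import hashlib
import pprint
from datetime import datetime

# json.dumps(d) would raise TypeError here; pformat uses repr() instead.
d = {'when': datetime(2024, 10, 26), 'answer': 47}
s = pprint.pformat(d)  # pprint sorts dict keys by default
print(hashlib.md5(s.encode('utf-8')).hexdigest())
```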
遗弃M 2024-11-02 06:59:52

UPDATE: this won't work for dictionaries due to key order randomness. Sorry, I hadn't thought of that.

import hashlib
import pickle
data = ['anything', 'you', 'want']
data_pickle = pickle.dumps(data)
data_md5 = hashlib.md5(data_pickle).hexdigest()

This should work for any python data structure, and for objects as well.
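A minimal illustration of why dicts are the problem (my sketch, not from the answer): on CPython 3.7+, pickle serializes a dict in insertion order, so two equal dicts built in different orders hash differently:

```python
import hashlib
import pickle

def md5_of(obj):
    return hashlib.md5(pickle.dumps(obj)).hexdigest()

a = {'x': 1, 'y': 2}
b = {'y': 2, 'x': 1}   # equal mapping, different insertion order

print(a == b)                  # → True
print(md5_of(a) == md5_of(b))  # → False on CPython 3.7+
```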

青巷忧颜 2024-11-02 06:59:52

While it does require a dependency on joblib, I've found that joblib.hashing.hash(object) works very well and is designed for use with joblib's disk caching mechanism. Empirically it seems to be producing consistent results from run to run, even on data that pickle mixes up on different runs.

Alternatively, you might be interested in artemis-ml's compute_fixed_hash function, which theoretically hashes objects in a way that is consistent across runs. However, I've not tested it myself.

Sorry for posting millions of years after the original question!

梦一生花开无言 2024-11-02 06:59:52

ROCKY way: put all your struct items into one parent entity (if they aren't already), recurse and sort/canonicalize them, then calculate the md5 of its repr.

弄潮 2024-11-02 06:59:52

Calculating a checksum over a JSON serialization is a good idea: it's easy to implement, and easy to extend to Python data structures that aren't natively JSON-serializable.

This is my revised version of @webwurst's answer. It generates the JSON string in chunks that are consumed immediately to compute the final checksum, which prevents excessive memory consumption for a large object:

import hashlib
import json

def checksum(obj, method='md5', *,
        # If dicts with different key order are to be treated as identical.
        # Set False otherwise.
        sort_keys=True,

        # A dict with circular referencing objects can never be hashed and is
        # out of scope of this topic. Skip checking such cases to save an
        # in-memory mapping as we don't expect them. Set True otherwise.
        check_circular=False,

        # Set True to output bytes instead of hex string to save memory, if the
        # checksum is used only for internal comparison and not to be output.
        output_bytes=False,
        ):
    m = hashlib.new(method)
    encoder = json.JSONEncoder(
        check_circular=check_circular,
        sort_keys=sort_keys,
        ensure_ascii=False,  # don't escape Unicode chars to save bytes
        separators=(',', ':'),  # reduce default spaces to be more compact
    )
    for chunk in encoder.iterencode(obj):
        m.update(chunk.encode('UTF-8'))

    if output_bytes:
        return m.digest()

    return m.hexdigest()

def test():
    data = [
        'only',
        'tuples', ('foo', 'bar'),
        'lists', [1,2,3],
        'dictionaries', {'a':0,'b':1},
        'numbers', 47,
        'strings', '哈囉世界',
    ]
    chk = checksum(data)
    print(chk)

if __name__ == '__main__':
    test()
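The "easy to extend" point can be sketched by subclassing JSONEncoder (my own addition, with hypothetical type mappings): the `default` hook converts non-JSON types to stable primitives before they reach the chunked hasher:

```python
import hashlib
import json
from datetime import datetime

class CanonicalEncoder(json.JSONEncoder):
    """Hypothetical extension: map non-JSON types to stable primitives."""
    def default(self, o):
        if isinstance(o, datetime):
            return o.isoformat()
        if isinstance(o, (set, frozenset)):
            return sorted(o)          # canonical order for unordered types
        return super().default(o)

def checksum(obj, method='md5'):
    m = hashlib.new(method)
    encoder = CanonicalEncoder(sort_keys=True, ensure_ascii=False,
                               separators=(',', ':'))
    for chunk in encoder.iterencode(obj):
        m.update(chunk.encode('utf-8'))
    return m.hexdigest()

print(checksum({'when': datetime(2024, 10, 26), 'tags': {'b', 'a'}}))
```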