How do I hash large objects (datasets) in Python?
I would like to calculate a hash of a Python class containing a dataset for machine learning. The hash is meant to be used for caching, so I was thinking of md5 or sha1.
The problem is that most of the data is stored in NumPy arrays, which do not provide a __hash__() member. Currently I call pickle.dumps() on each member and calculate a hash from the resulting strings. However, I found links indicating that the same object can produce different serialization strings.
What would be the best method to calculate a hash for a Python class containing NumPy arrays?
Comments (7)
Thanks to John Montgomery, I think I have found a solution, and I think it has less overhead than converting every number in possibly huge arrays to strings.
I can create a byte view of the arrays and use these to update the hash. This seems to give the same digest as updating the hash directly with the array, presumably because hashlib reads the same underlying bytes either way.
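The byte-view idea can be sketched roughly as follows. This is a minimal illustration, not the poster's original snippet; the array contents and variable names are assumptions chosen for the example.

```python
import hashlib
import numpy as np

# Example data (an assumption; any C-contiguous array behaves the same way).
a = np.arange(12, dtype=np.float64).reshape(3, 4)

# Reinterpret the same memory as unsigned bytes; no data is copied.
byte_view = a.view(np.uint8)

digest_view = hashlib.sha1(byte_view).hexdigest()     # hash via the byte view
digest_direct = hashlib.sha1(a).hexdigest()           # hash the array directly
digest_bytes = hashlib.sha1(a.tobytes()).hexdigest()  # hash an explicit copy

print(a.dtype, byte_view.dtype)
print(digest_view == digest_direct == digest_bytes)
```

All three digests are computed over the same raw bytes, which is why the view and the array agree.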
What is the format of the data in the arrays? Couldn't you just iterate through the arrays, convert them to strings (by some reproducible means), and then feed those into your hash via update()?
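A rough sketch of that iterate-and-convert approach might look like the following. The helper name hash_via_strings and the choice of repr() as the "reproducible means" are assumptions made for illustration, not the original snippet.

```python
import hashlib
import numpy as np

def hash_via_strings(arr):
    """Hash an array by feeding a text form of every element into md5.

    Hypothetical helper: repr() of the Python scalar is used as the
    reproducible string conversion, with a separator between values.
    """
    h = hashlib.md5()
    for value in arr.ravel():
        h.update(repr(value.item()).encode("utf-8"))
        h.update(b",")  # separator so that 1,23 and 12,3 hash differently
    return h.hexdigest()

a = np.array([[1.5, 2.5], [3.5, 4.5]])
b = a.copy()
same = hash_via_strings(a) == hash_via_strings(b)

b[0, 0] = 9.0
changed = hash_via_strings(a) != hash_via_strings(b)
```

This visits every element in a Python loop, so it is likely to be slow for large arrays.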
Don't forget, though, that numpy arrays don't provide __hash__() because they are mutable. So be careful not to modify the arrays after you have calculated your hash (as it will no longer match).
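One way to guard against accidental modification after hashing is to mark the array read-only. This is only a sketch of one possible safeguard, not something the answer itself proposes; the variable names are illustrative.

```python
import hashlib
import numpy as np

a = np.arange(5, dtype=np.int64)
digest_before = hashlib.md5(a.tobytes()).hexdigest()

# Freeze the array so that in-place writes raise instead of silently
# invalidating the digest computed above.
a.setflags(write=False)

try:
    a[0] = 99
    modified = True
except ValueError:
    # NumPy refuses assignment to a read-only array.
    modified = False

digest_after = hashlib.md5(a.tobytes()).hexdigest()
```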
There is a package for memoizing functions that take numpy arrays as inputs: joblib. Found via this question.
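As a small illustration, joblib also exposes a hash() function that accepts containers holding numpy arrays. The dictionary layout below is an assumption standing in for the dataset class from the question.

```python
import joblib
import numpy as np

# Two separately built containers with identical content (illustrative data).
data1 = {"features": np.arange(6.0).reshape(2, 3), "labels": np.array([0, 1])}
data2 = {"features": np.arange(6.0).reshape(2, 3), "labels": np.array([0, 1])}

key1 = joblib.hash(data1)  # hex digest string, md5 by default
key2 = joblib.hash(data2)

# Changing any array content should change the key.
data2["labels"][0] = 1
key3 = joblib.hash(data2)
```

For caching whole function calls rather than computing keys by hand, joblib.Memory builds on the same hashing.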
Using NumPy 1.10.1 and Python 2.7.6, you can now simply hash numpy arrays using hashlib if the array is C-contiguous (use numpy.ascontiguousarray() if not).
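A minimal sketch of this, written for Python 3 and with a hypothetical helper name sha1_of_array, could be:

```python
import hashlib
import numpy as np

def sha1_of_array(arr):
    """Return the sha1 hex digest of an array's data in C order."""
    # ascontiguousarray hands back a C-contiguous array, copying only
    # when the input is not already laid out that way.
    return hashlib.sha1(np.ascontiguousarray(arr)).hexdigest()

a = np.arange(12, dtype=np.int32).reshape(3, 4)
column_slice = a[:, ::2]  # every other column: not C-contiguous

digest_full = sha1_of_array(a)
digest_slice = sha1_of_array(column_slice)
```

Passing column_slice straight to hashlib would be rejected because its memory is not contiguous, which is why the helper normalizes the layout first.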
Here is how I do it in jug (git HEAD at the time of this answer):
The reason is that e.data is only available for some arrays (contiguous arrays). The same goes for a.view(np.uint8), which fails with a non-descriptive type error if the array is not contiguous.
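The following is not jug's actual code, only a sketch of the general pattern the answer describes: use the array's buffer when it is contiguous and fall back to a copy otherwise. The helper names update_hash and digest, and the handling of dicts, lists and other values, are assumptions.

```python
import hashlib
import numpy as np

def update_hash(h, obj):
    """Feed obj into hash object h (hypothetical helper, not jug's code)."""
    if isinstance(obj, np.ndarray):
        if not obj.flags["C_CONTIGUOUS"]:
            # Non-contiguous arrays have no flat buffer to hash, so copy.
            obj = obj.copy(order="C")
        h.update(obj.data)
    elif isinstance(obj, dict):
        # Sort keys for a stable order (assumes the keys are comparable).
        for key in sorted(obj):
            update_hash(h, key)
            update_hash(h, obj[key])
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            update_hash(h, item)
    else:
        # Assumption: repr() is stable enough for plain scalars and strings.
        h.update(repr(obj).encode("utf-8"))

def digest(obj):
    h = hashlib.sha1()
    update_hash(h, obj)
    return h.hexdigest()

x = np.arange(6.0).reshape(2, 3)
d1 = digest({"X": x.T, "y": [1, 2], "name": "train"})         # non-contiguous
d2 = digest({"X": x.T.copy(), "y": [1, 2], "name": "train"})  # contiguous copy
```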
Fastest by some margin seems to be:
Here a is a numpy ndarray.
Obviously this is not secure hashing, but it should be good for caching etc.
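The answer's own snippet is not shown here. As one possible non-cryptographic option, and purely as an assumption rather than the poster's method, Python's built-in hash() can be applied to the array's bytes:

```python
import numpy as np

def quick_key(a):
    """Non-cryptographic cache key for an array (hypothetical helper)."""
    # bytes objects are hashable; tobytes() copies the data once.
    return hash(a.tobytes())

cache = {}
a = np.arange(1000, dtype=np.float64)
cache[quick_key(a)] = "expensive result"

# An array with identical content maps to the same key in this process.
hit = quick_key(a.copy()) in cache
```

Note that the built-in hash of bytes is randomized per interpreter process unless PYTHONHASHSEED is fixed, so keys like this are only suitable for in-memory caches, not for caches persisted to disk.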
array.data is always hashable, because it's a buffer object. Easy :) That is, unless you care about the difference between differently shaped arrays with exactly the same data: this is suitable unless shape, byte order, and other array 'parameters' must also figure into the hash.
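If shape and dtype do need to count, they can be mixed into the digest alongside the raw data. The sketch below uses hashlib rather than hash(array.data), since in Python 3 array.data is a memoryview and calling hash() on it may raise ValueError for writable or non-byte buffers. The helper name array_fingerprint is an assumption.

```python
import hashlib
import numpy as np

def array_fingerprint(arr):
    """Digest covering dtype (including byte order), shape, and data."""
    arr = np.ascontiguousarray(arr)
    h = hashlib.sha1()
    h.update(arr.dtype.str.encode("ascii"))    # e.g. '<f8' encodes byte order
    h.update(repr(arr.shape).encode("ascii"))  # e.g. '(2, 3)'
    h.update(arr)                              # raw bytes of the array
    return h.hexdigest()

a = np.arange(6, dtype=np.int64)
b = a.reshape(2, 3)  # same bytes, different shape

same_raw = hashlib.sha1(a).hexdigest() == hashlib.sha1(b).hexdigest()
differs_by_shape = array_fingerprint(a) != array_fingerprint(b)
differs_by_dtype = array_fingerprint(a) != array_fingerprint(a.view(np.float64))
```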