Deterministic key serialization
I'm writing a mapping class which persists to disk. I am currently allowing only str keys, but it would be nice if I could use a couple more types: hopefully up to anything that is hashable (i.e. the same requirements as the builtin dict), but more reasonably I would accept string, unicode, int, and tuples of these types.
To that end I would like to derive a deterministic serialization scheme.
Option 1 - Pickling the key
The first thought I had was to use the pickle (or cPickle) module to serialize the key, but I noticed that the outputs from pickle and cPickle do not match each other:
>>> import pickle
>>> import cPickle
>>> def dumps(x):
... print repr(pickle.dumps(x))
... print repr(cPickle.dumps(x))
...
>>> dumps(1)
'I1\n.'
'I1\n.'
>>> dumps('hello')
"S'hello'\np0\n."
"S'hello'\np1\n."
>>> dumps((1, 2, 'hello'))
"(I1\nI2\nS'hello'\np0\ntp1\n."
"(I1\nI2\nS'hello'\np1\ntp2\n."
Is there any implementation/protocol combination of pickle which is deterministic for some set of types (e.g. I can only use cPickle with protocol 0)?
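One possible mitigation, sketched below under a couple of assumptions: stick to a single implementation and protocol, and strip the unused memo PUT opcodes (the p0/p1 bits above) with pickletools.optimize from the standard library (available in recent 2.x and 3.x). This does not make pickles deterministic across Python versions, and a key containing two references to the same object can still pickle differently from an equal key built from distinct objects.

import cPickle
import pickletools

def canonical_pickle(key):
    # Always use one implementation (cPickle) and one protocol (0), then
    # drop memo PUT opcodes that are never referenced by a GET, so the
    # memo-numbering differences shown above disappear.
    return pickletools.optimize(cPickle.dumps(key, 0))

# canonical_pickle('hello') should yield "S'hello'\n." -- the p1 memo PUT
# from cPickle (and the p0 from pickle) is removed.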
Option 2 - Repr and ast.literal_eval
Another option is to use repr to dump and ast.literal_eval to load. I have written a function to determine whether a given key would survive this process (it is rather conservative about the types it allows):
def is_reprable_key(key):
    return type(key) in (int, str, unicode) or (type(key) == tuple and all(
        is_reprable_key(x) for x in key))
The question for this method is whether repr itself is deterministic for the types that I have allowed here. I believe this would not survive the 2/3 version barrier due to the change in str/unicode literals. It also would not work for integers where 2**32 - 1 < x < 2**64 when jumping between 32- and 64-bit platforms. Are there any other conditions (i.e. do strings serialize differently under different conditions in the same interpreter)? Edit: I'm just trying to understand the conditions under which this breaks down, not necessarily to overcome them.
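For illustration, a minimal round trip under this scheme might look like the sketch below (dump_key/load_key are hypothetical helper names, not from the question); the caveats above about longs and the 2/3 literal changes still apply.

import ast

def dump_key(key):
    # Caller is assumed to have vetted the key with is_reprable_key().
    return repr(key)

def load_key(text):
    # literal_eval only evaluates literals, so a stored key can never
    # execute arbitrary code on load.
    return ast.literal_eval(text)

assert load_key(dump_key((1, 'a', u'b'))) == (1, 'a', u'b')
# Platform dependence example: on a 32-bit Python 2 build repr(2**40)
# is '1099511627776L' (a long), while a 64-bit build gives
# '1099511627776' (an int).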
Option 3: Custom repr
Another option, which is likely overkill, is to write my own repr which flattens out the aspects of repr that I know (or suspect) may be a problem. I just wrote an example here: http://gist.github.com/423945
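The gist is not reproduced here; the sketch below is merely one shape such a canonical repr could take, using explicit type tags and hex-encoded text so the output cannot depend on word size or string escaping rules.

def canonical_repr(key):
    # Hypothetical sketch, not the code from the gist linked above.
    if isinstance(key, (int, long)):
        return 'i:%d' % key            # no trailing 'L', no word-size issues
    if isinstance(key, str):
        return 's:' + key.encode('hex')
    if isinstance(key, unicode):
        return 'u:' + key.encode('utf-8').encode('hex')
    if isinstance(key, tuple):
        return 't:(%s)' % ','.join(canonical_repr(x) for x in key)
    raise TypeError('unsupported key type: %r' % type(key))

# canonical_repr((1, 'a', u'a')) -> 't:(i:1,s:61,u:61)'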
(If this all fails miserably then I can store the hash of the key along with the pickle of both the key and value, then iterate across rows that have a matching hash looking for one that unpickles to the expected key, but that really does complicate a few other things and I would rather not do it. Edit: it turns out that the builtin hash is not deterministic across platforms either. Scratch that.)
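If that fallback scheme were ever needed, a hedged alternative to the builtin hash would be a cryptographic digest of the key's canonical serialization, which is stable across platforms and builds; a minimal sketch:

import hashlib

def key_digest(serialized_key):
    # serialized_key is the deterministic byte string produced by
    # whichever serialization scheme above is chosen.
    return hashlib.sha1(serialized_key).hexdigest()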
Any insights?
Comments (4)
Important note: repr() is not deterministic if a dictionary or set type is embedded in the object you are trying to serialize. The keys could be printed in any order. For example, print repr({'a':1, 'b':2}) might print out as {'a':1, 'b':2} or {'b':2, 'a':1}, depending on how Python decides to manage the keys in the dictionary.
After reading through much of the source (of CPython 2.6.5) for the implementation of repr for the basic types, I have concluded (with reasonable confidence) that repr of these types is, in fact, deterministic. But, frankly, this was expected.
I believe that the repr method is susceptible to nearly all of the same cases under which the marshal method would break down (longs > 2**32 can never be an int on a 32-bit machine, not guaranteed not to change between versions or interpreters, etc.).
My solution for the time being has been to use the repr method and write a comprehensive test suite to make sure that repr returns the same values on the various platforms I am using.
In the long run a custom repr function would flatten out all platform/implementation differences, but it is certainly overkill for the project at hand. I may do this in the future, however.
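A minimal sketch of the kind of test suite described, assuming the repr/ast.literal_eval scheme from Option 2 (the sample keys are illustrative, not from the original suite):

import ast
import unittest

class ReprRoundTripTest(unittest.TestCase):
    # Run this on every target platform/interpreter; ideally also compare
    # the stored repr() strings themselves, not just the round-tripped values.
    SAMPLE_KEYS = [0, -1, 2**31 - 1, 2**31, 'hello', u'h\xe9llo',
                   (1, 'a', (u'b', 2))]

    def test_round_trip(self):
        for key in self.SAMPLE_KEYS:
            self.assertEqual(ast.literal_eval(repr(key)), key)

if __name__ == '__main__':
    unittest.main()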
"Any value which is an acceptable key for a builtin dict" is not feasible: such values include arbitrary instances of classes that don't define __hash__ or comparisons, implicitly using their id for hashing and comparison purposes, and the ids won't be the same even across runs of the very same program (unless those runs are strictly identical in all respects, which is very tricky to arrange -- identical inputs, identical starting times, absolutely identical environment, etc.).
For strings, unicodes, ints, and tuples whose items are all of these kinds (including nested tuples), the marshal module could help (within a single version of Python: marshaling code can and does change across versions). E.g.:
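A reconstruction of roughly what marshal.dumps shows here on CPython 2.x (the exact bytes are not guaranteed to match the original answer):

>>> import marshal
>>> marshal.dumps(23)
'i\x17\x00\x00\x00'
>>> marshal.dumps('23')
's\x02\x00\x00\x0023'
>>> marshal.dumps(u'23')
'u\x02\x00\x00\x0023'
>>> marshal.dumps((23,))
'(\x01\x00\x00\x00i\x17\x00\x00\x00'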
This is Python 2; Python 3 would be similar (except that all the representations of these byte strings would have a leading b, but that's a cosmetic issue, and of course u'23' becomes invalid syntax and '23' becomes a Unicode string). You can see the general idea: a leading byte represents the type, such as u for Unicode strings, i for integers, ( for tuples; then for containers comes (as a little-endian integer) the number of items, followed by the items themselves, and integers are serialized into little-endian form. marshal is designed to be portable across platforms (for a given version; not across versions) because it's used as the underlying serialization in compiled bytecode files (.pyc or .pyo).
You mention a few requirements in the paragraph, and I think you might want to be a little clearer about these. So far I gather:
So, are you trying to build a general 'this can do it all' solution, or just trying to solve an immediate problem so you can continue with your current project? You should spend a little more time coming up with a clear set of requirements.
Using a hash seemed like the best solution to me, but then you complain that you're going to have multiple rows with the same hash, implying you're going to be storing enough values to even worry about the hash.