Deterministic key serialization
I'm writing a mapping class which persists to disk. I am currently allowing only str keys, but it would be nice if I could use a couple more types: hopefully up to anything that is hashable (i.e. the same requirements as the builtin dict), but more reasonably I would accept string, unicode, int, and tuples of these types.
To that end I would like to derive a deterministic serialization scheme.
Option 1 - Pickling the key
The first thought I had was to use the pickle (or cPickle) module to serialize the key, but I noticed that the outputs from pickle and cPickle do not match each other:
>>> import pickle
>>> import cPickle
>>> def dumps(x):
... print repr(pickle.dumps(x))
... print repr(cPickle.dumps(x))
...
>>> dumps(1)
'I1\n.'
'I1\n.'
>>> dumps('hello')
"S'hello'\np0\n."
"S'hello'\np1\n."
>>> dumps((1, 2, 'hello'))
"(I1\nI2\nS'hello'\np0\ntp1\n."
"(I1\nI2\nS'hello'\np1\ntp2\n."
Is there any implementation/protocol combination of pickle which is deterministic for some set of types (e.g. I can only use cPickle with protocol 0)?
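One possible mitigation, sketched below under a couple of assumptions: stick to a single implementation and protocol, and strip the unused memo PUT opcodes (the p0/p1 bits above) with pickletools.optimize from the standard library (available in recent 2.x and 3.x). This does not make pickles deterministic across Python versions, and a key containing two references to the same object can still pickle differently from an equal key built from distinct objects.

import cPickle
import pickletools

def canonical_pickle(key):
    # Always use one implementation (cPickle) and one protocol (0), then
    # drop memo PUT opcodes that are never referenced by a GET, so the
    # memo-numbering differences shown above disappear.
    return pickletools.optimize(cPickle.dumps(key, 0))

# canonical_pickle('hello') should yield "S'hello'\n." -- the p1 memo PUT
# from cPickle (and the p0 from pickle) is removed.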
Option 2 - Repr and ast.literal_eval
Another option is to use repr to dump and ast.literal_eval to load. I have written a function to determine whether a given key would survive this process (it is rather conservative about the types it allows):
def is_reprable_key(key):
    return type(key) in (int, str, unicode) or (type(key) == tuple and all(
        is_reprable_key(x) for x in key))
The question for this method is whether repr itself is deterministic for the types that I have allowed here. I believe this would not survive the 2/3 version barrier due to the change in str/unicode literals. It also would not work for integers where 2**32 - 1 < x < 2**64 when jumping between 32- and 64-bit platforms. Are there any other conditions (i.e. do strings serialize differently under different conditions in the same interpreter)? Edit: I'm just trying to understand the conditions under which this breaks down, not necessarily to overcome them.
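For illustration, a minimal round trip under this scheme might look like the sketch below (dump_key/load_key are hypothetical helper names, not from the question); the caveats above about longs and the 2/3 literal changes still apply.

import ast

def dump_key(key):
    # Caller is assumed to have vetted the key with is_reprable_key().
    return repr(key)

def load_key(text):
    # literal_eval only evaluates literals, so a stored key can never
    # execute arbitrary code on load.
    return ast.literal_eval(text)

assert load_key(dump_key((1, 'a', u'b'))) == (1, 'a', u'b')
# Platform dependence example: on a 32-bit Python 2 build repr(2**40)
# is '1099511627776L' (a long), while a 64-bit build gives
# '1099511627776' (an int).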
Option 3: Custom repr
Another option, which is likely overkill, is to write my own repr which flattens out the aspects of repr that I know (or suspect) may be a problem. I just wrote an example here: http://gist.github.com/423945
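The gist is not reproduced here; the sketch below is merely one shape such a canonical repr could take, using explicit type tags and hex-encoded text so the output cannot depend on word size or string escaping rules.

def canonical_repr(key):
    # Hypothetical sketch, not the code from the gist linked above.
    if isinstance(key, (int, long)):
        return 'i:%d' % key            # no trailing 'L', no word-size issues
    if isinstance(key, str):
        return 's:' + key.encode('hex')
    if isinstance(key, unicode):
        return 'u:' + key.encode('utf-8').encode('hex')
    if isinstance(key, tuple):
        return 't:(%s)' % ','.join(canonical_repr(x) for x in key)
    raise TypeError('unsupported key type: %r' % type(key))

# canonical_repr((1, 'a', u'a')) -> 't:(i:1,s:61,u:61)'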
(If this all fails miserably then I can store the hash of the key along with the pickle of both the key and value, then iterate across rows that have a matching hash looking for one that unpickles to the expected key, but that really does complicate a few other things and I would rather not do it. Edit: it turns out that the builtin hash is not deterministic across platforms either. Scratch that.)
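If that fallback scheme were ever needed, a hedged alternative to the builtin hash would be a cryptographic digest of the key's canonical serialization, which is stable across platforms and builds; a minimal sketch:

import hashlib

def key_digest(serialized_key):
    # serialized_key is the deterministic byte string produced by
    # whichever serialization scheme above is chosen.
    return hashlib.sha1(serialized_key).hexdigest()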
Any insights?
Comments (4)
Important note: repr() is not deterministic if a dictionary or set type is embedded in the object you are trying to serialize. The keys could be printed in any order. For example, print repr({'a':1, 'b':2}) might print out as {'a':1, 'b':2} or {'b':2, 'a':1}, depending on how Python decides to manage the keys in the dictionary.
After reading through much of the source (of CPython 2.6.5) for the implementation of repr for the basic types, I have concluded (with reasonable confidence) that repr of these types is, in fact, deterministic. But, frankly, this was expected.
I believe that the repr method is susceptible to nearly all of the same cases under which the marshal method would break down (longs > 2**32 can never be an int on a 32-bit machine, not guaranteed not to change between versions or interpreters, etc.).
My solution for the time being has been to use the repr method and write a comprehensive test suite to make sure that repr returns the same values on the various platforms I am using.
In the long run a custom repr function would flatten out all platform/implementation differences, but it is certainly overkill for the project at hand. I may do this in the future, however.
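A minimal sketch of the kind of test suite described, assuming the repr/ast.literal_eval scheme from Option 2 (the sample keys are illustrative, not from the original suite):

import ast
import unittest

class ReprRoundTripTest(unittest.TestCase):
    # Run this on every target platform/interpreter; ideally also compare
    # the stored repr() strings themselves, not just the round-tripped values.
    SAMPLE_KEYS = [0, -1, 2**31 - 1, 2**31, 'hello', u'h\xe9llo',
                   (1, 'a', (u'b', 2))]

    def test_round_trip(self):
        for key in self.SAMPLE_KEYS:
            self.assertEqual(ast.literal_eval(repr(key)), key)

if __name__ == '__main__':
    unittest.main()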
"Any value which is an acceptable key for a builtin dict" is not feasible: such values include arbitrary instances of classes that don't define __hash__ or comparisons, implicitly using their id for hashing and comparison purposes, and the ids won't be the same even across runs of the very same program (unless those runs are strictly identical in all respects, which is very tricky to arrange -- identical inputs, identical starting times, absolutely identical environment, etc.).
For strings, unicodes, ints, and tuples whose items are all of these kinds (including nested tuples), the marshal module could help (within a single version of Python: marshaling code can and does change across versions). E.g.:
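A reconstruction of roughly what marshal.dumps shows here on CPython 2.x (the exact bytes are not guaranteed to match the original answer):

>>> import marshal
>>> marshal.dumps(23)
'i\x17\x00\x00\x00'
>>> marshal.dumps('23')
's\x02\x00\x00\x0023'
>>> marshal.dumps(u'23')
'u\x02\x00\x00\x0023'
>>> marshal.dumps((23,))
'(\x01\x00\x00\x00i\x17\x00\x00\x00'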
This is Python 2; Python 3 would be similar (except that all the representations of these byte strings would have a leading b, but that's a cosmetic issue, and of course u'23' becomes invalid syntax and '23' becomes a Unicode string). You can see the general idea: a leading byte represents the type, such as u for Unicode strings, i for integers, ( for tuples; then for containers comes (as a little-endian integer) the number of items, followed by the items themselves, and integers are serialized into little-endian form. marshal is designed to be portable across platforms (for a given version; not across versions) because it's used as the underlying serialization in compiled bytecode files (.pyc or .pyo).
You mention a few requirements in the paragraph, and I think you might want to be a little clearer about these. So far I gather:
So, are you trying to build a general 'this can do it all' solution, or just trying to solve an immediate problem so you can continue with your current project? You should spend a little more time coming up with a clear set of requirements.
Using a hash seemed like the best solution to me, but then you complain that you're going to have multiple rows with the same hash, implying you're going to be storing enough values to even worry about the hash.