Why is json serialization so much faster than yaml serialization in Python?
I have code that relies heavily on yaml for cross-language serialization and while working on speeding some stuff up I noticed that yaml was insanely slow compared to other serialization methods (e.g., pickle, json).
So what really blows my mind is that json is so much faster than yaml when the output is nearly identical.
>>> import yaml, cjson; d={'foo': {'bar': 1}}
>>> yaml.dump(d, Dumper=yaml.SafeDumper)
'foo: {bar: 1}\n'
>>> cjson.encode(d)
'{"foo": {"bar": 1}}'
>>> import yaml, cjson;
>>> timeit("yaml.dump(d, Dumper=yaml.SafeDumper)", setup="import yaml; d={'foo': {'bar': 1}}", number=10000)
44.506911039352417
>>> timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup="import yaml; d={'foo': {'bar': 1}}", number=10000)
16.852826118469238
>>> timeit("cjson.encode(d)", setup="import cjson; d={'foo': {'bar': 1}}", number=10000)
0.073784112930297852
PyYaml's CSafeDumper and cjson are both written in C so it's not like this is a C vs Python speed issue. I've even added some random data to it to see if cjson is doing any caching, but it's still way faster than PyYaml. I realize that yaml is a superset of json, but how could the yaml serializer be 2 orders of magnitude slower with such simple input?
In general, it's not the complexity of the output that determines the speed of parsing, but the complexity of the accepted input. The JSON grammar is very concise. The YAML parsers are comparatively complex, leading to increased overheads.
I'm not a YAML parser implementor, so I can't speak specifically to the orders of magnitude without some profiling data and a big corpus of examples. In any case, be sure to test over a large body of inputs before feeling confident in benchmark numbers.
Update Whoops, misread the question. :-( Serialization can still be blazingly fast despite the large input grammar; however, browsing the source, it looks like PyYAML's Python-level serialization constructs a representation graph whereas simplejson encodes builtin Python datatypes directly into text chunks.
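A toy illustration of that difference (hypothetical code, not PyYAML's or simplejson's actual internals): an encoder that first builds an intermediate representation tree and then walks it to emit text necessarily does more work per object than one that writes builtin types straight to text.

```python
import json

# Direct, single-pass encoding: builtins go straight to text.
def encode_direct(obj):
    return json.dumps(obj)

# Two-phase encoding: first build an intermediate "representation"
# node tree (tag + value), then walk it to emit text. This loosely
# mimics the extra representation step of a YAML-style dumper.
def represent(obj):
    if isinstance(obj, dict):
        return ("map", [(represent(k), represent(v)) for k, v in obj.items()])
    if isinstance(obj, list):
        return ("seq", [represent(v) for v in obj])
    return ("scalar", obj)

def emit(node):
    tag, value = node
    if tag == "map":
        return "{" + ", ".join(f"{emit(k)}: {emit(v)}" for k, v in value) + "}"
    if tag == "seq":
        return "[" + ", ".join(emit(v) for v in value) + "]"
    return json.dumps(value)  # scalars: reuse json's quoting rules

def encode_two_phase(obj):
    return emit(represent(obj))

d = {"foo": {"bar": 1}}
print(encode_direct(d))     # {"foo": {"bar": 1}}
print(encode_two_phase(d))  # {"foo": {"bar": 1}}
```

Both produce the same output, but the two-phase version allocates a full tree of intermediate nodes first, which is where time goes even for trivial input.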
In applications I've worked on, type inference between strings and numbers (float/int) is where the largest overhead in parsing yaml lies, because strings can be written without quotes. Because all strings in json are in quotes, there is no backtracking when parsing strings. A great example of where this slows things down is the value 0000000000000000000s. You cannot tell this value is a string until you've read to the end of it.
The other answers are correct but this is a specific detail that I've discovered in practice.
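A rough sketch of why this costs time (a hypothetical resolver, not PyYAML's actual regexes): for every unquoted scalar a YAML reader has to test the whole token against number-like patterns before it can conclude it is a string, whereas a JSON reader knows from the opening quote alone.

```python
import re

# Hypothetical resolver patterns an unquoted YAML scalar is checked
# against before falling back to "it's a string".
INT_RE = re.compile(r'^[-+]?[0-9]+$')
FLOAT_RE = re.compile(r'^[-+]?([0-9]*\.[0-9]+|[0-9]+\.[0-9]*)([eE][-+]?[0-9]+)?$')

def resolve(scalar):
    """Decide what type an unquoted scalar would load as."""
    if INT_RE.match(scalar):
        return "int"
    if FLOAT_RE.match(scalar):
        return "float"
    return "str"

# The trailing 's' is only discovered after scanning all the digits:
print(resolve("0000000000000000000"))   # int
print(resolve("0000000000000000000s"))  # str
```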
Speaking about efficiency, I used YAML for a time and felt attracted by the simplicity that some name/value assignments take on in this language. However, in the process I tripped every so often over one of YAML's finesses: subtle variations in the grammar that allow you to write special cases in a more concise style, and such. In the end, although YAML's grammar is almost certainly formally consistent, it left me with a certain feeling of 'vagueness'. I then restricted myself to not touching existing, working YAML code and writing everything new in a more roundabout, fail-safe syntax, which made me abandon YAML altogether. The upshot is that YAML tries to look like a W3C standard, and produces a small library of hard-to-read literature concerning its concepts and rules.
This, I feel, is by far more intellectual overhead than needed. Look at SGML/XML: developed by IBM in the roaring 60s, standardized by the ISO, known (in a dumbed-down and modified form) as HTML to uncounted millions of people, documented and documented and documented again the world over. Comes up little JSON and slays that dragon. How could JSON become so widely used in so short a time, with just one meager website (and a javascript luminary to back it)? It is in its simplicity, the sheer absence of doubt in its grammar, the ease of learning and using it.
XML and YAML are hard for humans, and they are hard for computers. JSON is quite friendly and easy to both humans and computers.
A cursory look at python-yaml suggests its design is much more complex than cjson's. More complex designs almost invariably mean slower designs, and this one is far more complex than most people will ever need.
Although you have an accepted answer, unfortunately that only does some handwaving in the direction of the PyYAML documentation, and quotes a statement in that documentation that is not correct: PyYAML does not make a representation graph during dumping, it creates a linear stream (and, just like json, keeps a bucket of IDs to see if there are recursions).
First of all you have to realize that while the cjson dumper is handcrafted C code only, YAML's CSafeDumper shares two of the four dump stages (the Representer and the Resolver) with the normal pure-Python SafeDumper, and the other two stages (the Serializer and the Emitter) are not written completely handcrafted in C, but consist of a Cython module which calls the C library libyaml for emitting.

Apart from that significant part, the simple answer to your question as to why it takes longer is that dumping YAML does more. This is not so much because YAML is harder, as @flow claims, but because the extra things YAML can do make it much more powerful than JSON and also more user-friendly if you need to process the result with an editor. That means more time is spent in the YAML library even when applying these extra features, and in many cases also just checking whether something applies.
Here is an example: even if you have never gone through the PyYAML code, you'll have noticed that the dumper doesn't quote foo and bar. That is not because these strings are keys; YAML doesn't have the restriction that JSON has, that a key for a mapping needs to be a string. E.g. a Python string that is a value in a mapping can also be unquoted (i.e. plain).
The emphasis is on can, because it is not always so. Take for instance a string that consists of numeral characters only: 12345678. This needs to be written out with quotes, as otherwise it would look exactly like a number (and be read back in as such when parsing). How does PyYAML know when to quote a string and when not? On dumping, it actually first dumps the string, then parses the result to make sure that, when it reads that result back, it gets the original value. And if that proves not to be the case, it applies quotes.
Let me repeat the important part of the previous sentence again, so
you don't have to re-read it:
This means it applies all of the regex matching it does when
loading to see if the resulting scalar would load as an integer,
float, boolean, datetime, etc., to determine whether quotes need to be
applied or not.¹
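A minimal sketch of that round-trip idea (assumed logic, not PyYAML's implementation): emit the string plain, simulate reading it back, and add quotes only if the plain form would come back as something other than the original string.

```python
# Sketch of the "dump, re-read, compare" quoting decision (assumed
# logic, not PyYAML's actual code).
def reads_back_as(text):
    """What a loader would turn this plain scalar into."""
    for conv in (int, float):
        try:
            return conv(text)
        except ValueError:
            pass
    if text in ("true", "false"):
        return text == "true"
    return text

def dump_scalar(s):
    # Quote only when the plain form would load as a different value.
    return s if reads_back_as(s) == s else '"%s"' % s

print(dump_scalar("foo"))       # foo        (plain is safe)
print(dump_scalar("12345678"))  # "12345678" (plain would load as int)
print(dump_scalar("true"))      # "true"     (plain would load as bool)
```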
In any real application with complex data, a JSON-based dumper/loader is too simple to use directly, and a lot more intelligence has to be in your program compared to dumping the same complex data directly to YAML. A simplified example is when you want to work with date-time stamps: in that case you have to convert a string back and forth to datetime.datetime yourself if you are using JSON. During loading you have to do that either based on the fact that this is a value associated with some (hopefully recognisable) key, or based on a position in a list, or based on the format of the string (e.g. using a regex). In all of these cases much more work needs to be done in your program. The same holds for dumping, and that does not only mean extra development time.
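The key-based variant can be sketched with the stdlib json module (which has the same limitation as cjson here): the datetime must be flattened to a string on dump and rebuilt by hand on load, based on knowing which key holds it.

```python
import json
import datetime

d = {"logged": datetime.datetime(2016, 8, 18, 13, 41, 0)}

# Dumping: json does not know datetime, so you convert it yourself.
text = json.dumps({"logged": d["logged"].isoformat()})

# Loading: you must know that the value under "logged" is a datetime
# and convert the string back by hand.
raw = json.loads(text)
raw["logged"] = datetime.datetime.strptime(raw["logged"], "%Y-%m-%dT%H:%M:%S")
print(raw["logged"] == d["logged"])  # True
```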
Let's regenerate your timings with what I get on my machine, so we can compare them with other measurements. I rewrote your code somewhat, because it was incomplete (where does timeit come from?) and imported things twice; it was also impossible to just cut and paste because of the >>> prompts.
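For reference, a self-contained version of such a benchmark, with the stdlib json module standing in for cjson (cjson is Python-2-era); absolute numbers will differ per machine, but the shape of the comparison is the same. The PyYAML timings are attempted only if the package happens to be installed.

```python
from timeit import timeit

# json timing: always available in the stdlib.
setup_json = "import json; d = {'foo': {'bar': 1}}"
t_json = timeit("json.dumps(d)", setup=setup_json, number=10000)
print("json.dumps  x10000: %.4fs" % t_json)

# PyYAML timings, only if installed (and CSafeDumper only if PyYAML
# was built against libyaml):
try:
    import yaml  # noqa: F401
    setup_yaml = "import yaml; d = {'foo': {'bar': 1}}"
    t_py = timeit("yaml.dump(d, Dumper=yaml.SafeDumper)",
                  setup=setup_yaml, number=10000)
    print("SafeDumper  x10000: %.4fs" % t_py)
    t_c = timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)",
                 setup=setup_yaml, number=10000)
    print("CSafeDumper x10000: %.4fs" % t_c)
except (ImportError, AttributeError):
    print("PyYAML (or its libyaml build) not available; skipping yaml timings")
```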
Now let's dump a simple data structure that includes a datetime. For the timing of that, I created a module myjson that wraps cjson.encode and has the above stringify defined.
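The myjson module itself is not shown above; a hedged reconstruction of the idea (the names stringify and encode come from the answer, the bodies are an assumption, and stdlib json stands in for cjson) looks roughly like this:

```python
# Hypothetical reconstruction of such a wrapper module; the answer's
# actual myjson code is not shown. stdlib json stands in for cjson.
import json
import datetime

def stringify(obj):
    """Fallback for types json cannot encode natively."""
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError("cannot serialize %r" % type(obj))

def encode(data):
    return json.dumps(data, default=stringify)

d = {"tm": datetime.datetime(2016, 8, 18, 13, 41, 0), "n": 1}
print(encode(d))  # {"tm": "2016-08-18T13:41:00", "n": 1}
```

This extra per-value type handling is exactly the kind of work a YAML dumper does for you, and it is where the JSON side loses part of its raw speed advantage on realistic data.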
That still rather simple output already brings you back from a two-orders-of-magnitude difference in speed to less than one order of magnitude.

YAML's plain scalars and block-style formatting make for more readable data. That you can have a trailing comma in a sequence (or mapping) makes for fewer failures when manually editing YAML data than with the same data in JSON.
YAML tags allow for in-data indication of your (complex) types. When using JSON you have to take care, in your code, of anything more complex than mappings, sequences, integers, floats, booleans and strings. Such code requires development time, and is unlikely to be as fast as python-cjson (you are of course free to write your code in C as well).

Dumping some data, like recursive data structures (e.g. topological data) or complex keys, is pre-defined in the PyYAML library. There the JSON library just errors out, and implementing a workaround for that is non-trivial and most likely slows things down to the point where the speed difference is less relevant.
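The recursion point is easy to demonstrate with the stdlib json module: a self-referential structure, which PyYAML would dump using an anchor/alias pair, simply makes json.dumps raise, and any workaround is on you.

```python
import json

data = [1, 2]
data.append(data)  # self-referential list

# PyYAML would dump this as something like "&id001 [1, 2, *id001]";
# json.dumps refuses outright:
try:
    json.dumps(data)
except ValueError as e:
    print("json:", e)  # Circular reference detected
```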
Such power and flexibility come at the price of lower speed. When dumping many simple things, JSON is the better choice; you are unlikely to edit the result by hand anyway. For anything that involves editing, or complex objects, or both, you should still consider using YAML.

¹ It is possible to force dumping of all Python strings as YAML scalars with (double) quotes, but setting the style is not enough to prevent all read-back.