Recently I've found MessagePack, an alternative binary serialization format to Google's Protocol Buffers and JSON which also outperforms both.
Also there's the BSON serialization format that is used by MongoDB for storing data.
Can somebody elaborate on the differences and the advantages and disadvantages of BSON vs MessagePack?
Just to complete the list of performant binary serialization formats: there are also Gobs, which are going to be the successor of Google's Protocol Buffers. However, in contrast to all the other formats mentioned, they are not language-agnostic and rely on Go's built-in reflection, although there are Gobs libraries for at least one language other than Go.
// Please note that I'm the author of MessagePack. This answer may be biased.
Format design
Compatibility with JSON
In spite of its name, BSON is less compatible with JSON than MessagePack is.
BSON has special types like "ObjectId", "Min key", "UUID" or "MD5" (I think these types are required by MongoDB). These types are not compatible with JSON. That means some type information can be lost when you convert objects from BSON to JSON, but of course only when these special types are present in the BSON source. This can be a disadvantage if you use both JSON and BSON in a single service.
MessagePack is designed to be transparently converted from/to JSON.
MessagePack is smaller than BSON
MessagePack's format is less verbose than BSON's. As a result, MessagePack can serialize the same objects into fewer bytes than BSON.
For example, a simple map {"a":1, "b":2} is serialized in 7 bytes with MessagePack, while BSON uses 19 bytes.
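The byte counts can be checked by hand. Below is a minimal Python sketch that encodes this one map in both formats using only the standard library. The function names `msgpack_small_map` and `bson_int32_doc` are made up for this illustration, and they cover only the cases this example needs (short string keys, small non-negative integers); they are not a general encoder or any library's API.

```python
import struct

def msgpack_small_map(d):
    """Encode a map of short str keys to small ints (0..127) as MessagePack.

    Hypothetical helper: handles only fixmap, fixstr and positive fixint.
    """
    assert len(d) <= 15
    out = bytearray([0x80 | len(d)])           # fixmap header: 1000xxxx
    for key, value in d.items():
        k = key.encode("utf-8")
        assert len(k) <= 31 and 0 <= value <= 127
        out.append(0xA0 | len(k))              # fixstr header: 101xxxxx
        out += k
        out.append(value)                      # positive fixint: 0xxxxxxx
    return bytes(out)

def bson_int32_doc(d):
    """Encode a flat document of str keys to int32 values as BSON.

    Hypothetical helper: handles only the int32 element type (0x10).
    """
    body = bytearray()
    for key, value in d.items():
        body.append(0x10)                      # element type: int32
        body += key.encode("utf-8") + b"\x00"  # key as a NUL-terminated string
        body += struct.pack("<i", value)       # little-endian 4-byte value
    total = 4 + len(body) + 1                  # length prefix + body + terminator
    return struct.pack("<i", total) + bytes(body) + b"\x00"

doc = {"a": 1, "b": 2}
mp = msgpack_small_map(doc)
bs = bson_int32_doc(doc)
print(len(mp), mp.hex())   # 7 82a16101a16202
print(len(bs), bs.hex())   # 19 bytes: 4-byte length, two 7-byte elements, 1-byte terminator
```

Most of the difference comes from BSON's 4-byte length prefix, its fixed 4-byte integers and the NUL terminators, whereas MessagePack folds the type and small values into a single byte.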
BSON supports in-place updating
With BSON, you can modify part of a stored object without re-serializing the whole object. Let's suppose a map {"a":1, "b":2} is stored in a file and you want to update the value of "a" from 1 to 2000.
With MessagePack, 1 uses only 1 byte but 2000 uses 3 bytes. So "b" must be shifted by 2 bytes, even though "b" itself is not modified.
With BSON, both 1 and 2000 use 5 bytes (a 1-byte type tag plus a 4-byte int32). Because of this verbosity, you don't have to move "b".
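To make the update concrete, here is a small Python sketch that patches hand-written bytes for the map above. The offsets are computed for this specific document only; a real implementation would locate the field by walking the document.

```python
import struct

# BSON bytes for {"a": 1, "b": 2}: length prefix, two int32 elements, terminator.
bson = bytearray.fromhex("13000000" "106100" "01000000" "106200" "02000000" "00")

# The int32 value of "a" starts after the 4-byte length, 1-byte type and "a\0".
offset_a = 4 + 1 + 2
struct.pack_into("<i", bson, offset_a, 2000)   # overwrite 4 bytes in place
print(len(bson))                               # still 19; "b" has not moved

# MessagePack bytes for the same map: fixmap(2), "a" -> 1, "b" -> 2.
mp = bytes.fromhex("82" "a161" "01" "a162" "02")

# 2000 does not fit in a 1-byte fixint, so it becomes uint16: 0xcd + 2 bytes.
new_a = b"\xcd" + struct.pack(">H", 2000)
mp_updated = mp[:3] + new_a + mp[4:]           # everything after "a" shifts
print(len(mp), len(mp_updated))                # 7 9
```

In a file, the BSON case is a single 4-byte overwrite, while the MessagePack case means rewriting everything that follows the changed value.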
MessagePack has RPC
MessagePack, Protocol Buffers, Thrift and Avro support RPC, but BSON doesn't.
These differences reflect that MessagePack was originally designed for network communication, while BSON was designed for storage.
Implementation and API design
MessagePack has type-checking APIs (Java, C++ and D)
MessagePack supports static typing.
Dynamic typing, as used with JSON or BSON, is useful for dynamic languages like Ruby, Python or JavaScript, but it is troublesome for static languages: you must write boring type-checking code.
MessagePack provides a type-checking API that converts dynamically typed objects into statically typed objects. Here is a simple example (C++):
MessagePack has an IDL
Related to the type-checking API, MessagePack supports an IDL (the specification is available at http://wiki.msgpack.org/display/MSGPACK/Design+of+IDL).
Protocol Buffers and Thrift require an IDL (they don't support dynamic typing) and provide more mature IDL implementations.
MessagePack has a streaming API (Ruby, Python, Java, C++, ...)
MessagePack supports streaming deserializers. This feature is useful for network communication. Here is an example (Ruby):
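The idea of a streaming deserializer can also be sketched in plain Python. The code below is not the msgpack library's API: `StreamingUnpacker`, `feed` and `_parse` are names invented for this illustration, and only three MessagePack types (positive fixint, fixstr, fixmap) are handled. It shows the core behaviour: bytes arrive in arbitrary chunks, and an object is returned only once all of its bytes are in the buffer.

```python
class NeedMoreData(Exception):
    """Raised internally when the buffer ends in the middle of an object."""

def _parse(buf, pos):
    """Parse one object starting at buf[pos]; return (value, next_pos)."""
    if pos >= len(buf):
        raise NeedMoreData
    tag = buf[pos]
    pos += 1
    if tag <= 0x7F:                        # positive fixint
        return tag, pos
    if 0xA0 <= tag <= 0xBF:                # fixstr, length in the low 5 bits
        end = pos + (tag & 0x1F)
        if end > len(buf):
            raise NeedMoreData
        return bytes(buf[pos:end]).decode("utf-8"), end
    if 0x80 <= tag <= 0x8F:                # fixmap, pair count in the low 4 bits
        result = {}
        for _ in range(tag & 0x0F):
            key, pos = _parse(buf, pos)
            value, pos = _parse(buf, pos)
            result[key] = value
        return result, pos
    raise ValueError(f"type byte 0x{tag:02x} not handled in this sketch")

class StreamingUnpacker:
    """Hypothetical minimal streaming decoder for the subset above."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Append a chunk and return every object that is now complete."""
        self._buf += chunk
        done = []
        while self._buf:
            try:
                value, used = _parse(self._buf, 0)
            except NeedMoreData:
                break                      # keep the partial object buffered
            del self._buf[:used]
            done.append(value)
        return done

unpacker = StreamingUnpacker()
stream = bytes.fromhex("82a16101a16202" "2a")   # {"a":1,"b":2} followed by 42
for chunk in (stream[:3], stream[3:5], stream[5:]):
    for obj in unpacker.feed(chunk):
        print(obj)
```

Re-parsing from the start of the buffer on every chunk keeps the sketch short; a production decoder would typically remember its parse state instead.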
I think it's very important to mention that it depends on what your client/server environment looks like.
If you are passing bytes along multiple times without inspecting them, such as with a message queue system or when streaming log entries to disk, then you may well prefer a binary encoding for its compact size. Otherwise it's a case-by-case question that depends on the environment.
Some environments have very fast serialization and deserialization to and from msgpack/protobuf, others not so much. In general, the more low-level the language/environment, the better binary serialization will work. In higher-level languages (node.js, .Net, JVM) you will often see that JSON serialization is actually faster. The question then becomes whether your network is more or less constrained than your memory/CPU.
With regard to msgpack vs bson vs protocol buffers: msgpack produces the fewest bytes of the group, with protocol buffers being about the same. BSON defines a broader set of native types than the other two and may be a better match for your object model, but this makes it more verbose. Protocol buffers have the advantage of being designed to stream, which makes them a more natural fit for a binary transfer/storage format.
Personally, I would lean towards the transparency that JSON offers directly, unless there is a clear need for lighter traffic. Over HTTP with gzipped data, the difference in network overhead between the formats is even less of an issue.
Well, as the author said, MessagePack was originally designed for network communication while BSON was designed for storage.
MessagePack is compact while BSON is verbose.
MessagePack is meant to be space-efficient while BSON is designed for CRUD (time-efficient).
Most importantly, MessagePack's type system (prefix) follows Huffman coding. Here I drew a Huffman tree of MessagePack (click the link to see the image):
A key difference not yet mentioned is that BSON contains size information in bytes for the entire document and for each nested sub-document.
This has two major benefits for restricted environments (e.g. embedded) where size and performance are important.
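One thing the length prefixes make possible (offered as an illustration, not necessarily what the author had in mind) is jumping over a nested sub-document without parsing it. The Python sketch below walks the top level of a hand-assembled BSON document. `find_int32` is a hypothetical helper that understands only int32 (0x10) and embedded document (0x03) elements.

```python
import struct

def find_int32(doc, wanted):
    """Return the int32 stored under a top-level key, skipping sub-documents."""
    (total,) = struct.unpack_from("<i", doc, 0)
    pos, end = 4, total - 1                    # elements sit between prefix and 0x00
    while pos < end:
        etype = doc[pos]
        key_end = doc.index(b"\x00", pos + 1)  # key is a NUL-terminated string
        key = doc[pos + 1:key_end].decode("utf-8")
        pos = key_end + 1
        if etype == 0x10:                      # int32: fixed 4-byte value
            if key == wanted:
                return struct.unpack_from("<i", doc, pos)[0]
            pos += 4
        elif etype == 0x03:                    # embedded document
            (sub_len,) = struct.unpack_from("<i", doc, pos)
            pos += sub_len                     # jump over it without parsing
        else:
            raise ValueError(f"element type 0x{etype:02x} not handled in this sketch")
    return None

# {"big": {"x": 1, "y": 2}, "n": 7}, assembled by hand for the example.
sub = bytes.fromhex("13000000" "107800" "01000000" "107900" "02000000" "00")
body = b"\x03big\x00" + sub + b"\x10n\x00" + struct.pack("<i", 7)
doc = struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

print(len(doc), find_int32(doc, "n"))          # 36 7
```

The same top-level prefix also tells a reader how many bytes to fetch for the whole document before any parsing starts.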
I ran a quick benchmark to compare the encoding and decoding speed of MessagePack and BSON. BSON is faster, at least if you have large binary arrays:
Using C# Newtonsoft.Json and MessagePack by neuecc:
A quick test shows that minified JSON is deserialized faster than binary MessagePack. In the tests, Article.json is 550 KB of minified JSON and Article.mpack is the 420 KB MessagePack version of it. This may be an implementation issue, of course.
MessagePack:
JSON:
So times are:
So space is saved, but is it faster? No.
Tested versions: