High-performance entity serialization: BSON vs. MessagePack (vs. JSON)

Posted 2024-11-15 15:39:52

Recently I've found MessagePack, an alternative binary serialization format to Google's Protocol Buffers and JSON, which also outperforms both.

Also there's the BSON serialization format that is used by MongoDB for storing data.

Can somebody elaborate on the differences and the advantages/disadvantages of BSON vs. MessagePack?


Just to complete the list of performant binary serialization formats: there are also Gobs, which are going to be the successor of Google's Protocol Buffers. However, in contrast to all the other formats mentioned, Gobs are not language-agnostic and rely on Go's built-in reflection, though there are also Gobs libraries for at least one language other than Go.

Answers (6)

紙鸢 2024-11-22 15:39:52

// Please note that I'm the author of MessagePack. This answer may be biased.

Format design

  1. Compatibility with JSON

    In spite of its name, BSON's compatibility with JSON is not as good as MessagePack's.

    BSON has special types like "ObjectId", "Min key", "UUID" or "MD5" (I think these types are required by MongoDB). These types are not compatible with JSON. That means some type information can be lost when you convert objects from BSON to JSON, but of course only when these special types are present in the BSON source. It can be a disadvantage to use both JSON and BSON in a single service.

    MessagePack is designed to be transparently converted from/to JSON.

  2. MessagePack is smaller than BSON

    MessagePack's format is less verbose than BSON's. As a result, MessagePack serializes objects into fewer bytes than BSON does (a byte-level sketch follows at the end of this section).

    For example, a simple map {"a":1, "b":2} is serialized in 7 bytes with MessagePack, while BSON uses 19 bytes.

  3. BSON supports in-place updating

    With BSON, you can modify part of stored object without re-serializing the whole of the object. Let's suppose a map {"a":1, "b":2} is stored in a file and you want to update the value of "a" from 1 to 2000.

    With MessagePack, 1 uses only 1 byte but 2000 uses 3 bytes. So "b" must be moved backward by 2 bytes, even though "b" itself is not modified.

    With BSON, both 1 and 2000 use 5 bytes. Because of this verbosity, you don't have to move "b".

  4. MessagePack has RPC

    MessagePack, Protocol Buffers, Thrift and Avro support RPC. But BSON doesn't.

These differences imply that MessagePack was originally designed for network communication while BSON is designed for storage.
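
To make points 1, 2 and 3 above concrete, here is a minimal sketch in Python. It assumes the msgpack package and the bson module bundled with PyMongo are installed; with these libraries the byte counts match the numbers quoted above.

    import json
    import bson        # the bson module bundled with PyMongo (assumed installed)
    import msgpack     # the msgpack-python package (assumed installed)
    from bson import ObjectId

    # 1. JSON compatibility: BSON-only types such as ObjectId have no JSON
    #    counterpart, so a decoded BSON document is not always JSON-serializable.
    mongo_doc = bson.decode(bson.encode({"_id": ObjectId(), "a": 1}))
    try:
        json.dumps(mongo_doc)
    except TypeError as exc:
        print("not JSON-compatible:", exc)

    # 2. Size: the same small map in both formats.
    doc = {"a": 1, "b": 2}
    print(len(msgpack.packb(doc)))   # 7 bytes:  82 a1 61 01 a1 62 02
    print(len(bson.encode(doc)))     # 19 bytes: int32 length + two int32 elements + 0x00

    # 3. In-place updates: MessagePack integers are variable-width, while a BSON
    #    int32 element always stores its value in 4 bytes, so overwriting 1 with
    #    2000 in BSON does not shift the bytes that follow it.
    print(len(msgpack.packb(1)))     # 1 byte  (positive fixint)
    print(len(msgpack.packb(2000)))  # 3 bytes (uint16 marker 0xcd + 2 bytes)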

Implementation and API design

  1. MessagePack has type-checking APIs (Java, C++ and D)

    MessagePack supports static-typing.

    Dynamic typing, as used with JSON or BSON, is useful for dynamic languages like Ruby, Python or JavaScript, but it is troublesome for static languages: you have to write boring type-checking code.

    MessagePack provides a type-checking API. It converts dynamically-typed objects into statically-typed objects. Here is a simple example (C++):

    #include <msgpack.hpp>

    class myclass {
    private:
        std::string str;
        std::vector<int> vec;
    public:
        // This macro enables this class to be serialized/deserialized
        MSGPACK_DEFINE(str, vec);
    };

    int main(void) {
        // serialize
        myclass m1 = ...;

        msgpack::sbuffer buffer;
        msgpack::pack(&buffer, m1);

        // deserialize
        msgpack::unpacked result;
        msgpack::unpack(&result, buffer.data(), buffer.size());

        // you get dynamically-typed object
        msgpack::object obj = result.get();

        // convert it to statically-typed object
        myclass m2 = obj.as<myclass>();
    }

  2. MessagePack has IDL

    Related to the type-checking API, MessagePack supports an IDL. (The specification is available from: http://wiki.msgpack.org/display/MSGPACK/Design+of+IDL)

    Protocol Buffers and Thrift require an IDL (they don't support dynamic typing) and provide more mature IDL implementations.

  3. MessagePack has streaming API (Ruby, Python, Java, C++, ...)

    MessagePack supports streaming deserializers. This feature is useful for network communication. Here is an example (Ruby):

    require 'msgpack'

    # write objects to stdout
    $stdout.write [1,2,3].to_msgpack
    $stdout.write [1,2,3].to_msgpack

    # read objects from stdin using streaming deserializer
    unpacker = MessagePack::Unpacker.new($stdin)
    # use iterator
    unpacker.each {|obj|
      p obj
    }

暗喜 2024-11-22 15:39:52

I think it's very important to mention that it depends on what your client/server environment looks like.

If you are passing bytes around multiple times without inspecting them, such as with a message queue system or when streaming log entries to disk, then you may well prefer a binary encoding to emphasize the compact size. Otherwise it's a case-by-case issue that depends on the environment.

Some environments can serialize and deserialize to/from msgpack/protobuf very quickly, others not so much. In general, the lower-level the language/environment, the better binary serialization will work. In higher-level languages (node.js, .NET, JVM) you will often see that JSON serialization is actually faster. The question then becomes: is your network overhead more or less constrained than your memory/CPU?

With regards to msgpack vs. bson vs. protocol buffers... msgpack uses the fewest bytes of the group, with protocol buffers being about the same. BSON defines broader native types than the other two, and may be a better match for your object model, but this makes it more verbose. Protocol buffers have the advantage of being designed for streaming... which makes them a more natural format for binary transfer/storage.

Personally, I would lean towards the transparency that JSON offers directly, unless there is a clear need for lighter traffic. Over HTTP with gzipped data, the difference in network overhead between the formats is even less of an issue.
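
As a rough way to check that last point against your own payloads, here is a small measurement sketch in Python (the msgpack package is assumed to be installed, and the payload is just an illustrative stand-in): it compares raw and gzipped sizes of the same object encoded as JSON and as MessagePack. The numbers depend entirely on your data, so treat it as a harness rather than a result.

    import gzip
    import json
    import msgpack  # assumed installed; any MessagePack binding would do

    # Stand-in payload; substitute a representative sample of your own traffic.
    payload = {"users": [{"id": i, "name": "user%d" % i, "active": i % 2 == 0}
                         for i in range(1000)]}

    json_bytes = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    mp_bytes = msgpack.packb(payload)

    for label, raw in (("json", json_bytes), ("msgpack", mp_bytes)):
        print("%-8s raw=%6d  gzipped=%6d" % (label, len(raw), len(gzip.compress(raw))))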

舟遥客 2024-11-22 15:39:52

Well, as the author said, MessagePack was originally designed for network communication while BSON is designed for storage.

MessagePack is compact while BSON is verbose.
MessagePack is meant to be space-efficient while BSON is designed for CRUD (time-efficient).

Most importantly, MessagePack's type system (its prefix bytes) follows Huffman encoding; here I drew a Huffman tree of MessagePack (click the link to see the image):

Huffman Tree of MessagePack
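
To illustrate the point about Huffman-style prefixes: the first byte of every MessagePack value is a variable-length prefix code, so the most common small values (small positive and negative integers, short strings, small maps and arrays) spend only a few bits on type information and keep the payload in the same byte. Below is a minimal sketch of the first-byte ranges in Python; the ranges are taken from the MessagePack format specification, while the function itself is just an illustration.

    def classify_first_byte(b: int) -> str:
        """Map the leading byte of a MessagePack value to its format family.
        Short bit-prefixes are assigned to the most common small values,
        longer ones to rarer or larger types."""
        if b <= 0x7f:        # 0xxxxxxx : positive fixint, value stored in the same byte
            return "positive fixint %d" % b
        if b <= 0x8f:        # 1000xxxx : fixmap, up to 15 entries
            return "fixmap with %d entries" % (b & 0x0f)
        if b <= 0x9f:        # 1001xxxx : fixarray, up to 15 elements
            return "fixarray with %d elements" % (b & 0x0f)
        if b <= 0xbf:        # 101xxxxx : fixstr, up to 31 bytes
            return "fixstr of %d bytes" % (b & 0x1f)
        if b >= 0xe0:        # 111xxxxx : negative fixint, -32..-1
            return "negative fixint %d" % (b - 0x100)
        # 0xc0-0xdf: a full byte of type information (nil, bool, bin, ext,
        # float, sized ints, sized str/array/map, ...)
        return "full-byte marker 0x%02x" % b

    for b in (0x01, 0x82, 0xa5, 0xcd, 0xe0):
        print("0x%02x -> %s" % (b, classify_first_byte(b)))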

再可℃爱ぅ一点好了 2024-11-22 15:39:52

A key difference not yet mentioned is that BSON contains size information in bytes for the entire document and further nested sub-documents.

document ::= int32 e_list

This has two major benefits for restricted environments (e.g. embedded) where size and performance are important.

  1. You can immediately check if the data you're going to parse represents a complete document or if you're going to need to request more at some point (be it from some connection or storage). Since this is most likely an asynchronous operation you might already send a new request before parsing.
  2. Your data might contain entire sub-documents with information that is irrelevant to you. BSON allows you to easily traverse to the next object past a sub-document by using the sub-document's size information to skip it. msgpack, on the other hand, contains the number of elements inside what's called a map (similar to BSON's sub-documents). While this is undoubtedly useful information, it doesn't help the parser: you still have to parse every single object inside the map and can't just skip it. Depending on the structure of your data, this might have a huge impact on performance (a minimal skipping sketch follows below).
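
Here is a minimal sketch of that skipping trick in Python, using only the standard library. It walks the top-level elements of a BSON document and hops over embedded documents via their int32 size prefix instead of parsing them; only a few element types are handled, and the function and key names are purely illustrative (the usage comment assumes PyMongo's bson module for producing test bytes).

    import struct

    def iter_top_level(buf: bytes):
        """Yield (key, value) for the top-level elements of a BSON document,
        skipping embedded documents (type 0x03) via their int32 size prefix."""
        total_len = struct.unpack_from("<i", buf, 0)[0]
        pos, end = 4, total_len - 1                  # last byte is the 0x00 terminator
        while pos < end:
            etype = buf[pos]
            pos += 1
            key_end = buf.index(b"\x00", pos)        # element name is a NUL-terminated cstring
            key = buf[pos:key_end].decode("utf-8")
            pos = key_end + 1
            if etype == 0x10:                        # int32
                yield key, struct.unpack_from("<i", buf, pos)[0]
                pos += 4
            elif etype == 0x02:                      # string: int32 length + bytes + 0x00
                slen = struct.unpack_from("<i", buf, pos)[0]
                yield key, buf[pos + 4:pos + 3 + slen].decode("utf-8")
                pos += 4 + slen
            elif etype == 0x03:                      # embedded document: skip it wholesale
                sub_len = struct.unpack_from("<i", buf, pos)[0]
                yield key, "<skipped %d-byte sub-document>" % sub_len
                pos += sub_len
            else:
                raise NotImplementedError("type 0x%02x not handled in this sketch" % etype)

    # Usage (bson is PyMongo's module, assumed installed):
    #   import bson
    #   raw = bson.encode({"meta": {"blob": "x" * 100, "n": 7}, "status": 1, "msg": "ok"})
    #   for key, value in iter_top_level(raw):
    #       print(key, value)    # "meta" is reported as skipped, never parsed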

墨洒年华 2024-11-22 15:39:52

I made a quick benchmark to compare the encoding and decoding speed of MessagePack vs. BSON. BSON is faster, at least if you have large binary arrays:

BSON writer: 2296 ms (243487 bytes)
BSON reader: 435 ms
MESSAGEPACK writer: 5472 ms (243510 bytes)
MESSAGEPACK reader: 1364 ms

Using C# Newtonsoft.Json and MessagePack by neuecc:

    public class TestData
    {
        public byte[] buffer;
        public bool foobar;
        public int x, y, w, h;
    }

    static void Main(string[] args)
    {
        try
        {
            int loop = 10000;

            var buffer = new TestData();
            TestData data2;
            byte[] data = null;
            int val = 0, val2 = 0, val3 = 0;

            buffer.buffer = new byte[243432];

            var sw = new Stopwatch();

            sw.Start();
            for (int i = 0; i < loop; i++)
            {
                data = SerializeBson(buffer);
                val2 = data.Length;
            }

            var rc1 = sw.ElapsedMilliseconds;

            sw.Restart();
            for (int i = 0; i < loop; i++)
            {
                data2 = DeserializeBson(data);
                val += data2.buffer[0];
            }
            var rc2 = sw.ElapsedMilliseconds;

            sw.Restart();
            for (int i = 0; i < loop; i++)
            {
                data = SerializeMP(buffer);
                val3 = data.Length;
                val += data[0];
            }

            var rc3 = sw.ElapsedMilliseconds;

            sw.Restart();
            for (int i = 0; i < loop; i++)
            {
                data2 = DeserializeMP(data);
                val += data2.buffer[0];
            }
            var rc4 = sw.ElapsedMilliseconds;

            Console.WriteLine("Results:", val);
            Console.WriteLine("BSON writer: {0} ms ({1} bytes)", rc1, val2);
            Console.WriteLine("BSON reader: {0} ms", rc2);
            Console.WriteLine("MESSAGEPACK writer: {0} ms ({1} bytes)", rc3, val3);
            Console.WriteLine("MESSAGEPACK reader: {0} ms", rc4);
        }
        catch (Exception e)
        {
            Console.WriteLine(e);
        }

        Console.ReadLine();
    }

    static private byte[] SerializeBson(TestData data)
    {
        var ms = new MemoryStream();

        using (var writer = new Newtonsoft.Json.Bson.BsonWriter(ms))
        {
            var s = new Newtonsoft.Json.JsonSerializer();
            s.Serialize(writer, data);
            return ms.ToArray();
        }
    }

    static private TestData DeserializeBson(byte[] data)
    {
        var ms = new MemoryStream(data);

        using (var reader = new Newtonsoft.Json.Bson.BsonReader(ms))
        {
            var s = new Newtonsoft.Json.JsonSerializer();
            return s.Deserialize<TestData>(reader);
        }
    }

    static private byte[] SerializeMP(TestData data)
    {
        return MessagePackSerializer.Typeless.Serialize(data);
    }

    static private TestData DeserializeMP(byte[] data)
    {
        return (TestData)MessagePackSerializer.Typeless.Deserialize(data);
    }

匿名的好友 2024-11-22 15:39:52

A quick test shows that minified JSON is deserialized faster than binary MessagePack. In the tests, Article.json is 550 kB of minified JSON and Article.mpack is a 420 kB MessagePack version of it. It may be an implementation issue, of course.

MessagePack:

//test_mp.js
var msg = require('msgpack');
var fs = require('fs');

var article = fs.readFileSync('Article.mpack');

for (var i = 0; i < 10000; i++) {
    msg.unpack(article);    
}

JSON:

// test_json.js
var msg = require('msgpack');
var fs = require('fs');

var article = fs.readFileSync('Article.json', 'utf-8');

for (var i = 0; i < 10000; i++) {
    JSON.parse(article);
}

So times are:

Anarki:Downloads oleksii$ time node test_mp.js 

real    2m45.042s
user    2m44.662s
sys     0m2.034s

Anarki:Downloads oleksii$ time node test_json.js 

real    2m15.497s
user    2m15.458s
sys     0m0.824s

So space is saved, but faster? No.

Tested versions:

Anarki:Downloads oleksii$ node --version
v0.8.12
Anarki:Downloads oleksii$ npm list msgpack
/Users/oleksii
└── [email protected]  