Wasted bytes in a protocol buffer array?

Posted 2024-10-20 07:20:57

I have a protocol buffer setup like this:

using ProtoBuf;

[ProtoContract]
class Foo
{
    // Field number 1: each Bar is written as a repeated sub-message
    [ProtoMember(1)]
    Bar[] Bars;
}

A single Bar gets encoded to a 67-byte protocol buffer. This sounds about right because I know that a Bar is pretty much just a 64-byte array, plus 3 bytes of overhead for length prefixing.

However, when I encode a Foo with an array of 20 Bars it takes 1362 bytes. 20 * 67 is 1340, so there are 22 bytes of overhead just for encoding an array!

Why does this take up so much space? And is there anything I can do to reduce it?

Comments (2)

单身情人 2024-10-27 07:20:57

This overhead is quite simply the information it needs to know where each of the 20 objects starts and ends. There is nothing I can do differently here without breaking the format (i.e. doing something contrary to the spec).

If you really want the gory details:

An array or list is (if we exclude "packed", which doesn't apply here) simply a repeated block of sub-messages. There are two layouts available for sub-messages: strings (length-prefixed) and groups. With a string, the layout is:

[header][length][data]

where header is the varint-encoded mash of the wire-type and field-number (hex 0A in this case, for field 1 with the length-delimited wire-type 2), length is the varint-encoded size of data, and data is the sub-object itself. For small objects (data less than 128 bytes) this often means 2 bytes of overhead per object, depending on (a) the field number (fields above 15 take more space) and (b) the size of the data.
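To make the length-prefixed layout concrete, here is a minimal Python sketch of the framing. It is not protobuf-net's implementation; the helper names (encode_varint, make_tag, frame_length_prefixed) are made up for illustration, and the 67-byte payload is just a stand-in for one encoded Bar.

```python
LENGTH_DELIMITED = 2  # wire type used for strings, bytes and sub-messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def make_tag(field_number: int, wire_type: int) -> bytes:
    """Header: the field number shifted left by 3, OR-ed with the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def frame_length_prefixed(field_number: int, payload: bytes) -> bytes:
    """Lay out one sub-message as [header][length][data]."""
    header = make_tag(field_number, LENGTH_DELIMITED)
    return header + encode_varint(len(payload)) + payload


payload = bytes(67)  # hypothetical stand-in for one encoded Bar
framed = frame_length_prefixed(1, payload)
print(framed[:2].hex())            # header byte followed by length byte
print(len(framed) - len(payload))  # framing overhead in bytes
```

With field 1 and a payload under 128 bytes, the header and the length each fit in one byte, which is where the 2 bytes per object comes from.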

With a group, the layout is:

[header][data][footer]

where header is the varint-encoded mash of the wire-type and field-number (hex 0B in this case with field 1), data is the sub-object, and footer is another varint mash to indicate the end of the object (hex 0C in this case with field 1).
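The group layout can be sketched the same way. Again this is only an illustration in Python with hypothetical helper names (encode_varint, make_tag, frame_group), not code taken from any protobuf library.

```python
START_GROUP = 3  # wire type that opens a group
END_GROUP = 4    # wire type that closes a group


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def make_tag(field_number: int, wire_type: int) -> bytes:
    """Header or footer: field number shifted left by 3, OR-ed with the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def frame_group(field_number: int, payload: bytes) -> bytes:
    """Lay out one sub-message as [header][data][footer]; no length is stored."""
    header = make_tag(field_number, START_GROUP)
    footer = make_tag(field_number, END_GROUP)
    return header + payload + footer


framed = frame_group(1, b"abc")  # "abc" is a placeholder payload
print(framed.hex())
```

Because no length is written, the reader has to parse the payload until it meets the matching end-group marker.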

Groups are less favored generally, but they have the advantage that they don't incur any overhead as data grows in size. For small field-numbers (less than 16) again the overhead is 2 bytes per object. Of course, you pay double for large field-numbers, instead.
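The trade-off between the two layouts can be tabulated with a few lines of Python. This is a rough sketch under the framing rules described above; the function names are hypothetical and the field numbers and payload sizes are arbitrary examples.

```python
def varint_size(value: int) -> int:
    """Number of bytes a non-negative integer occupies as a varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def tag_size(field_number: int) -> int:
    """Size of a header or footer; the 3 wire-type bits never change the byte count."""
    return varint_size(field_number << 3)


def length_prefixed_overhead(field_number: int, payload_len: int) -> int:
    """[header][length][data]: one tag plus the varint-encoded length."""
    return tag_size(field_number) + varint_size(payload_len)


def group_overhead(field_number: int) -> int:
    """[header][data][footer]: two tags, independent of the payload size."""
    return 2 * tag_size(field_number)


for field_number in (1, 15, 16):
    for payload_len in (67, 127, 128, 20000):
        print(
            field_number,
            payload_len,
            length_prefixed_overhead(field_number, payload_len),
            group_overhead(field_number),
        )
```

The length-prefixed cost grows by a byte each time the payload size crosses a varint boundary (128, 16384, ...), while the group cost only depends on the field number.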

晨光如昨 2024-10-27 07:20:57

By default, arrays aren't actually passed as arrays, but as repeated members, which have a little more overhead.

So I'd guess you actually have 1 byte of overhead for each repeated array element, plus 2 extra bytes overhead on top.

For repeated primitive values you can lose the overhead by using a "packed" array (packed encoding does not apply to sub-messages such as Bar). protobuf-net supports this: http://code.google.com/p/protobuf-net/

The documentation for the binary format is here: http://code.google.com/apis/protocolbuffers/docs/encoding.html
