Google Protocol Buffers serialization hangs when writing 1 GB+ of data
I am serializing a large data set using Protocol Buffers. When my data set contains 400,000 custom objects with a combined size of around 1 GB, serialization returns in 3 to 4 seconds. But when my data set contains 450,000 objects with a combined size of around 1.2 GB, the serialization call never returns and the CPU is constantly busy.
I am using the .NET port of Protocol Buffers.
2 Answers
Looking at the new comments, this appears to be (as the OP notes) `MemoryStream` capacity limited. A slight annoyance in the protobuf spec is that, since sub-message lengths are variable and must prefix the sub-message, it is often necessary to buffer portions until the length is known. This is fine for most reasonable graphs, but if there is an exceptionally large graph (except for the "root object has millions of direct children" scenario, which doesn't suffer) it can end up doing quite a bit in memory.

If you aren't tied to a particular layout (perhaps due to .proto interop with an existing client), then a simple fix is as follows: on child (sub-object) properties (including lists/arrays of sub-objects), tell it to use "group" serialization. This is not the default layout, but it says "instead of using a length prefix, use a start/end pair of tokens". The downside of this is that if your deserialization code doesn't know about a particular object, it takes longer to skip the field, as it can't just say "seek forwards 231413 bytes"; instead, it has to walk the tokens to know when the object is finished. In most cases this isn't an issue at all, since your deserialization code fully expects that data.
To do this:
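The code sample that presumably followed here was lost in the page scrape. A minimal sketch of what "group" serialization looks like in protobuf-net (the `Order`/`OrderLine` types are illustrative, not from the question):

```csharp
using ProtoBuf;

[ProtoContract]
public class Order
{
    [ProtoMember(1)]
    public string Reference { get; set; }

    // DataFormat.Group tells protobuf-net to frame this sub-object
    // with start/end tokens instead of a length prefix, so it does
    // not need to buffer the child in memory to compute its length.
    [ProtoMember(2, DataFormat = DataFormat.Group)]
    public System.Collections.Generic.List<OrderLine> Lines { get; set; }
}

[ProtoContract]
public class OrderLine
{
    [ProtoMember(1)]
    public int Quantity { get; set; }
}
```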
The deserialization in protobuf-net is very forgiving (strict mode is optional and off by default), and it will happily deserialize groups in place of length prefixes, and length prefixes in place of groups (meaning: any data you have already stored somewhere should work fine).
1.2 GB of memory is dangerously close to the managed memory limit for 32-bit .NET processes. My guess is the serialization triggers an `OutOfMemoryException` and all hell breaks loose. You should try several smaller serializations rather than one gigantic one, or move to a 64-bit process.
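One way to split a giant serialization into several smaller ones is protobuf-net's length-prefix helpers, which write each item as its own small message and read them back lazily. A sketch under the assumption that the data is a list of some `[ProtoContract]` type (here called `Record`, a placeholder):

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

// Write: one small length-prefixed message per item, so only a
// single item's worth of data is ever buffered at a time.
static void WriteAll(Stream stream, IEnumerable<Record> items)
{
    foreach (var item in items)
        Serializer.SerializeWithLengthPrefix(stream, item, PrefixStyle.Base128, 1);
}

// Read: stream the items back one at a time instead of
// materializing the whole 1 GB+ payload in memory.
static IEnumerable<Record> ReadAll(Stream stream)
{
    return Serializer.DeserializeItems<Record>(stream, PrefixStyle.Base128, 1);
}
```

The same `PrefixStyle` and field number must be used on both sides; `DeserializeItems` yields items lazily as the stream is consumed.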
Cheers,
Florian