C++: Serialization of complex data using Boost
I have a set of classes whose data I want to serialize. There is a lot of data, though (we're talking about a std::map with a million or more class instances).
Not wishing to optimize my code too early, I thought I'd try a simple and clean XML implementation, so I used TinyXML to save the data out to XML, but it was far too slow. So I've started looking at using Boost.Serialization to write and read standard ASCII or binary files.
It seems much better suited to the task, as I don't have to allocate all this memory as overhead before I get started.
My question is essentially how to go about planning an optimal serialization strategy for a file format. I don't particularly want to serialize the whole map if it's not necessary, as it's really only the contents I'm after. Having played around with serialization a little (and looked at the output), I don't understand how loading the data back in could know when it has reached the end of the map, for example, if I simply save out all the items one after another. What issues do you need to consider when planning a serialization strategy?
Thanks.
Comments (5)
Read this FAQ! Does that help to get started?
There are many advantages to Boost.Serialization. For instance, as you say, just including a method with a specified signature allows the framework to serialize and deserialize your data. Also, Boost.Serialization includes serializers and readers for all the standard STL containers, so you don't have to worry about whether all keys have been stored (they will be) or how to detect the last entry in the map when deserializing (it will be detected automatically).
There are, however, some considerations to make. For example, if your class has a field that is calculated, or that exists only to speed things up, such as an index or a hash table, you don't have to store it, but you do have to reconstruct these structures from the data read from disk.
As for the "file format" you mention, I think we sometimes focus on the format rather than on the data. I mean, the exact format of the file doesn't matter as long as you are able to retrieve the data seamlessly using (say) Boost.Serialization. If you want to share the file with other utilities that don't use Boost.Serialization, that's another matter. But just for the purposes of (de)serialization, you don't have to care about the internal file format.
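To make the "detected automatically" part less mysterious: as far as I know, the usual trick (and the one Boost's container support relies on) is to write the number of elements before the elements themselves. Below is a hand-rolled sketch of that idea using only the standard library. It is not Boost's actual archive format, and `save_map` / `load_map` are names made up for this illustration.

```cpp
#include <cstddef>
#include <iomanip>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: write the element count first, then one record per entry.
void save_map(std::ostream& out, const std::map<int, std::string>& m) {
    out << m.size() << '\n';
    for (const auto& kv : m) {
        // std::quoted escapes spaces and quotes so the value round-trips.
        out << kv.first << ' ' << std::quoted(kv.second) << '\n';
    }
}

// Hypothetical helper: read the count, then exactly that many records.
// The loader never needs an end marker; it stops after `count` entries.
std::map<int, std::string> load_map(std::istream& in) {
    std::size_t count = 0;
    if (!(in >> count)) {
        throw std::runtime_error("missing element count");
    }
    std::map<int, std::string> m;
    for (std::size_t i = 0; i < count; ++i) {
        int key = 0;
        std::string value;
        if (!(in >> key >> std::quoted(value))) {
            throw std::runtime_error("truncated record");
        }
        // Keys were written in sorted order, so hinting at end() is cheap.
        m.emplace_hint(m.end(), key, std::move(value));
    }
    return m;
}
```

Because the loader reads exactly `count` records, anything written after the map (another map, other members) can follow in the same stream, e.g. two `save_map` calls into one `std::stringstream` followed by two `load_map` calls.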
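A small sketch of that point, again with plain streams rather than Boost so it stays self-contained. The `Roster` class and its members are hypothetical; the assumption is that names contain no newlines. Only the primary data is written, and the lookup index is rebuilt while loading.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical class: names_ is the real data, index_ is derived from it.
class Roster {
public:
    void add(const std::string& name) {
        index_[name] = names_.size();
        names_.push_back(name);
    }

    // Returns the position of `name`, or -1 if it is not present.
    long find(const std::string& name) const {
        auto it = index_.find(name);
        return it == index_.end() ? -1 : static_cast<long>(it->second);
    }

    // Only the primary data is written; the index is never stored.
    void save(std::ostream& out) const {
        out << names_.size() << '\n';
        for (const auto& n : names_) {
            out << n << '\n';
        }
    }

    // Loading reads the primary data and rebuilds the index via add().
    void load(std::istream& in) {
        names_.clear();
        index_.clear();
        std::size_t count = 0;
        in >> count;
        in.ignore();  // skip the newline that follows the count
        std::string line;
        for (std::size_t i = 0; i < count && std::getline(in, line); ++i) {
            add(line);
        }
    }

private:
    std::vector<std::string> names_;
    std::unordered_map<std::string, std::size_t> index_;  // derived, not saved
};
```

The file stays smaller and cannot go out of sync with the index, at the price of some rebuild time on load.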
"I don't particularly want to serialize the whole map if it's not necessary, as it's really only the contents I'm after."
Does that mean you don't really need to serialize the whole object? Maybe you should reconsider just using a text-based format. If you really need to serialize only a subset of the key/value pairs in a map, then you should probably just write them to a text file and read them in later. You don't necessarily need XML; just one line per map key followed by one line with the value should work.
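For what it's worth, the line-per-key, line-per-value idea could look roughly like this. It is a minimal sketch that assumes string keys and values containing no newline characters; `write_pairs` and `read_pairs` are illustrative names.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: one line for the key, the next line for its value.
void write_pairs(std::ostream& out,
                 const std::map<std::string, std::string>& m) {
    for (const auto& kv : m) {
        out << kv.first << '\n' << kv.second << '\n';
    }
}

// Hypothetical helper: read lines two at a time until the input runs out.
// There is no count or end marker; end of file is the end of the data.
std::map<std::string, std::string> read_pairs(std::istream& in) {
    std::map<std::string, std::string> m;
    std::string key;
    std::string value;
    while (std::getline(in, key) && std::getline(in, value)) {
        m[key] = value;
    }
    return m;
}
```

To save only a subset, filter the entries before calling `write_pairs`; the reader does not care how many pairs there are.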
If all you want is key/value pairs, then the important thing is the types of the keys and values; this will colour how you deal with things.
Serialising the map itself would be a poor plan in general, since you may wish to change your associative container type later without invalidating (or having to translate) previously serialised files.
Serialising the container can be useful in certain circumstances if you wish to avoid the cost of rebuilding the container (although pre-sizing the container is normally sufficient to avoid the vast majority of this overhead), but this should be a decision based on specific aspects of your application and usage.
If you supply the types of the keys/values we can help more. Without this, here are some general tips:
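As a rough illustration of both points, the sketch below writes only the contents (a count followed by flat key/value records) and loads them into a different, pre-sized container. Everything here is an assumption made for the example: the key and value types (`std::int32_t` and `double`), the record layout, and the names `save_records` / `load_records`. The raw-byte format is only readable on a machine with the same endianness and floating-point representation.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <unordered_map>

// Hypothetical layout: a 64-bit count, then (int32 key, double value) records.
// Works for any container of pairs, so the file does not depend on its type.
template <typename Container>
void save_records(std::ostream& out, const Container& c) {
    const std::uint64_t count = c.size();
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    for (const auto& kv : c) {
        const std::int32_t key = kv.first;
        const double value = kv.second;
        out.write(reinterpret_cast<const char*>(&key), sizeof key);
        out.write(reinterpret_cast<const char*>(&value), sizeof value);
    }
}

// Loads the same records into a hash map, reserving space up front.
// Note: the count is trusted here; a real loader should sanity-check it.
std::unordered_map<std::int32_t, double> load_records(std::istream& in) {
    std::uint64_t count = 0;
    if (!in.read(reinterpret_cast<char*>(&count), sizeof count)) {
        throw std::runtime_error("missing record count");
    }
    std::unordered_map<std::int32_t, double> m;
    m.reserve(static_cast<std::size_t>(count));  // pre-size to avoid rehashing
    for (std::uint64_t i = 0; i < count; ++i) {
        std::int32_t key = 0;
        double value = 0.0;
        if (!in.read(reinterpret_cast<char*>(&key), sizeof key) ||
            !in.read(reinterpret_cast<char*>(&value), sizeof value)) {
            throw std::runtime_error("truncated record");
        }
        m.emplace(key, value);
    }
    return m;
}
```

Data saved from a `std::map` loads into a `std::unordered_map` here without any change to the file, which is the flexibility you lose if the container itself is what gets serialised.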
Use Google's Protocol Buffers which is a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.
There are bindings for C++, Java, Python, Perl, C#, and Ruby.
You describe your data in metadata .proto files
Then you would use it in C++ like this:
Or like this:
For a more complete example, see the tutorials.