Serializing complex data in C++ with Boost

Posted on 2024-07-13 21:55:51

I have a set of classes whose data I want to serialize. There is a lot of it, though: we're talking about a std::map with up to a million or more class instances.

Not wanting to optimize too early, I first tried a simple, clean XML implementation and used tinyXML to write the data out as XML, but it was far too slow. So I've started looking at using Boost.Serialization to write and read plain ASCII or binary files.

It seems much better suited to the task, as I don't have to allocate all this memory as overhead before I get started.

My question is essentially how to plan an optimal serialization strategy for a file format. I don't particularly want to serialize the whole map if it's not necessary, as it's really only the contents I'm after. Having played around with serialization a little (and looked at the output), I don't understand how loading the data back in could know when it has reached the end of the map, for example, if I simply save out all the items one after another. What issues do you need to consider when planning a serialization strategy?

Thanks.

Comments (5)

时光瘦了 2024-07-20 21:55:51

Read this FAQ! Does that help to get started?

夏雨凉 2024-07-20 21:55:51

Boost.Serialization has many advantages. For instance, as you say, simply adding a method with a specified signature lets the framework serialize and deserialize your data. It also includes serializers and readers for the standard STL containers, so you don't have to worry about whether all keys have been stored (they will be) or how to detect the last entry in the map when deserializing (it is detected automatically).
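
To make the "end of the map" point concrete: one common way for a loader to know where a container ends is to write the element count before the elements, and as far as I know that is also what Boost.Serialization's container support does internally. The following is a stdlib-only sketch of that idea, not Boost's actual code or file format; `save_map` and `load_map` are hypothetical names, and the key/value types are assumed for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: write the element count first, then each pair.
// Each string value is length-prefixed so it may contain spaces.
void save_map(std::ostream& out, const std::map<int, std::string>& m) {
    out << m.size() << '\n';
    for (const auto& kv : m) {
        out << kv.first << ' ' << kv.second.size() << ' ' << kv.second << '\n';
    }
}

// Hypothetical helper: read the count, then exactly that many pairs.
// The loader stops because of the count, not because the file ended.
std::map<int, std::string> load_map(std::istream& in) {
    std::map<int, std::string> m;
    std::size_t count = 0;
    in >> count;
    for (std::size_t i = 0; i < count && in; ++i) {
        int key = 0;
        std::size_t len = 0;
        in >> key >> len;
        in.get();  // skip the single space before the value bytes
        std::string value(len, '\0');
        in.read(&value[0], static_cast<std::streamsize>(len));
        // Keys were written in sorted order, so hinting at end() is cheap.
        if (in) m.emplace_hint(m.end(), key, std::move(value));
    }
    return m;
}
```

Because the count comes first, other data can follow the map in the same stream without confusing the loader.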

There are, however, some things to consider. For example, if your class has a field that is computed, or that exists only to speed things up, such as an index or a hash table, you don't have to store it, but you do have to reconstruct such structures from the data read from disk.
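
A small sketch of that trade-off, using only the standard library: the class below stores just its primary data and rebuilds its lookup index while loading. `NameTable` and its members are hypothetical names, and the format assumes names contain no newlines.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical class: names_ is the real data, index_ is derived from it.
class NameTable {
public:
    void add(const std::string& name) {
        index_[name] = names_.size();
        names_.push_back(name);
    }
    // Returns the position of `name`, or -1 if it is absent.
    long find(const std::string& name) const {
        auto it = index_.find(name);
        return it == index_.end() ? -1 : static_cast<long>(it->second);
    }
    std::size_t size() const { return names_.size(); }

    // Only the primary data is written; the index is never stored.
    void save(std::ostream& out) const {
        out << names_.size() << '\n';
        for (const auto& n : names_) out << n << '\n';
    }
    // Loading reads the primary data and rebuilds the index through add().
    void load(std::istream& in) {
        names_.clear();
        index_.clear();
        std::size_t count = 0;
        in >> count;
        in.ignore();  // drop the newline after the count
        std::string line;
        for (std::size_t i = 0; i < count && std::getline(in, line); ++i) {
            add(line);
        }
    }

private:
    std::vector<std::string> names_;
    std::unordered_map<std::string, std::size_t> index_;
};
```

The file stays smaller and the index can change shape later without invalidating old files, at the cost of some rebuild time on load.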

As for the "file format" you mention, I think we sometimes focus on the format rather than on the data. The exact format of the file doesn't matter as long as you can retrieve the data seamlessly using (say) Boost.Serialization. If you want to share the file with other utilities that don't use serialization, that's another matter. But purely for (de)serialization, you don't have to care about the internal file format.

浪荡不羁 2024-07-20 21:55:51

I don't particularly want to serialize the whole map if it's not necessary, as it's really only the contents I'm after.

Does that mean you don't really need to serialize the whole object? Maybe you should reconsider just using a text-based format. If you really need to serialize only a subset of the key/value pairs in a map, then you should probably just write them to a text file and read them in later. You don't necessarily need XML; one line per map key followed by one line with the value should work.
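
A minimal sketch of that line-pair layout, assuming string keys and values that contain no newlines; `write_pairs` and `read_pairs` are hypothetical names. Here the end of the data is simply the end of the file.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Write each entry as two lines: the key, then the value.
void write_pairs(std::ostream& out,
                 const std::map<std::string, std::string>& m) {
    for (const auto& kv : m) {
        out << kv.first << '\n' << kv.second << '\n';
    }
}

// Read key/value line pairs until the input runs out.
std::map<std::string, std::string> read_pairs(std::istream& in) {
    std::map<std::string, std::string> m;
    std::string key, value;
    while (std::getline(in, key) && std::getline(in, value)) {
        m[key] = value;
    }
    return m;
}
```

To save only a subset, the caller can pass a map holding just the entries of interest; the reader does not need to know how many there are.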

薄暮涼年 2024-07-20 21:55:51

If all you want is key/value pairs, then the important thing is the types of the keys and values; this will colour how you deal with things.

Serialising the map itself would be a poor plan in general, since you may wish to change your associative container type later without invalidating (or having to translate) previously serialised files.

Serialising the container can be useful in certain circumstances if you wish to avoid the cost of rebuilding it (though pre-sizing the container is normally sufficient to avoid the vast majority of this overhead), but this should be a decision based on specific aspects of your application and usage.

If you supply the types of the keys and values, we can help more. Without this, here are some general tips:

  • If they are amenable to string representation, then a simple CSV file may be sufficient (but use an existing reader/writer library for it; reading and writing legitimate CSV is harder than it looks)
  • If they are fixed width, then a simple binary format will make reading and writing very easy (and quick), but take care over the following issues:
    • endianness
    • whether you wish to allow simple catting of such files together, or to add CRC-like values for integrity (you can do both, but it's harder)
    • you lose the ability to grep the files (this is a real loss; you may end up having to reinvent parts of your toolchain for this)
    • whether changing platform/compiler/size_t will break the format
  • Some structured textual format that is lighter than XML. There are several (JSON, YAML, etc.). These will provide extensibility you quite likely don't require.
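
For the fixed-width binary option, here is a rough sketch of one way to sidestep the endianness and size_t concerns: use fixed-size integer types and write each field byte by byte in a chosen order (little-endian here). The `Record` layout and the function names are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical record: a 32-bit key and a 64-bit value, 12 bytes on disk.
struct Record {
    std::uint32_t key;
    std::uint64_t value;
};

// Write `v` as `bytes` little-endian bytes, independent of host byte order.
void put_le(std::ostream& out, std::uint64_t v, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        out.put(static_cast<char>((v >> (8 * i)) & 0xFF));
    }
}

// Read `bytes` little-endian bytes; on a short read the stream fails.
std::uint64_t get_le(std::istream& in, int bytes) {
    std::uint64_t v = 0;
    for (int i = 0; i < bytes; ++i) {
        const int c = in.get();
        if (c == std::istream::traits_type::eof()) return 0;
        v |= static_cast<std::uint64_t>(static_cast<unsigned char>(c)) << (8 * i);
    }
    return v;
}

void write_record(std::ostream& out, const Record& r) {
    put_le(out, r.key, 4);
    put_le(out, r.value, 8);
}

// Returns false once the input is exhausted or truncated.
bool read_record(std::istream& in, Record& r) {
    r.key = static_cast<std::uint32_t>(get_le(in, 4));
    r.value = get_le(in, 8);
    return static_cast<bool>(in);
}
```

Since there is no header or count, files written this way can be concatenated and still read back record by record; adding a CRC per file would break that property unless the reader is taught about it.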

享受孤独 2024-07-20 21:55:51

Use Google's Protocol Buffers, which is a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.

There are bindings for C++, Java, Python, Perl, C#, and Ruby.

You describe your data in metadata .proto files:

syntax = "proto2";  // required/optional as used here are proto2 features

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}

Then you would use it in C++ like this:

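// Assumes person.pb.h was generated by protoc from the .proto file above.
#include <fstream>
#include "person.pb.h"
using namespace std;

// Fill in a message and write it to disk in binary form.
Person person;
person.set_id(123);
person.set_name("Bob");
person.set_email("bob@example.com");

fstream out("person.pb", ios::out | ios::binary | ios::trunc);
person.SerializeToOstream(&out);
out.close();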

Or like this:

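// Needs <cstdlib>, <fstream>, <iostream> and the generated person.pb.h.
// Read the message back from disk and print its fields.
Person person;
fstream in("person.pb", ios::in | ios::binary);
if (!person.ParseFromIstream(&in)) {
  cerr << "Failed to parse person.pb." << endl;
  exit(1);
}

cout << "ID: " << person.id() << endl;
cout << "name: " << person.name() << endl;
if (person.has_email()) {  // email is optional, so check before printing
  cout << "e-mail: " << person.email() << endl;
}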

For a more complete example, see the tutorials.
