How to efficiently read binary data from a file with a complicated structure in C++
I am writing a piece of code to read in several GB of data that spans multiple files using C++ IOStreams, which I've chosen over the C API for a number of design reasons that I won't bore you with. Since the data is produced by a separate program on the same machine where my code will run, I am confident that issues such as those relating to endianess can, for the most part, be ignored.
The files have a reasonably complicated structure. For example, there is a header that describes the number of records of a particular binary configuration. Later in the file, I must make the code conditionally read that number of lines. This sort of pattern is repeated in a complicated, but well-documented way.
My question is related to how to do this efficiently - I'm sure my process is going to be I/O-bound, so my instinct is that rather than reading the data in smallish blocks, with an approach such as the following
std::vector<int> buffer(500);
file.read(reinterpret_cast<char*>(buffer.data()), 500 * sizeof(int));
I should read in one whole file at a time and try to process it in memory. So my interrelated questions are:
- Given that this would seem to mean reading into a char* buffer or a std::vector, how would you best go about converting that array into the data format required to correctly represent the file structure?
- Are my assumptions incorrect?
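For concreteness, the read-the-whole-file approach might be sketched as follows (the function name and error handling are my own additions, not part of any particular API):

```cpp
#include <cassert>   // for the demo assertions
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Slurp an entire binary file into memory with a single read() call.
std::vector<char> read_whole_file(const std::string& path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file)
        throw std::runtime_error("cannot open " + path);
    const std::streamsize size = file.tellg();   // opened at end, so this is the size
    file.seekg(0, std::ios::beg);

    std::vector<char> buffer(static_cast<std::size_t>(size));
    if (size > 0 && !file.read(buffer.data(), size))
        throw std::runtime_error("short read on " + path);
    return buffer;
}
```

The parsing code then walks the returned buffer in memory; a few large reads are usually much cheaper than many small ones because the OS read-ahead gets to work with sequential access.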
I know the obvious answer is to try and then to profile later, and profile I certainly will. But this question is more about how to pick the right approach at the beginning - a sort of "pick the right algorithm" optimisation, rather than the sort of optimisations that I could envisage doing after identifying bottlenecks later on!
I'll be interested in the answers offered up - I tend to only be able to find answers for relatively simple binary files, for which the approach above is suitable. My problem is that the bulk of the binary data is structured conditionally on the numbers in the header to the file (even the header is formatted this way!) so I need to be able to process the file a little more carefully.
Thanks in advance.
EDIT: Some comments coming through about memory mapping - looks good, but not sure how to do it and all I've read tells me it isn't portable. I'm interested in trying an mmap, but also in more portable solutions (if any!)
Use a 64-bit OS and memory map the file. If you need to support a 32-bit OS as well, use a compatibility layer that maps chunks of the file as needed.
Alternatively, if you always need the objects in file order, just write a sane parser to handle the objects in chunks. Like this:
1) Read in 512KB of file.
2) Extract as many objects as possible from the data we read.
3) Read in as many bytes as needed to fill the buffer back up to 512KB. If we read no bytes at all, stop.
4) Go to step 2.
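The loop above might be sketched like this (the fixed record size and the handler callback are assumptions for illustration; real records would follow the file's documented format):

```cpp
#include <cassert>   // for the demo assertions
#include <cstddef>
#include <cstring>
#include <fstream>
#include <istream>
#include <vector>

constexpr std::size_t kBufSize = 512 * 1024;     // 512KB working buffer

// Pull fixed-size records out of a stream without ever holding more
// than kBufSize bytes in memory.
template <typename Handler>
void parse_in_chunks(std::istream& in, std::size_t record_size, Handler handle)
{
    std::vector<char> buf;
    std::size_t have = 0;                        // valid bytes currently buffered
    for (;;) {
        // Steps 1/3: top the buffer back up to kBufSize.
        buf.resize(kBufSize);
        in.read(buf.data() + have, static_cast<std::streamsize>(kBufSize - have));
        have += static_cast<std::size_t>(in.gcount());
        if (have == 0)
            break;                               // read no bytes at all: stop
        buf.resize(have);

        // Step 2: extract as many whole objects as possible.
        std::size_t pos = 0;
        while (have - pos >= record_size) {
            handle(buf.data() + pos, record_size);
            pos += record_size;
        }

        // Carry any partial record to the front, then loop (step 4).
        std::memmove(buf.data(), buf.data() + pos, have - pos);
        have -= pos;
        if (in.eof() && have < record_size)
            break;                               // nothing but a trailing fragment left
    }
}
```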
You could mmap some file segments (or the entire file, at least on a 64-bit machine). Perhaps use madvise and (in a separate thread) readahead.
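A POSIX-only sketch of this (hence not portable; Windows would need CreateFileMapping/MapViewOfFile instead - the function name here is my own):

```cpp
#include <cassert>     // for the demo assertions
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an entire file read-only. Returns nullptr on failure.
const char* map_file(const char* path, std::size_t& size_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return nullptr; }
    size_out = static_cast<std::size_t>(st.st_size);

    void* p = mmap(nullptr, size_out, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                     // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;

    // Hint that we will scan front to back so the kernel reads ahead.
    madvise(p, size_out, MADV_SEQUENTIAL);
    return static_cast<const char*>(p);
}
```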
I guess you already have enough to start off; memory mapping is certainly a neat idea as long as you have enough RAM. Otherwise, read in big chunks.
Once the data is available in memory (the whole file or a big chunk), the simplest way to read it is to reinterpret_cast the pointer to a pointer to an "appropriate struct" type, or to an array of appropriate structs. You can use #pragmas to ensure the packing size/order etc. if needed, but again this would be OS/compiler dependent.
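As a sketch, with a made-up record layout: #pragma pack(push, 1) is accepted by GCC, Clang and MSVC but is still non-standard. Note also that reinterpret_cast of arbitrary file bytes is technically undefined behaviour in standard C++; memcpy into the struct is the well-defined equivalent and compiles to the same code on mainstream compilers.

```cpp
#include <cassert>   // for the demo assertions
#include <cstdint>
#include <cstring>
#include <vector>

// A hypothetical record layout; pack(1) removes padding so the struct
// matches the on-disk bytes exactly (compiler-dependent, not standard).
#pragma pack(push, 1)
struct Record {
    std::uint32_t id;
    std::uint16_t flags;
    double        value;
};
#pragma pack(pop)
static_assert(sizeof(Record) == 14, "Record must match the file layout");

// Copy one record out of a raw buffer - the well-defined alternative
// to reinterpret_cast-ing the buffer pointer directly.
Record record_at(const std::vector<char>& buf, std::size_t index)
{
    Record r;
    std::memcpy(&r, buf.data() + index * sizeof(Record), sizeof(Record));
    return r;
}
```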
Well, OK, the header is of variable length, but you have to start somewhere. If you have to read in the whole file first, it can get a bit messy. The whole file can be represented by a struct containing the header up until some length descriptor, followed by a byte array - you can start there. Once you have the header length, you can set a pointer/length to an array of header entries and iterate them, then set a pointer/length for an array of file-content structs, and so on.
All the various arrays of structs would probably need to be packed?
Nasty. I don't really like my own design:(
Anyone got a better idea, other than rewriting the 'separate program' to use a database or XML or something?
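One way to keep that pointer/length walking manageable is a small bounds-checked cursor over the in-memory buffer; the field types below are purely illustrative, not the asker's actual format:

```cpp
#include <cassert>   // for the demo assertions
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// A tiny cursor over an in-memory buffer: read a count, then read that
// many items - the conditional pattern described in the question.
class Cursor {
public:
    Cursor(const char* data, std::size_t size) : p_(data), end_(data + size) {}

    template <typename T>
    T read()                              // copy out one value and advance
    {
        if (static_cast<std::size_t>(end_ - p_) < sizeof(T))
            throw std::runtime_error("truncated file");
        T v;
        std::memcpy(&v, p_, sizeof(T));
        p_ += sizeof(T);
        return v;
    }

private:
    const char* p_;
    const char* end_;
};

// Hypothetical layout: a uint32 count followed by that many int32 records.
std::vector<std::int32_t> read_records(const char* data, std::size_t size)
{
    Cursor c(data, size);
    const auto count = c.read<std::uint32_t>();
    std::vector<std::int32_t> out;
    out.reserve(count);
    for (std::uint32_t i = 0; i < count; ++i)
        out.push_back(c.read<std::int32_t>());
    return out;
}
```

The same cursor composes for nested structures: read a header count, loop, and inside the loop read each entry's own counts and payloads.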