How to efficiently read binary data from a file with a complicated structure in C++
I am writing a piece of code to read in several GB of data that spans multiple files using C++ IOStreams, which I've chosen over the C API for a number of design reasons that I won't bore you with. Since the data is produced by a separate program on the same machine where my code will run, I am confident that issues such as those relating to endianess can, for the most part, be ignored.
The files have a reasonably complicated structure. For example, there is a header that describes the number of records of a particular binary configuration. Later in the file, I must make the code conditionally read that number of lines. This sort of pattern is repeated in a complicated, but well-documented way.
My question is related to how to do this efficiently - I'm sure my process is going to be I/O-bound, so my instinct is that rather than reading the data in smallish blocks, with an approach such as the following
std::vector<int> buffer(500);
file.read(reinterpret_cast<char*>(buffer.data()), 500 * sizeof(int));
I should read in one whole file at a time and try to process it in memory. So my interrelated questions are:
- Given that this would seem to mean reading into a char* buffer or a std::vector, how would you best go about converting that array into the data format required to correctly represent the file structure?
- Are my assumptions incorrect?
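For concreteness, the read-the-whole-file approach might be sketched as follows (the function name and error handling are my own additions, not part of any particular API):

```cpp
#include <cassert>   // for the demo assertions
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Slurp an entire binary file into memory with a single read() call.
std::vector<char> read_whole_file(const std::string& path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file)
        throw std::runtime_error("cannot open " + path);
    const std::streamsize size = file.tellg();   // opened at end, so this is the size
    file.seekg(0, std::ios::beg);

    std::vector<char> buffer(static_cast<std::size_t>(size));
    if (size > 0 && !file.read(buffer.data(), size))
        throw std::runtime_error("short read on " + path);
    return buffer;
}
```

The parsing code then walks the returned buffer in memory; a few large reads are usually much cheaper than many small ones because the OS read-ahead gets to work with sequential access.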
I know the obvious answer is to try and then to profile later, and profile I certainly will. But this question is more about how to pick the right approach at the beginning - a sort of "pick the right algorithm" optimisation, rather than the sort of optimisations that I could envisage doing after identifying bottlenecks later on!
I'll be interested in the answers offered up - I tend to only be able to find answers for relatively simple binary files, for which the approach above is suitable. My problem is that the bulk of the binary data is structured conditionally on the numbers in the header to the file (even the header is formatted this way!) so I need to be able to process the file a little more carefully.
Thanks in advance.
EDIT: Some comments coming through about memory mapping - looks good, but not sure how to do it and all I've read tells me it isn't portable. I'm interested in trying an mmap, but also in more portable solutions (if any!)
Use a 64-bit OS and memory map the file. If you need to support a 32-bit OS as well, use a compatibility layer that maps chunks of the file as needed.
Alternatively, if you always need the objects in file order, just write a sane parser to handle the objects in chunks. Like this:
1) Read in 512KB of file.
2) Extract as many objects as possible from the data we read.
3) Read in as many bytes as needed to fill the buffer back up to 512KB. If we read no bytes at all, stop.
4) Go to step 2.
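The loop above might be sketched like this (the fixed record size and the handler callback are assumptions for illustration; real records would follow the file's documented format):

```cpp
#include <cassert>   // for the demo assertions
#include <cstddef>
#include <cstring>
#include <fstream>
#include <istream>
#include <vector>

constexpr std::size_t kBufSize = 512 * 1024;     // 512KB working buffer

// Pull fixed-size records out of a stream without ever holding more
// than kBufSize bytes in memory.
template <typename Handler>
void parse_in_chunks(std::istream& in, std::size_t record_size, Handler handle)
{
    std::vector<char> buf;
    std::size_t have = 0;                        // valid bytes currently buffered
    for (;;) {
        // Steps 1/3: top the buffer back up to kBufSize.
        buf.resize(kBufSize);
        in.read(buf.data() + have, static_cast<std::streamsize>(kBufSize - have));
        have += static_cast<std::size_t>(in.gcount());
        if (have == 0)
            break;                               // read no bytes at all: stop
        buf.resize(have);

        // Step 2: extract as many whole objects as possible.
        std::size_t pos = 0;
        while (have - pos >= record_size) {
            handle(buf.data() + pos, record_size);
            pos += record_size;
        }

        // Carry any partial record to the front, then loop (step 4).
        std::memmove(buf.data(), buf.data() + pos, have - pos);
        have -= pos;
        if (in.eof() && have < record_size)
            break;                               // nothing but a trailing fragment left
    }
}
```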
You could mmap some file segments (or the entire file, at least on a 64-bit machine). Perhaps use madvise and (in a separate thread) readahead.
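A POSIX-only sketch of this (hence not portable; Windows would need CreateFileMapping/MapViewOfFile instead - the function name here is my own):

```cpp
#include <cassert>     // for the demo assertions
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an entire file read-only. Returns nullptr on failure.
const char* map_file(const char* path, std::size_t& size_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return nullptr; }
    size_out = static_cast<std::size_t>(st.st_size);

    void* p = mmap(nullptr, size_out, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                     // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;

    // Hint that we will scan front to back so the kernel reads ahead.
    madvise(p, size_out, MADV_SEQUENTIAL);
    return static_cast<const char*>(p);
}
```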
I guess you already have enough to start off; memory mapping is certainly a neat idea as long as you have enough RAM. Otherwise, read in big chunks.
Once the data is available in memory (the whole file or a big chunk), the simplest way to read it is to reinterpret_cast the pointer to a pointer to an "appropriate struct" type, or to an array of appropriate structs. You can use #pragmas to ensure the packing size/order etc. if needed, but again this would be OS/compiler dependent.
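As a sketch, with a made-up record layout: #pragma pack(push, 1) is accepted by GCC, Clang and MSVC but is still non-standard. Note also that reinterpret_cast of arbitrary file bytes is technically undefined behaviour in standard C++; memcpy into the struct is the well-defined equivalent and compiles to the same code on mainstream compilers.

```cpp
#include <cassert>   // for the demo assertions
#include <cstdint>
#include <cstring>
#include <vector>

// A hypothetical record layout; pack(1) removes padding so the struct
// matches the on-disk bytes exactly (compiler-dependent, not standard).
#pragma pack(push, 1)
struct Record {
    std::uint32_t id;
    std::uint16_t flags;
    double        value;
};
#pragma pack(pop)
static_assert(sizeof(Record) == 14, "Record must match the file layout");

// Copy one record out of a raw buffer - the well-defined alternative
// to reinterpret_cast-ing the buffer pointer directly.
Record record_at(const std::vector<char>& buf, std::size_t index)
{
    Record r;
    std::memcpy(&r, buf.data() + index * sizeof(Record), sizeof(Record));
    return r;
}
```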
Well, OK, the header is of variable length, but you have to start somewhere. If you have to read in the whole file first, it can get a bit messy. The whole file can be represented by a struct containing the header up until some length descriptor, followed by a byte array - you can start there. Once you have the header length, you can set a pointer/length to an array of header entries and iterate them, then set a pointer/length for an array of file-content structs, and so on.
All the various arrays of structs would probably need to be packed?
Nasty. I don't really like my own design:(
Anyone got a better idea, other than rewriting the 'separate program' to use a database or XML or something?
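One way to keep that pointer/length walking manageable is a small bounds-checked cursor over the in-memory buffer; the field types below are purely illustrative, not the asker's actual format:

```cpp
#include <cassert>   // for the demo assertions
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// A tiny cursor over an in-memory buffer: read a count, then read that
// many items - the conditional pattern described in the question.
class Cursor {
public:
    Cursor(const char* data, std::size_t size) : p_(data), end_(data + size) {}

    template <typename T>
    T read()                              // copy out one value and advance
    {
        if (static_cast<std::size_t>(end_ - p_) < sizeof(T))
            throw std::runtime_error("truncated file");
        T v;
        std::memcpy(&v, p_, sizeof(T));
        p_ += sizeof(T);
        return v;
    }

private:
    const char* p_;
    const char* end_;
};

// Hypothetical layout: a uint32 count followed by that many int32 records.
std::vector<std::int32_t> read_records(const char* data, std::size_t size)
{
    Cursor c(data, size);
    const auto count = c.read<std::uint32_t>();
    std::vector<std::int32_t> out;
    out.reserve(count);
    for (std::uint32_t i = 0; i < count; ++i)
        out.push_back(c.read<std::int32_t>());
    return out;
}
```

The same cursor composes for nested structures: read a header count, loop, and inside the loop read each entry's own counts and payloads.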