Parsing content out of a structure in a binary file
Using C#, I need to read a packed binary file created using FORTRAN. The file is stored in an "Unformatted Sequential" format as described here (about half-way down the page in the "Unformatted Sequential Files" section):
http://www.tacc.utexas.edu/services/userguides/intel8/fc/f_ug1/pggfmsp.htm
As you can see from the URL, the file is organized into "chunks" of 130 bytes or less and includes 2 length bytes (inserted by the FORTRAN compiler) surrounding each chunk.
So, I need to find an efficient way to parse the actual file payload away from the compiler-inserted formatting.
Once I've extracted the actual payload from the file, I'll then need to parse it up into its varying data types. That'll be the next exercise.
My first thought is to slurp the entire file into a byte array using File.ReadAllBytes, then iterate through the bytes, skipping the formatting and transferring the actual data to a second byte array.
In the end, that second byte array should contain the actual file contents minus all the formatting, which I'd then need to go back through to get what I need.
As I'm fairly new to C#, I thought there might be a better, more accepted way of tackling this.
Also, in case it's helpful, these files could be fairly large (say 30MB), though most will be much smaller...
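One way to read files like this is record by record: read the length bytes, then the data chunk, and build up a list of records, each of which is just a byte array. The collection of records is then passed to further parsing routines.
However, if you're on .NET 4.0, there is a new class for file mapping, MemoryMappedFile, which would be more efficient yet works similarly to ReadAllBytes. Whether you use ReadAllBytes or MemoryMappedFile, it's worth building an in-memory "index" into the large binary file by parsing all the record lengths first. This is especially useful if you will only need certain records.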
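To make the record-by-record idea concrete, here is a minimal sketch. It assumes the framing the question describes: one length byte, then the payload, then a trailing length byte that repeats the same value. That layout is an assumption and has not been checked against the Intel documentation; real files may carry extra header or trailer bytes or continuation markers that this code does not handle. The names ChunkReader, ReadRecords and BuildIndex are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ChunkReader
{
    // ASSUMED framing (from the question, not verified):
    // [1 length byte][payload of that many bytes][1 trailing length byte]
    public static List<byte[]> ReadRecords(Stream stream)
    {
        var records = new List<byte[]>();
        while (true)
        {
            int length = stream.ReadByte();          // -1 means end of stream
            if (length < 0) break;

            // Stream.Read may return fewer bytes than requested, so loop.
            var payload = new byte[length];
            int read = 0;
            while (read < length)
            {
                int n = stream.Read(payload, read, length - read);
                if (n == 0) throw new EndOfStreamException("Truncated chunk payload.");
                read += n;
            }

            int trailer = stream.ReadByte();
            if (trailer != length)
                throw new InvalidDataException("Trailing length byte does not match the leading one.");

            records.Add(payload);
        }
        return records;
    }

    // Builds an index of (payload offset, payload length) without copying any payload.
    // Requires a seekable stream; it does not detect a truncated final chunk.
    public static List<(long Offset, int Length)> BuildIndex(Stream stream)
    {
        var index = new List<(long Offset, int Length)>();
        while (true)
        {
            int length = stream.ReadByte();
            if (length < 0) break;
            index.Add((stream.Position, length));
            stream.Seek(length + 1, SeekOrigin.Current); // skip payload and trailing length byte
        }
        return index;
    }
}
```

Both methods take a Stream, so the same code should work over a FileStream, a MemoryStream wrapping the result of File.ReadAllBytes, or a view stream obtained from a memory-mapped file.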
Rather than iterate through the bytes, take a look at System.IO.BinaryReader. Open the file as a FileStream, wrap it in a BinaryReader, and you can read primitive types from it directly, with the stream pointer keeping track of your offset into the blob. You might have to account for endianness and custom types yourself, maybe by building your own extension methods for BinaryReader on top of its method for reading individual bytes.
If you do need the data in a byte array, you can still use BinaryReader if you wrap the array in a MemoryStream first.
With files that large, I'd steer clear of File.ReadAllBytes. FileStream should buffer for you, and Stephen's suggestion of using memory-mapped files sounds like a more sophisticated (possibly more efficient) alternative to that, especially if you need to make a second pass for the formatting.
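As a rough illustration of the BinaryReader approach, the sketch below wraps a byte array in a MemoryStream and reads a few primitives from it. The payload layout (a 4-byte integer, a 4-byte float, then a big-endian 4-byte integer) is purely hypothetical and only there to show the calls; the extension method ReadInt32BE and the class PayloadDemo are made-up names, not part of the framework.

```csharp
using System;
using System.IO;

public static class BinaryReaderExtensions
{
    // Hypothetical helper: BinaryReader reads integers as little-endian,
    // so big-endian values have to be assembled by hand.
    public static int ReadInt32BE(this BinaryReader reader)
    {
        byte[] b = reader.ReadBytes(4);
        if (b.Length < 4) throw new EndOfStreamException();
        return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
    }
}

public static class PayloadDemo
{
    // Parses an ASSUMED layout: int32 (little-endian), float32, int32 (big-endian).
    public static (int Count, float Value, int BigEndianValue) Parse(byte[] payload)
    {
        using (var reader = new BinaryReader(new MemoryStream(payload)))
        {
            int count = reader.ReadInt32();     // advances the stream by 4 bytes
            float value = reader.ReadSingle();  // advances by another 4 bytes
            int big = reader.ReadInt32BE();     // custom extension method above
            return (count, value, big);
        }
    }
}
```

To read straight from disk instead, the MemoryStream can be swapped for File.OpenRead(path), which returns a FileStream; the reading code stays the same.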