What to put in a binary data file's header
I have a simulation that reads large binary data files that we create (10s to 100s of GB). We use binary for speed reasons. These files are system dependent, converted from text files on each system that we run, so I'm not concerned about portability. The files currently are many instances of a POD struct, written with fwrite.
I need to change the struct, so I want to add a header that has a file version number in it, which will be incremented anytime the struct changes. Since I'm doing this, I want to add some other information as well. I'm thinking of the size of the struct, byte order, and maybe the svn version number of the code that created the binary file. Is there anything else that would be useful to add?
In my experience, second-guessing the data you'll need is invariably wasted time. What's important is to structure your metadata in a way that is extensible. For XML files, that's straightforward, but binary files require a bit more thought.
I tend to store metadata in a structure at the END of the file, not the beginning. This has two advantages:
- Truncated files are easily detected.
- The metadata can be appended to existing files without impacting their reading code.
The simplest metadata footer I use looks something like this:
After the raw data, the metadata footer and THEN the file footer are written.
When reading the file, seek to the end - sizeof(FileFooter). Read the footer, and verify the magicString. Then, seek back according to metadataFooterSize and read the metadata. Depending on the footer size contained in the file, you can use default values for missing fields.
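The write and read steps described above can be sketched roughly as follows. This is not the answerer's original footer definition: only `FileFooter`, `magicString` and `metadataFooterSize` are named in the answer, while the `MetadataFooter` fields, the magic value and the function names are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed layout: [raw records][MetadataFooter][FileFooter] */
typedef struct {
    uint32_t formatVersion;  /* hypothetical fields, for illustration only */
    uint32_t recordSize;
    uint64_t recordCount;
} MetadataFooter;

typedef struct {
    uint32_t metadataFooterSize; /* size of the metadata block before this footer */
    char     magicString[8];     /* always the last bytes of a complete file */
} FileFooter;

/* Hypothetical magic value. */
static const char kMagic[8] = {'S', 'I', 'M', 'D', 'A', 'T', 'A', '1'};

/* Call once all raw records have been written. Returns 0 on success. */
int write_footers(FILE *f, const MetadataFooter *md)
{
    FileFooter ff;
    assert(f != NULL && md != NULL);
    ff.metadataFooterSize = (uint32_t)sizeof *md;
    memcpy(ff.magicString, kMagic, sizeof kMagic);
    if (fwrite(md, sizeof *md, 1, f) != 1) return -1;
    if (fwrite(&ff, sizeof ff, 1, f) != 1) return -1;
    return 0;
}

/* Returns 0 on success, -1 if the footer is missing (e.g. truncated file). */
int read_metadata(FILE *f, MetadataFooter *out)
{
    FileFooter ff;
    size_t n;
    assert(f != NULL && out != NULL);

    /* Offsets relative to the end are small, so a long is enough
       even when the file itself is many GB. */
    if (fseek(f, -(long)sizeof ff, SEEK_END) != 0) return -1;
    if (fread(&ff, sizeof ff, 1, f) != 1) return -1;
    if (memcmp(ff.magicString, kMagic, sizeof kMagic) != 0) return -1;

    /* Fields that an older, shorter footer lacks keep the default 0. */
    memset(out, 0, sizeof *out);
    n = ff.metadataFooterSize < sizeof *out ? ff.metadataFooterSize : sizeof *out;
    if (fseek(f, -(long)(sizeof ff + ff.metadataFooterSize), SEEK_END) != 0) return -1;
    if (n > 0 && fread(out, n, 1, f) != 1) return -1;
    return 0;
}
```

New fields would be added only at the end of `MetadataFooter`, so that a newer reader can still open files written with a shorter footer.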
As KeithB points out, you could even use this technique to store the metadata as an XML string, giving you both totally extensible metadata and the compactness and speed of binary data.
For large binaries I'd look seriously at HDF5 (Google for it). Even if it's not something you want to adopt it might point you in some useful directions in designing your own formats.
For large binaries, in addition to the version number I tend to put a record count and CRC, the reason being that large binaries are much more prone to being truncated and/or corrupted over time or during transfer than smaller ones. I found recently to my horror that Windows does not handle this well at all: I used Explorer to copy about 2TB across a couple of hundred files to an attached NAS device, and found that 2-3 files on each copy were damaged (not completely copied).
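A minimal sketch of such a check, assuming the record count and a CRC-32 of the record bytes were stored somewhere in the file's metadata when it was written. The function names are made up for this example, and the bitwise CRC is chosen for brevity; for files of this size a table-driven or hardware CRC would be the more realistic choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Standard reflected CRC-32 (polynomial 0xEDB88320), bit by bit.
   Pass 0 for the first chunk, then the previous result to continue. */
uint32_t crc32_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;
    int k;
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Reads recordCount records of recordSize bytes from the current position
   and computes their CRC. Returns -1 if the file ends early (truncation)
   or a read error occurs, 0 otherwise. */
int crc_of_records(FILE *f, size_t recordSize, uint64_t recordCount,
                   uint32_t *crcOut)
{
    unsigned char buf[4096];
    uint64_t remaining = recordCount * recordSize;
    uint32_t crc = 0;
    assert(f != NULL && crcOut != NULL);
    while (remaining > 0) {
        size_t want = remaining < sizeof buf ? (size_t)remaining : sizeof buf;
        size_t got = fread(buf, 1, want, f);
        if (got == 0) return -1; /* fewer bytes than the stored count implies */
        crc = crc32_update(crc, buf, got);
        remaining -= got;
    }
    *crcOut = crc;
    return 0;
}
```

A reader would compare the returned CRC with the stored one; a mismatch means corruption, and a return value of -1 means the file is shorter than its record count claims.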
An identifier for the type of the file would be useful if you will have other structures written to binary files later on.
Maybe this could be a short string so you can see by a look into the file (via hex editor) what it contains.
If they're that large, I'd reserve a healthy chunk (64K?) of space at the beginning of the file and put the metadata there in XML format followed by an end-of-file character (Ctrl-Z for DOS/Windows, ctrl-D for unix?). That way you can examine and parse the metadata easily with the wide range of toolsets out there for XML.
Otherwise I go with what other people have already said: timestamp for file creation, identifier for which machine it's created on, basically anything else that you can think of for diagnostic purposes. And ideally you would include the definition of the structure format itself. If you are changing the structure often, it's a big pain to maintain the proper version of code around to read various formats of old datafiles.
One big advantage of HDF5, as @highpercomp has mentioned, is that you just don't need to worry about changes in the structure format, as long as you have some convention of what the names and datatypes are. The structure names and datatypes are all stored in the file itself, so you can blow your C code to smithereens and it doesn't matter, you can still retrieve data from an HDF5 file. It lets you worry less about the format of data and more about the structure of data, i.e. I don't care about the sequence of bytes, that's HDF5's problem, but I do care about field names and the like.
Another reason I like HDF5 is you can choose to use compression, which takes a very small amount of time and can give you huge wins in storage space if the data is slowly-changing or mostly the same except for a few errant blips of interestingness.
@rstevens said 'an identifier for the type of file'...sound advice. Conventionally, that's called a magic number and, in a file, isn't a term of abuse (unlike in code, where it is a term of abuse). Basically, it is some number - typically at least 4 bytes, and I usually ensure that at least one of those bytes is not ASCII - that you can use to validate that the file is of the type you expect with a low probability of being confused. You can also write a rule in /etc/magic (or local equivalent) to report that files containing your magic number are your special file type.
You should include a file format version number. However, I would recommend not using the SVN number of the code. Your code may change when the file format does not.
In addition to whatever information you need for schema versioning, add details that may be of value if you are troubleshooting an issue. For example:
We find this is very useful (a) in getting information we would otherwise have to ask the customer to provide and (b) getting correct information -- it is amazing how many customers report they are running a different version of the software to what the data claims!
You might consider putting a file offset in a fixed position in the header, which tells you where the actual data begins in the file. This would let you change the size of the header when needed.
In a couple of cases, I put the value 0x12345678 into the header so I could detect whether the file's byte order matched the endianness of the machine that was processing it.
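Both ideas fit in a small fixed header. The sketch below is only one possible layout: the struct, its field names, the magic bytes and the return codes are assumptions for illustration, not something taken from the answers here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BYTE_ORDER_MARK 0x12345678u

enum {
    HEADER_OK = 0,
    HEADER_IO_ERROR = -1,
    HEADER_BAD_MAGIC = -2,
    HEADER_WRONG_BYTE_ORDER = -3
};

/* Hypothetical fixed-position header. */
typedef struct {
    unsigned char magic[4];  /* file-type identifier, first byte non-ASCII */
    uint32_t byteOrderMark;  /* 0x12345678 in the writer's byte order */
    uint32_t formatVersion;  /* incremented when the record struct changes */
    uint32_t recordSize;
    uint64_t dataOffset;     /* where the first record starts */
} FileHeader;

static const unsigned char kHeaderMagic[4] = {0x89, 'S', 'I', 'M'};

/* Writes the header at offset 0 and zero-pads up to dataOffset, so the
   header area can grow later without moving existing readers' logic. */
int write_header(FILE *f, uint32_t formatVersion, uint32_t recordSize,
                 uint64_t dataOffset)
{
    FileHeader h;
    uint64_t i;
    assert(f != NULL && dataOffset >= sizeof h);
    memset(&h, 0, sizeof h);
    memcpy(h.magic, kHeaderMagic, sizeof kHeaderMagic);
    h.byteOrderMark = BYTE_ORDER_MARK;
    h.formatVersion = formatVersion;
    h.recordSize = recordSize;
    h.dataOffset = dataOffset;
    if (fseek(f, 0L, SEEK_SET) != 0) return HEADER_IO_ERROR;
    if (fwrite(&h, sizeof h, 1, f) != 1) return HEADER_IO_ERROR;
    for (i = sizeof h; i < dataOffset; i++)
        if (fputc(0, f) == EOF) return HEADER_IO_ERROR;
    return HEADER_OK;
}

/* Validates the header and leaves the stream positioned at the first record. */
int read_header(FILE *f, FileHeader *h)
{
    assert(f != NULL && h != NULL);
    if (fseek(f, 0L, SEEK_SET) != 0) return HEADER_IO_ERROR;
    if (fread(h, sizeof *h, 1, f) != 1) return HEADER_IO_ERROR;
    if (memcmp(h->magic, kHeaderMagic, sizeof kHeaderMagic) != 0)
        return HEADER_BAD_MAGIC;
    /* A file from a machine with the other byte order reads as 0x78563412. */
    if (h->byteOrderMark != BYTE_ORDER_MARK) return HEADER_WRONG_BYTE_ORDER;
    if (fseek(f, (long)h->dataOffset, SEEK_SET) != 0) return HEADER_IO_ERROR;
    return HEADER_OK;
}
```

Because readers always jump to `dataOffset` rather than to `sizeof(FileHeader)`, the header can be extended later without breaking them.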
In my experience with telecom equipment configuration and firmware upgrades, you only really need a few predefined bytes at the beginning (this is important), starting with the version; this is the fixed part of the header. The rest of the header is optional: by indicating the proper version you can always show how to process it. The important thing here is that you'd better place the 'variable' part of the header at the end of the file if you plan to operate on the header without modifying the file content itself. This also simplifies 'append' operations, which have to recalculate the variable header part.
Nice-to-have features for the fixed-size header (at the beginning):
OK, for the variable part, XML or some other extensible format in the header is a good idea, but is it really needed? I have had a lot of experience with ASN encoding... in most cases its usage was overkill.
Well, maybe you will get a better understanding when you look at things like the TPKT format, which is described in RFC 2126 (chapter 4.3).
If you are putting a version number in the header you can change that version anytime you need to change the POD struct or add new fields to the header.
So don't add stuff to the header now because it might be interesting. You are just creating code that you have to maintain but that has little real value.
For large files, you might want to add data definitions, so your file format becomes self-describing.
My variation combines Roddy and Jason S's approaches.
In summary - put formatted text metadata at the end of the file with a way to determine its length stored elsewhere.
1) Put a length field at the beginning of your file so you know the length of the metadata at the end rather than assuming a fixed length. That way, to get the metadata you just read that fixed-length initial field and then get the metadata blob from the end of the file.
2) Use XML or YAML or JSON for the metadata. This is especially useful/safe if the metadata is appended at the end because nobody reading the file is going to automatically think it's all XML just because it starts with XML.
The only disadvantage of this approach is that when your metadata grows, you have to update both the head and the tail of the file, but it's likely other parts will have been updated anyway. If it's just updating trivia like a last-accessed date, then the metadata length won't change, so it only needs an in-place update.