Binary file format: do I need error correction?
I need to serialize some data in a binary format for efficiency (data logging, where 10-100MB files are typical), and I'm working out the formatting details. I'm wondering whether, realistically, I need to worry about file corruption, error correction, and so on.
Under what circumstances can file corruption happen? Should I build robustness to corruption into my binary format? Or should I wrap my non-robust stream of bytes in some kind of error-correcting code? (Any suggestions? I'm using Java.) Or should I just not worry about this?
Edit: my preliminary binary format, as it stands right now, contains a bunch of variable-length segments, so I am slightly worried that if I ever do have data corruption, then on reading the file back I could get out of sync, be unable to recover, and lose the rest of the file.
Comments (4)
You should at least add a checksum. The bit error rate (BER) is low on modern hard drives, but that is not true of other media. Power loss during a write usually corrupts the end of the file. If the data is important, you will need error correction codes, triple and unbuffered writes, etc. to commit transactions.
EXE files have no error correction, even though a single-bit change can have drastic consequences.
If a file is to be transferred over TCP, you may assume zero errors.
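As a rough illustration of the "at least add a checksum" advice, here is a minimal sketch that frames one record as length, payload, and a CRC-32 of the payload, using `java.util.zip.CRC32` from the standard library. The class name `ChecksummedRecord` and the layout are assumptions made up for this example, not something from the question's actual format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksummedRecord {
    // Hypothetical layout: [int length][payload bytes][int CRC-32 of payload]

    public static byte[] frame(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length + 4);
        buf.putInt(payload.length);
        buf.put(payload);
        buf.putInt((int) crc.getValue());
        return buf.array();
    }

    // Returns the payload, or null if the frame is truncated or fails the CRC check.
    public static byte[] unframe(byte[] framed) {
        if (framed.length < 8) return null;
        ByteBuffer buf = ByteBuffer.wrap(framed);
        int length = buf.getInt();
        if (length < 0 || length != framed.length - 8) return null;
        byte[] payload = new byte[length];
        buf.get(payload);
        int stored = buf.getInt();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ((int) crc.getValue() == stored) ? payload : null;
    }

    public static void main(String[] args) {
        byte[] framed = frame("sample 42".getBytes(StandardCharsets.UTF_8));
        System.out.println("intact frame accepted: " + (unframe(framed) != null));
        framed[6] ^= 0x01; // flip a single bit inside the payload
        System.out.println("damaged frame accepted: " + (unframe(framed) != null));
    }
}
```

A CRC only detects damage; it cannot repair it. It is cheap, though, and it tells the reader which records to distrust.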
I have seen it happen once or twice that a file transferred over the Internet became corrupted. You can do error detection using a checksum or hash, such as SHA-256.
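For whole-file detection along these lines, a sketch using the standard `java.security.MessageDigest` API might look like the following. The class and method names (`Sha256Digest`, `sha256Hex`) are invented for this example. The stream is read in fixed-size blocks, so a 10-100MB file does not have to be loaded into memory at once.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Digest {
    // Streams the input through SHA-256 and returns the digest as lowercase hex.
    public static String sha256Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] block = new byte[8192];
            int n;
            while ((n = in.read(block)) != -1) {
                md.update(block, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "abc".getBytes(StandardCharsets.UTF_8);
        // Store this value alongside the file; recompute and compare after a transfer.
        System.out.println(sha256Hex(new ByteArrayInputStream(data)));
    }
}
```

In practice the input would be a `FileInputStream` over the data file, and the stored digest would be compared with a freshly computed one before the file is trusted.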
You might be interested in the notes on error detecting codes in HDF5. Where to put checksums and what kind to use depend on how you access and update the data, as well as on what size of chunk is useful to detect an error in.
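To make the per-chunk idea concrete for a format with variable-length segments, here is one possible sketch: each chunk carries a sync marker, a length, and a CRC-32, and the reader slides forward byte by byte until it finds the next chunk that validates. The marker value `0xC0DEFEED`, the layout, and the class name `ResyncReader` are all assumptions for illustration and are not taken from HDF5 or from the question's format.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

public class ResyncReader {
    static final int MAGIC = 0xC0DEFEED; // hypothetical sync marker
    static final int OVERHEAD = 12;      // marker + length + CRC, 4 bytes each

    static int crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return (int) crc.getValue();
    }

    // Chunk layout: [MAGIC][int length][payload][int CRC-32 of payload]
    public static byte[] writeChunk(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(OVERHEAD + payload.length);
        buf.putInt(MAGIC).putInt(payload.length).put(payload).putInt(crcOf(payload));
        return buf.array();
    }

    // Returns the payloads of every chunk that validates, skipping damaged regions.
    public static List<byte[]> recover(byte[] file) {
        List<byte[]> payloads = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(file);
        int pos = 0;
        while (pos + OVERHEAD <= file.length) {
            int len = buf.getInt(pos + 4);
            if (buf.getInt(pos) == MAGIC && len >= 0 && len <= file.length - pos - OVERHEAD) {
                byte[] payload = Arrays.copyOfRange(file, pos + 8, pos + 8 + len);
                if (buf.getInt(pos + 8 + len) == crcOf(payload)) {
                    payloads.add(payload);
                    pos += OVERHEAD + len; // jump to the start of the next chunk
                    continue;
                }
            }
            pos++; // no valid chunk starts here: slide forward one byte and retry
        }
        return payloads;
    }

    public static void main(String[] args) {
        byte[] a = writeChunk(new byte[] {1});
        byte[] b = writeChunk(new byte[] {2, 2});
        byte[] c = writeChunk(new byte[] {3, 3, 3});
        byte[] file = ByteBuffer.allocate(a.length + b.length + c.length).put(a).put(b).put(c).array();
        file[a.length + 8] ^= 0x01; // damage the payload of the middle chunk
        System.out.println("chunks recovered: " + recover(file).size());
    }
}
```

The trade-off is 12 bytes of overhead per chunk in exchange for losing only the damaged chunk rather than everything after it. Smaller chunks limit the loss further but cost more overhead.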
I went with a Reed-Solomon encoding system. There's a fairly easy-to-use Java implementation of it in the Google zxing library.