Best way to read structured binary files with Java
I have to read a binary file in a legacy format with Java.
In a nutshell the file has a header consisting of several integers, bytes and fixed-length char arrays, followed by a list of records which also consist of integers and chars.
In any other language I would create structs (C/C++) or records (Pascal/Delphi) which are byte-by-byte representations of the header and the record. Then I'd read sizeof(header) bytes into a header variable and do the same for the records.
Something like this: (Delphi)
    type
      THeader = record
        Version: Integer;
        Type: Byte;
        BeginOfData: Integer;
        ID: array[0..15] of Char;
      end;
    ...
    procedure ReadData(S: TStream);
    var
      Header: THeader;
    begin
      S.ReadBuffer(Header, SizeOf(THeader));
      ...
    end;
What is the best way to do something similar with Java? Do I have to read every single value on its own or is there any other way to do this kind of "block-read"?
To my knowledge, Java forces you to read a file as bytes; you cannot block-read straight into a structure. If you were serializing Java objects, it'd be a different story.
The other examples shown use the DataInputStream class with a File, but you can also use a shortcut: the RandomAccessFile class.
Note that you could turn the response objects into a class, if that would make it easier.
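The RandomAccessFile approach could look roughly like the sketch below. It is not the code from the original answer. It assumes the Delphi record is packed (25 bytes, no padding), that integers are stored little-endian, and that Char is one byte; the class names RandomAccessHeaderReader and Header are made up for illustration. RandomAccessFile.readInt() always reads big-endian, so the sketch swaps the bytes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class RandomAccessHeaderReader {
    /** Plain holder for the header; field names follow the question's THeader. */
    static class Header {
        int version;
        byte type;
        int beginOfData;
        String id;
    }

    /** Reads the 25-byte header from the file's current position. */
    static Header readHeader(RandomAccessFile file) throws IOException {
        Header h = new Header();
        // Assumption: the file is little-endian, readInt() is big-endian, so swap.
        h.version = Integer.reverseBytes(file.readInt());
        h.type = file.readByte();
        h.beginOfData = Integer.reverseBytes(file.readInt());
        byte[] id = new byte[16];
        file.readFully(id); // readFully fills the array or throws EOFException
        // Assumption: single-byte characters, NUL-padded; trim() drops the padding.
        h.id = new String(id, StandardCharsets.ISO_8859_1).trim();
        return h;
    }
}
```

Because RandomAccessFile also has seek(), you could jump to the records afterwards with file.seek(h.beginOfData), assuming BeginOfData is a byte offset from the start of the file.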
If you were using Preon, all you would have to do is this:
Once you have this, you create a Codec using a single line:
And you use the Codec like this:
You could use the DataInputStream class as follows:
Once you get these values you can do with them as you please. Look up the java.io.DataInputStream class in the API for more info.
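A possible shape for the DataInputStream approach, as a sketch only. DataInputStream reads big-endian, so this assumes the file's integers are big-endian too. It also assumes one-byte, NUL-padded characters and that the records start right after the 25-byte header. The record layout (an int key plus an 8-byte name) and the names DataInputStreamDemo, readAll and readFixedString are invented for the example.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class DataInputStreamDemo {
    /** Reads a fixed-length, NUL-padded, single-byte string. */
    static String readFixedString(DataInputStream in, int length) throws IOException {
        byte[] raw = new byte[length];
        in.readFully(raw);
        return new String(raw, StandardCharsets.ISO_8859_1).trim();
    }

    /** Returns one line for the header, then one line per record. */
    static List<String> readAll(InputStream source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(source))) {
            int version = in.readInt();
            int type = in.readUnsignedByte();
            int beginOfData = in.readInt();
            String id = readFixedString(in, 16);
            lines.add("header version=" + version + " type=" + type
                    + " beginOfData=" + beginOfData + " id=" + id);
            while (true) {
                int key;
                try {
                    key = in.readInt();
                } catch (EOFException endOfFile) {
                    break; // no further record key could be read
                }
                // Hypothetical record layout: int key followed by an 8-byte name.
                lines.add(key + ":" + readFixedString(in, 8));
            }
        }
        return lines;
    }
}
```

You would call it with something like DataInputStreamDemo.readAll(new FileInputStream("legacy.dat")), where the file name is a placeholder.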
I may have misunderstood you, but it seems to me you're creating in-memory structures that you hope will be a byte-for-byte accurate representation of what you want to read from the hard disk, then copying the whole thing into memory and manipulating it there?
If that's indeed the case, you're playing a very dangerous game. At least in C, the standard doesn't enforce things like padding or alignment of the members of a struct. Not to mention things like big/little endianness or parity bits... So even if your code happens to run, it's very non-portable and risky - you depend on the compiler's creator not changing their mind in future versions.
Better to create an automaton that both validates the structure being read (byte by byte) from the HD and fills an in-memory structure if it's indeed OK. You may lose some milliseconds (not as many as it may seem, since modern OSes do a lot of disk read caching), though you gain platform and compiler independence. Plus, your code will be easily ported to another language.
Post Edit: In a way I sympathize with you. In the good-ol' days of DOS/Win3.11, I once created a C program to read BMP files, and used exactly the same technique. Everything was nice until I tried to compile it for Windows - oops!! Int was now 32 bits long, rather than 16! When I tried to compile on Linux, I discovered gcc had very different rules for bit-field allocation than Microsoft C (6.0!). I had to resort to macro tricks to make it portable...
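In Java, the "validate while you read" idea from this answer might be sketched as follows. The layout (packed 25-byte header, little-endian integers) and the limits are assumptions for illustration: MAX_VERSION in particular is a made-up value, and the class name ValidatingHeaderParser is hypothetical. Assembling each integer from individual bytes keeps the result independent of platform and compiler.

```java
class ValidatingHeaderParser {
    static final int HEADER_SIZE = 25; // assumption: packed layout, 1-byte chars
    static final int MAX_VERSION = 3;  // hypothetical: newest version this reader knows

    /** Builds a little-endian 32-bit integer from four bytes, one byte at a time. */
    static int readIntLE(byte[] data, int offset) {
        return (data[offset] & 0xFF)
                | (data[offset + 1] & 0xFF) << 8
                | (data[offset + 2] & 0xFF) << 16
                | (data[offset + 3] & 0xFF) << 24;
    }

    /** Returns {version, type, beginOfData} or throws if the header is implausible. */
    static int[] parse(byte[] file) {
        if (file.length < HEADER_SIZE) {
            throw new IllegalArgumentException("truncated header: " + file.length + " bytes");
        }
        int version = readIntLE(file, 0);
        if (version < 1 || version > MAX_VERSION) {
            throw new IllegalArgumentException("unsupported version: " + version);
        }
        int type = file[4] & 0xFF; // treat the byte as unsigned
        int beginOfData = readIntLE(file, 5);
        if (beginOfData < HEADER_SIZE || beginOfData > file.length) {
            throw new IllegalArgumentException("beginOfData out of range: " + beginOfData);
        }
        return new int[] { version, type, beginOfData };
    }
}
```

The checks here are deliberately minimal; a real format would dictate which values are legal.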
I used Javolution and javastruct; both handle the conversion between bytes and objects.
Javolution provides classes that represent C types. All you need to do is to write a class that describes the C structure. For example, from the C header file,
should be translated into:
Then call setByteBuffer to initialize the object:
javastruct uses annotations to define fields in a C structure.
To initialize an object:
I guess FileInputStream lets you read in bytes. So open the file with FileInputStream and read sizeof(header) bytes. I am assuming that the header has a fixed format and size. I don't see that mentioned in the initial post, but I'm assuming it is the case, as it would get much more complex if the header had optional args and different sizes.
Once you have the info, you can have a header class to which you assign the contents of the buffer that you've already read. Then parse the records in a similar fashion.
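The "read sizeof(header) bytes" step could be sketched like this. Java has no sizeof, so the 25-byte HEADER_SIZE is an assumption (packed record, one-byte chars), and BlockReader is a hypothetical name. The loop is there because InputStream.read may return fewer bytes than requested.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class BlockReader {
    static final int HEADER_SIZE = 25; // assumption: 4 + 1 + 4 + 16 bytes, no padding

    /** Reads exactly size bytes from the stream, or throws EOFException. */
    static byte[] readBlock(InputStream in, int size) throws IOException {
        byte[] block = new byte[size];
        int filled = 0;
        while (filled < size) {
            // read() may deliver only part of what was asked for, so keep going
            int n = in.read(block, filled, size - filled);
            if (n < 0) {
                throw new EOFException("expected " + size + " bytes, got " + filled);
            }
            filled += n;
        }
        return block;
    }
}
```

With a FileInputStream named in, BlockReader.readBlock(in, BlockReader.HEADER_SIZE) would give you the raw header bytes, which a header class can then decode field by field.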
Here is a link on reading bytes using a ByteBuffer (Java NIO):
http://exampledepot.com/egs/java.nio/ReadChannel.html
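In case the link is unavailable, here is a minimal sketch of reading the header through a FileChannel into a ByteBuffer. It is not taken from the linked page. It assumes a packed 25-byte header, little-endian integers and one-byte NUL-padded characters; ChannelHeaderReader and describeHeader are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelHeaderReader {
    /** Reads the header in one block and returns "version/type/beginOfData/id". */
    static String describeHeader(Path file) throws IOException {
        // Assumption: the legacy file stores integers little-endian.
        ByteBuffer buf = ByteBuffer.allocate(25).order(ByteOrder.LITTLE_ENDIAN);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (buf.hasRemaining()) {
                if (channel.read(buf) < 0) {
                    throw new EOFException("file is shorter than the header");
                }
            }
        }
        buf.flip(); // switch the buffer from filling to reading
        int version = buf.getInt();
        byte type = buf.get();
        int beginOfData = buf.getInt();
        byte[] id = new byte[16];
        buf.get(id);
        return version + "/" + type + "/" + beginOfData + "/"
                + new String(id, StandardCharsets.ISO_8859_1).trim();
    }
}
```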
As other people mention, DataInputStream and Buffers are probably the low-level APIs you are after for dealing with binary data in Java.
However, you probably want something like Construct (the wiki page has good examples too: http://en.wikipedia.org/wiki/Construct_(python_library)), but for Java.
I don't know of any (Java versions) offhand, but taking that approach (declaratively specifying the struct in code) would probably be the right way to go. With a suitable fluent interface in Java it would probably be quite similar to a DSL.
EDIT: a bit of googling reveals this:
http://javolution.org/api/javolution/io/Struct.html
Which might be the kind of thing you are looking for. I have no idea whether it works or is any good, but it looks like a sensible place to start.
I would create an object that wraps around a ByteBuffer representation of the data and provide getters to read directly from the buffer. In this way, you avoid copying data from the buffer to primitive types. Furthermore, you could use a MappedByteBuffer to get the byte buffer. If your binary data is complex, you can model it using classes and give each class a sliced version of your buffer.
Also useful are the methods for reading unsigned values from byte buffers.
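A rough sketch of such a wrapper, under assumed offsets for the question's header (packed, little-endian, one-byte chars). HeaderView and its getters are made-up names. The getters use absolute reads, so nothing is copied until a value is actually requested, and masking turns the signed Java types into unsigned values.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class HeaderView {
    private final ByteBuffer buf;

    /** Wraps the 25 header bytes starting at the source buffer's current position. */
    HeaderView(ByteBuffer source) {
        // slice() makes index 0 the caller's position and leaves the caller's
        // position untouched; a slice is big-endian by default, so set the order.
        this.buf = source.slice().order(ByteOrder.LITTLE_ENDIAN);
    }

    int version() {
        return buf.getInt(0);
    }

    /** The type byte as an unsigned value (0..255). */
    int type() {
        return buf.get(4) & 0xFF;
    }

    /** The data offset as an unsigned 32-bit value held in a long. */
    long beginOfData() {
        return buf.getInt(5) & 0xFFFFFFFFL;
    }

    String id() {
        byte[] raw = new byte[16];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = buf.get(9 + i); // absolute get, position stays where it is
        }
        return new String(raw, StandardCharsets.ISO_8859_1).trim();
    }
}
```

The same class works on a MappedByteBuffer obtained from FileChannel.map(FileChannel.MapMode.READ_ONLY, 0, size), since MappedByteBuffer is a ByteBuffer.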
HTH
I've written up a technique to do this sort of thing in java - similar to the old C-like idiom of reading bit-fields. Note it is just a start but could be expanded upon.
here
In the past I used DataInputStream to read data of arbitrary types in a specified order. This will not allow you to easily account for big-endian/little-endian issues, because it always reads multi-byte values as big-endian.
As of 1.4 the java.nio.Buffer family might be the way to go, although your code might actually end up more complicated. These classes do have support for handling endian issues.
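To make the endian difference concrete, here is a small sketch that decodes the same four bytes both ways. EndianDemo and its method names are just for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    /** DataInputStream has no byte-order setting: the first byte is the most significant. */
    static int bigEndian(byte[] four) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(four))) {
            return in.readInt();
        }
    }

    /** ByteBuffer lets you choose: here the first byte is the least significant. */
    static int littleEndian(byte[] four) {
        return ByteBuffer.wrap(four).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }
}
```

For the bytes 01 00 00 00, bigEndian returns 16777216 (0x01000000) and littleEndian returns 1. If you are stuck with DataInputStream, Integer.reverseBytes converts between the two.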
A while ago I found this article on using reflection and parsing to read binary data. In this case, the author is using reflection to read the java binary .class files. But if you are reading the data into a class file, it may be of some help.