Separating ASCII text from binary content in a file
I have a file that has both ASCII text and binary content. I would like to extract the text without having to parse the binary content, as the binary content is 180 MB. Can I simply extract the text for further manipulation? What would be the best way of going about it?
The ASCII is at the very beginning of the file.
There are 4 libraries to read FITS files in Java here:
I am not aware of any Java classes that will read the ASCII characters and ignore the rest, but the easiest thing I can come up with here is to use the strings utility (assuming you are on a Unix-based system). You could then pipe the output to another file and do whatever you want with it.
Edit: with the additional information that all the ASCII comes at the beginning, it would be a little easier to extract the text programmatically; still, this is faster than writing code.
Assuming you can tell where the end of the ASCII content is, just read characters from the file until you find the end of it, and close the file.
Supposing that there is some token which divides the file into the ASCII and binary components (say, "#END#" on a line all by itself), you can do something like the following:
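A minimal sketch of that idea, assuming the marker really is the literal line "#END#" (the class and method names here are made up for illustration). It reads line by line and stops at the marker, so the binary part is never consumed beyond what the reader has already buffered:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class MarkerHeaderReader {
    // Hypothetical marker; substitute whatever actually separates the two parts.
    static final String END_MARKER = "#END#";

    // Returns every line that precedes the marker line, each terminated by '\n'.
    static String readTextBeforeMarker(InputStream in) throws IOException {
        // ISO-8859-1 maps each byte to exactly one char, so any binary bytes
        // pulled into the buffer are decoded without surprises.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        StringBuilder text = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.equals(END_MARKER)) {
                break; // stop here; the rest of the file is never read
            }
            text.append(line).append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(readTextBeforeMarker(in));
        }
    }
}
```

If the marker is missing, this loop would keep going through the binary data, so a real version may want an upper bound on how much it is willing to read.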
Have a method that checks whether a particular character meets your criteria (here, I've covered characters that are found on the keyboard). Once you hit a character for which the method returns false, you know you've hit the binary. Note that valid ASCII characters may also form part of the binary so you may end up with a few extra characters at the end.
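One way this could look, as a sketch only: the set of accepted bytes below (printable ASCII plus tab, line feed and carriage return) is an assumption about what "found on the keyboard" means, and the names are illustrative.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

class LeadingAsciiReader {
    // Assumed criterion: printable ASCII (space through tilde) plus tab, LF and CR.
    static boolean isTextByte(int b) {
        return (b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\n' || b == '\r';
    }

    // Reads bytes from the start of the stream until the first one that fails the check.
    static String readLeadingAscii(InputStream in) throws IOException {
        InputStream buffered = new BufferedInputStream(in);
        StringBuilder text = new StringBuilder();
        int b;
        while ((b = buffered.read()) != -1 && isTextByte(b)) {
            text.append((char) b);
        }
        return text.toString();
    }
}
```

As noted above, if the binary section happens to begin with bytes in the accepted range, they will be appended to the result, so the tail of the returned string may need trimming.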
The first 2880 bytes of a FITS file are ASCII header data, representing 36 80-column "card images". There are no line terminator characters, just a 36x80 ASCII array, padded out with blanks if necessary. There may be additional 2880-byte ASCII headers preceding the binary data; you'd have to parse the first set of headers to know how much ASCII to expect.
But I heartily endorse Oscar Reyes' advice to use an existing package to decode FITS files! Two of the packages he mentioned are hosted by NASA's Goddard Space Flight Center, who are also responsible for maintaining the FITS format. That's about as definitive a source as you can get.
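If pulling in a library is not an option, the layout described above can be read by hand. The following is a rough sketch, not a FITS parser: it reads 2880-byte blocks, splits each into 80-character cards, and stops at the card whose keyword field is END, which is how a FITS header is terminated. It only handles the primary header and ignores any extension headers that may appear later in the file. The class and method names are invented for this example.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class FitsHeaderSketch {
    static final int BLOCK_SIZE = 2880; // header blocks are always this long
    static final int CARD_SIZE = 80;    // 36 cards per block

    // Returns the cards of the primary header, excluding the END card.
    static List<String> readPrimaryHeader(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<String> cards = new ArrayList<>();
        byte[] block = new byte[BLOCK_SIZE];
        while (true) {
            // readFully throws EOFException if the file ends before END is seen.
            data.readFully(block);
            for (int off = 0; off < BLOCK_SIZE; off += CARD_SIZE) {
                String card = new String(block, off, CARD_SIZE, StandardCharsets.US_ASCII);
                // The keyword occupies the first 8 columns, blank padded.
                if (card.substring(0, 8).equals("END     ")) {
                    return cards;
                }
                cards.add(card);
            }
        }
    }
}
```

At most a few header blocks are read this way, so the 180 MB of binary data is never touched.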