Separating ASCII text from binary content in a file

Posted on 2024-08-21 15:19:38

I have a file that contains both ASCII text and binary content. I would like to extract the text without having to parse the binary content, since the binary part is 180 MB. Can I simply extract the text for further manipulation? What would be the best way to go about it?

The ASCII is at the very beginning of the file.

Comments (6)

不顾 2024-08-28 15:19:38

There are four libraries for reading FITS files in Java:

Java

nom.tam.fits classes

A Java FITS library has been developed which provides efficient -- at least for Java -- I/O for FITS images and binary tables. The Java libraries support all basic FITS formats and gzip compressed files. Support for access to data subsets is included and the HIERARCH convention may be used.

eap.fits

Includes an applet and application for viewing and editing FITS files. Also includes a general purpose package for reading and writing FITS data. It can read PGP encrypted files if the optional PGP jar file is available.

jfits

The jfits library supports FITS images and ASCII and binary tables. In-line modification of keywords and data is supported.

STIL

A pure java general purpose table I/O library which can read and write FITS binary tables amongst other table formats. It is efficient and can provide fast sequential or random read access to FITS tables much larger than physical memory. There is no support for FITS images.

情话墙 2024-08-28 15:19:38

I am not aware of any Java classes that will read the ASCII characters and ignore the rest, but the easiest thing I can come up with here is to use the strings utility (assuming you are on a Unix-based system).

SYNOPSIS
    strings [ - ] [ -a ] [ -o ] [ -t format ] [ -number ] [ -n number ] [--] [file ...]

DESCRIPTION
    Strings looks for ASCII strings in a binary file or standard input. Strings is useful for identifying random object files and many other things. A string is any sequence of 4 (the default) or more printing characters ending with a newline or a null. Unless the - flag is given, strings looks in all sections of the object files except the (__TEXT,__text) section. If no files are specified standard input is read.

You could then redirect the output to another file and do whatever you want with it.

Edit: with the additional information that all the ASCII comes at the beginning, it would be a little easier to extract the text programmatically; still, this is faster than writing code.

恏ㄋ傷疤忘ㄋ疼 2024-08-28 15:19:38

Assuming you can tell where the end of the ASCII content is, just read characters from the file until you find the end of it, and close the file.

醉殇 2024-08-28 15:19:38

Supposing that there is some token which divides the file into the ASCII and binary components (say, "#END#" on a line all by itself), you can do something like the following:

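import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// ...

public static void main(String[] args) {
  // try-with-resources closes the file as soon as the loop ends,
  // so the binary part after the marker is never parsed.
  try (BufferedReader b = new BufferedReader(new InputStreamReader(
      new FileInputStream("object.bin"), StandardCharsets.US_ASCII))) {
    String s;
    // Use equals(): != would compare object references, not contents.
    // readLine() returns null at end of file, so stop there as well.
    while ((s = b.readLine()) != null && !s.equals("#END#")) {
      // ASCII contents parsed here.
      System.out.println(s);
    }
  } catch (IOException e) {
    System.err.println("kablammo! " + e.getMessage());
  }
}
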
双马尾 2024-08-28 15:19:38

Have a method that checks whether a particular character meets your criteria (here, I've covered characters that are found on the keyboard). Once you hit a character for which the method returns false, you know you've hit the binary. Note that valid ASCII characters may also form part of the binary so you may end up with a few extra characters at the end.

static boolean isAsciiCharacter(char c) {
    return (c >= ' ' && c <= '~') ||
            c == '\n' ||
            c == '\r';
}
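
As a rough sketch of how that check might be used, the following reads the file one byte at a time and stops at the first byte that fails it. The class name `AsciiPrefix` and the method `readAsciiPrefix` are made up for illustration, and the file path is assumed to be passed as the first command-line argument.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class AsciiPrefix {
    // Same check as above: printable ASCII plus line terminators.
    static boolean isAsciiCharacter(char c) {
        return (c >= ' ' && c <= '~') || c == '\n' || c == '\r';
    }

    // Collects bytes until the first one that fails the check, or end of stream.
    static String readAsciiPrefix(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && isAsciiCharacter((char) b)) {
            text.append((char) b);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Buffering keeps the byte-by-byte reads cheap; the rest of the
        // file is never read once the loop stops.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.print(readAsciiPrefix(in));
        }
    }
}
```

Because the stream is closed as soon as the first non-matching byte is seen, the 180 MB of binary data is never scanned. The caveat above still applies: if the binary part happens to begin with bytes in the printable range, they will be appended to the text.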
不念旧人 2024-08-28 15:19:38

The first 2880 bytes of a FITS file are ASCII header data, representing 36 80-column "card images". There are no line terminator characters, just a 36x80 ASCII array, padded out with blanks if necessary. There may be additional 2880-byte ASCII headers preceding the binary data; you'd have to parse the first set of headers to know how much ASCII to expect.
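
For illustration only, here is a minimal sketch of reading that header by hand. It relies on one detail not stated above: in the FITS standard, the header's last card is the keyword `END` followed by blanks. The class `FitsHeaderReader` and the method `readPrimaryHeader` are hypothetical names, and the sketch reads only the first header, not the headers of any extensions that may follow the data.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class FitsHeaderReader {
    static final int CARD_LENGTH = 80;    // one "card image"
    static final int BLOCK_LENGTH = 2880; // 36 cards per header block

    // Reads 2880-byte blocks until the END card and returns the cards before it.
    static List<String> readPrimaryHeader(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<String> cards = new ArrayList<>();
        byte[] block = new byte[BLOCK_LENGTH];
        while (true) {
            data.readFully(block); // throws EOFException if the file is truncated
            for (int offset = 0; offset < BLOCK_LENGTH; offset += CARD_LENGTH) {
                String card = new String(block, offset, CARD_LENGTH, StandardCharsets.US_ASCII);
                // Assumption: the END card is "END" followed only by blanks.
                if (card.startsWith("END") && card.substring(3).trim().isEmpty()) {
                    return cards;
                }
                cards.add(card);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            for (String card : readPrimaryHeader(in)) {
                System.out.println(card.trim());
            }
        }
    }
}
```

Reading whole blocks means the stream is left positioned at the start of the data that follows the header, and no binary content is touched. For anything beyond inspecting the header, one of the existing libraries is likely the safer choice.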

But I heartily endorse Oscar Reyes' advice to use an existing package to decode FITS files! Two of the packages he mentioned are hosted by NASA's Goddard Space Flight Center, who are also responsible for maintaining the FITS format. That's about as definitive a source as you can get.
