How can I cleanly read a file in Java that contains both ASCII and another encoding?

Posted 2024-08-03 06:35:51, 418 words, 6 views, 0 comments

I have a custom image file whose first block of data is ASCII metadata. I need to read this ASCII metadata part of the file with Java and know where it ends and where the 'raw image data', which is in another encoding, begins.

I was thinking of reading the whole file into a byte[], then reading bytes out of it and converting them to ASCII until I hit the end of the ASCII metadata section, at which point I would store that metadata. Then I could rearrange the raw binary data in a different order as-is (no decoding necessary). However, the only way I can think of to do this is to read the ASCII part byte by byte, look for newlines, concatenate everything before each newline, and check whether that line is the tag that marks the beginning of the raw image data. There must be a better way: read the ASCII part of the file with readLine(), then immediately start on the raw image binary, without needing to reopen the file in a new reader and seek to the line where the first reader found the 'begin image' tag.

Any ideas?

Comments (2)

请别遗忘我 2024-08-10 06:35:51
  • Open the file as a FileInputStream (wrapped in a BufferedInputStream)
  • Create a ByteArrayOutputStream
  • Read the input stream byte by byte, looking for your "begin image" tag using a string searching algorithm. Casting individual bytes to char treats them as ASCII implicitly
  • At the same time, write each byte you've looked at into the ByteArrayOutputStream
  • Once you've found the tag, you can start reading the image data from the same input stream
  • Get the byte array from the ByteArrayOutputStream and convert it to a String using new String(array, "US-ASCII");

It might be possible to do the string searching easily by using a Scanner on the input stream, but you have to be careful which pattern you use to make sure it will find the tag without starting to read the image data (since you want to read that yourself from the underlying input stream you're keeping a separate reference to).

Edit: Unfortunately, it looks like Scanner implicitly uses a buffer as well, so the only option left is to implement the string search "manually".
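The manual search described in the steps above could look roughly like the sketch below. It is only an illustration: the class name HeaderReader, the method readHeader, and the tag "BEGIN\n" are made-up names, and the demo reads from an in-memory stream where real code would use a BufferedInputStream around a FileInputStream. For the search itself it uses a Knuth-Morris-Pratt failure table, which is one possible choice of string searching algorithm, so that each byte is read exactly once and nothing beyond the tag is consumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class HeaderReader {

    /**
     * Reads bytes up to and including the first occurrence of tag and returns
     * them decoded as US-ASCII. Afterwards the stream is positioned at the
     * first byte following the tag, i.e. the start of the binary data.
     */
    static String readHeader(InputStream in, String tag) throws IOException {
        byte[] pattern = tag.getBytes(StandardCharsets.US_ASCII);
        if (pattern.length == 0) {
            throw new IllegalArgumentException("tag must not be empty");
        }
        int[] fail = failureTable(pattern);
        ByteArrayOutputStream header = new ByteArrayOutputStream();
        int matched = 0; // number of tag bytes matched so far
        int b;
        while ((b = in.read()) != -1) {
            header.write(b);
            // On a mismatch, fall back to the longest shorter prefix that still matches.
            while (matched > 0 && pattern[matched] != (byte) b) {
                matched = fail[matched - 1];
            }
            if (pattern[matched] == (byte) b) {
                matched++;
            }
            if (matched == pattern.length) {
                return new String(header.toByteArray(), StandardCharsets.US_ASCII);
            }
        }
        throw new EOFException("Tag not found before end of stream");
    }

    /** Standard KMP failure table: fail[i] is the length of the longest proper
     *  prefix of p[0..i] that is also a suffix of it. */
    static int[] failureTable(byte[] p) {
        int[] fail = new int[p.length];
        for (int i = 1, k = 0; i < p.length; i++) {
            while (k > 0 && p[i] != p[k]) {
                k = fail[k - 1];
            }
            if (p[i] == p[k]) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file content: two ASCII lines, then two raw bytes.
        byte[] file = {'w', '=', '2', '\n', 'B', 'E', 'G', 'I', 'N', '\n', 0x00, (byte) 0xFF};
        InputStream in = new ByteArrayInputStream(file);
        String header = readHeader(in, "BEGIN\n");
        System.out.print(header);
        System.out.println("first image byte: " + in.read());
    }
}
```

Wrapping the file stream in a BufferedInputStream is safe here because the image data is read from that same buffered stream afterwards; the problem with Scanner (and with readLine() on a separate reader) is that a second, hidden buffer swallows bytes you still need.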

撩动你心 2024-08-10 06:35:51

Not sure if you can decide the format yourself, but anyway:

An alternative strategy is to write an integer value at the very start of the file that holds the number of bytes used by the ASCII section.
Then you can read exactly that number of bytes, and you can also easily skip the ASCII and go directly to the binary blob.

This strategy is efficient, but you cannot change the number of ASCII characters without also updating the count.
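As a rough sketch of the length-prefix idea, assuming you control the format: the writer stores the ASCII byte count as a 4-byte big-endian int (what DataOutputStream.writeInt produces), and the reader uses it to pull out exactly the metadata. The class name LengthPrefixedHeader and the limit MAX_HEADER_BYTES are invented for this example; pick a limit that suits your files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedHeader {

    // Hypothetical upper bound so a corrupt count cannot trigger a huge allocation.
    static final int MAX_HEADER_BYTES = 1 << 20;

    /** Writes: 4-byte length, ASCII metadata, then the raw image bytes. */
    static void write(DataOutputStream out, String metadata, byte[] image) throws IOException {
        byte[] ascii = metadata.getBytes(StandardCharsets.US_ASCII);
        out.writeInt(ascii.length);
        out.write(ascii);
        out.write(image);
    }

    /** Reads the metadata; the stream is left at the first byte of the image data. */
    static String readMetadata(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0 || length > MAX_HEADER_BYTES) {
            throw new IOException("Implausible metadata length: " + length);
        }
        byte[] ascii = new byte[length];
        in.readFully(ascii); // throws EOFException if the file is shorter than claimed
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(new DataOutputStream(buffer), "width=2\nheight=1\n", new byte[] {0x00, (byte) 0xFF});

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.print(readMetadata(in));
        System.out.println("first image byte: " + in.read());
    }
}
```

Because the writer computes the count from the encoded bytes at write time, the count and the text cannot drift apart as long as every change to the metadata goes through the writer.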

By the way, make sure to sanitize your input: don't try to read more data than the file contains or allocate more memory than the machine can provide.

Personally, I would also use the first couple of characters of the file to hold some magic code, so that you have a minimal check that the file really uses your data format, and which version of the format it is.
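A magic code and version check might be sketched as follows. Everything specific here is an assumption made for illustration: the four bytes "MYIM", the single version byte after them, the class name FormatHeader, and the rule of rejecting versions newer than the reader supports.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

final class FormatHeader {

    // Hypothetical magic code and version; choose your own for a real format.
    static final byte[] MAGIC = {'M', 'Y', 'I', 'M'};
    static final int SUPPORTED_VERSION = 1;

    /** Verifies the magic code and returns the format version stored after it. */
    static int checkHeader(DataInputStream in) throws IOException {
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic); // EOFException if the file is too short
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("Bad magic code: not a file in this format");
        }
        int version = in.readUnsignedByte();
        if (version > SUPPORTED_VERSION) {
            throw new IOException("Unsupported format version: " + version);
        }
        return version;
    }

    public static void main(String[] args) throws IOException {
        byte[] file = {'M', 'Y', 'I', 'M', 1, 'w', '=', '2'};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        System.out.println("format version: " + checkHeader(in));
    }
}
```

The check runs before any length or metadata is read, so a file in some other format fails fast instead of being misread as a byte count.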
