GZIPInputStream end-of-file sequence in BufferedReader

Posted on 2024-11-17 07:11:58

I use a Java BufferedReader object to read, line by line, a GZIPInputStream that points to a valid GZIP archive containing 1,000 lines of ASCII text in typical CSV format. The code looks like this:

BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(new FileInputStream(file))));

where file is the actual File object pointing to the archive.

I read through all the file by calling

int count = 0;
String line = null;

while ((line = reader.readLine()) != null)
{
    count++;
}

and the reader goes over the file as expected, but at the end it goes past line #1000 and reads one more line (i.e., count = 1001 after the loop ends).

Calling line.length() on the last line reports a large number (4,000+) of characters, all of which are non-printable (Character.getNumericValue() returns -1).

Actually, if I do line.getBytes() the resulting byte[] array has an equal number of NULL characters ('\0').

Does this seem like a bug in BufferedReader?

In any case, can anyone please suggest a workaround to bypass this behavior?

EDIT: More weird behavior: The first line read is prefixed by the filename, several NULL characters ('\0') and things like username and group name, then the actual text follows!

EDIT: I have created a very simple test class that reproduces the effect I described above, at least on my platform.

EDIT: Apparently a false alarm: the file I was getting was not plain GZIP but tarred GZIP, which explains it. No need for further testing. Thanks everyone!

冷月断魂刀 2024-11-24 07:11:58

I think I found your problem.

I tried to reproduce it with your source in the question, and got this output:

-------------------------------------
        Reading PLAIN file
-------------------------------------

Printable part of line 1:       This, is, line, number, 1

Line start (<= 25 characters): This__is__line__number__1

No NULL characters in line 1

Other information on line 1:
        Length: 25
        Bytes: 25
        First byte: 84

Printable part of line 10:      This, is, line, number, 10

Line start (<= 26 characters): This__is__line__number__10

No NULL characters in line 10

Other information on line 10:
        Length: 26
        Bytes: 26
        First byte: 84

File lines read: 10

-------------------------------------
        Reading GZIP file
-------------------------------------

Printable part of line 1:       This, is, line, number, 1

Line start (<= 25 characters): This__is__line__number__1

No NULL characters in line 1

Other information on line 1:
        Length: 25
        Bytes: 25
        First byte: 84

Printable part of line 10:      This, is, line, number, 10

Line start (<= 26 characters): This__is__line__number__10

No NULL characters in line 10

Other information on line 10:
        Length: 26
        Bytes: 26
        First byte: 84

File lines read: 10

-------------------------------------
        TOTAL READ
-------------------------------------

Plain: 10, GZIP: 10

I think this is not what you are seeing. Why? You are using a tar.gz file. This is the tar archive format with gzip compression applied on top. The GZIPInputStream undoes the gzip compression, but knows nothing about the tar archive format.

tar is normally used to pack multiple files together - in an uncompressed format, but together with some metadata, which is what you observe:

EDIT: More weird behavior: The first line read is prefixed by the filename,
several NULL characters ('\0') and things like username and group name, then
the actual text follows!
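One way to confirm this diagnosis is to look at the first 512 bytes that come out of the GZIPInputStream. The sketch below assumes a POSIX-style (ustar) archive, which stores the ASCII magic "ustar" at offset 257 of each header block; very old tar variants have no such magic and would not be detected. The class and method names (TarSniffer, looksLikeTar) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TarSniffer {
    static final int HEADER_SIZE = 512;  // tar headers occupy one 512-byte block
    static final int MAGIC_OFFSET = 257; // position of "ustar" in a POSIX tar header

    /** Returns true if the already-decompressed stream starts with a ustar header. */
    public static boolean looksLikeTar(InputStream in) throws IOException {
        byte[] header = new byte[HEADER_SIZE];
        int n = in.readNBytes(header, 0, HEADER_SIZE); // requires Java 9 or later
        if (n < HEADER_SIZE) {
            return false; // too short to contain a tar header
        }
        String magic = new String(header, MAGIC_OFFSET, 5, StandardCharsets.US_ASCII);
        return magic.equals("ustar");
    }

    public static void main(String[] args) throws IOException {
        // Synthetic header: all NULs except for the magic string.
        byte[] fakeHeader = new byte[HEADER_SIZE];
        byte[] magic = "ustar".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(magic, 0, fakeHeader, MAGIC_OFFSET, magic.length);
        System.out.println(looksLikeTar(new ByteArrayInputStream(fakeHeader))); // true

        byte[] csv = "This, is, line, number, 1\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeTar(new ByteArrayInputStream(csv)));        // false
    }
}
```

For a real file, the stream passed in would be new GZIPInputStream(new FileInputStream(file)). The check consumes the bytes it inspects, so the file has to be reopened before reading it for real.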

If you have a tar file, you need to use a tar decoder. The question "How do I extract a tar file in Java?" gives some links (like using the Tar task from Ant); there is also JTar.
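To illustrate what such a decoder has to do, here is a minimal JDK-only sketch, not a replacement for a real tar library. It assumes the first entry is a regular file whose size fits the plain octal size field (offset 124, 12 bytes), and it ignores checksums, long names and multiple entries. All names (MiniTarReader, buildSampleTar, and so on) are hypothetical. It also reproduces the symptom from the question: reading the raw tar bytes line by line yields one extra line made of NUL padding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class MiniTarReader {
    static final int BLOCK = 512; // tar works in 512-byte blocks

    /** Parses an octal header field that is padded with NULs and/or spaces. */
    public static long parseOctal(byte[] header, int offset, int length) {
        String field = new String(header, offset, length, StandardCharsets.US_ASCII).trim();
        return field.isEmpty() ? 0 : Long.parseLong(field, 8);
    }

    /** Reads the first header and returns the bytes of the entry that follows it. */
    public static byte[] readFirstEntry(InputStream in) throws IOException {
        byte[] header = new byte[BLOCK];
        if (in.readNBytes(header, 0, BLOCK) < BLOCK) {
            throw new EOFException("truncated tar header");
        }
        int size = (int) parseOctal(header, 124, 12); // size field: offset 124, 12 bytes
        byte[] data = new byte[size];
        if (in.readNBytes(data, 0, size) < size) {
            throw new EOFException("truncated tar entry");
        }
        return data;
    }

    /** Counts lines the same way as the loop in the question. */
    public static int countLines(byte[] bytes) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.US_ASCII));
        int count = 0;
        while (reader.readLine() != null) {
            count++;
        }
        return count;
    }

    /** Builds a tiny single-entry tar image in memory (illustration only, no checksum). */
    public static byte[] buildSampleTar(String name, byte[] body) {
        int dataBlocks = (body.length + BLOCK - 1) / BLOCK;
        byte[] tar = new byte[BLOCK * (1 + dataBlocks + 2)]; // header + data + two zero blocks
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(nameBytes, 0, tar, 0, nameBytes.length);
        byte[] sizeBytes = String.format("%011o", body.length).getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(sizeBytes, 0, tar, 124, sizeBytes.length);
        System.arraycopy(body, 0, tar, BLOCK, body.length);
        return tar;
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "a,1\nb,2\nc,3\n".getBytes(StandardCharsets.US_ASCII);
        byte[] tar = buildSampleTar("data.csv", body);
        // Raw tar bytes: header glued to line 1, plus a trailing line of NUL padding.
        System.out.println(countLines(tar));                                            // 4
        // After skipping the header and honoring the size field: just the content.
        System.out.println(countLines(readFirstEntry(new ByteArrayInputStream(tar)))); // 3
    }
}
```

With a real tar.gz, the InputStream handed to readFirstEntry would be the GZIPInputStream from the question. For anything beyond a quick experiment, a maintained tar implementation is the safer choice.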

If you want to send only one file, better use the gzip format directly (this was what I did in my test).
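As a rough sketch of that plain-gzip case, the following round trip compresses ten lines with GZIPOutputStream and reads them back with the same BufferedReader loop as in the question. The class name PlainGzipRoundTrip and the in-memory buffers are assumptions made to keep the example self-contained; a real program would use file streams instead.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PlainGzipRoundTrip {
    /** Writes n CSV-like lines into a gzip stream held in memory. */
    public static byte[] gzipLines(int n) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the writer also finishes and closes the GZIPOutputStream.
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(bytes), StandardCharsets.US_ASCII)) {
            for (int i = 1; i <= n; i++) {
                writer.write("This, is, line, number, " + i + "\n");
            }
        }
        return bytes.toByteArray();
    }

    /** Decompresses the data and counts lines with the loop from the question. */
    public static int countLines(byte[] gzipped) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)),
                StandardCharsets.US_ASCII))) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLines(gzipLines(10))); // prints 10
    }
}
```

Because no tar layer is involved, there is no header in front of the first line and no NUL padding after the last one, so the count matches the number of lines written.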

But there is no bug anywhere, apart from you expecting the gzip-stream to read the tar format.
