GZIPInputStream end-of-file sequence in BufferedReader
I use a Java BufferedReader object to read, line by line, a GZIPInputStream that points to a valid GZIP archive containing 1,000 lines of ASCII text in typical CSV format. The code looks like this:
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(file))));
where file is the actual File object pointing to the archive.
I read through the whole file by calling
int count = 0;
String line = null;
while ((line = reader.readLine()) != null)
{
count++;
}
and the reader goes over the file as expected, but at the end it runs past line #1000 and reads one more line (i.e., count = 1001 after the loop ends).
Calling line.length() on the last line reports a large number (4,000+) of characters, all of which are non-printable (Character.getNumericValue() returns -1).
In fact, if I call line.getBytes(), the resulting byte[] array contains the same number of NUL bytes ('\0').
Does this seem like a bug in BufferedReader?
In any case, can anyone please suggest a workaround to bypass this behavior?
EDIT: More weird behavior: the first line read is prefixed by the filename, several NUL characters ('\0') and things like the username and group name, and only then does the actual text follow!
EDIT: I have created a very simple test class that reproduces the effect I described above, at least on my platform.
EDIT: Apparently a false alarm: the file I was getting was not plain GZIP but tarred GZIP, which explains it. No need for further testing. Thanks everyone!
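For anyone who runs into the same symptoms, a quick check along the following lines can tell whether the decompressed data is really a tar archive. This is only a minimal sketch: it assumes a POSIX ustar archive, whose 512-byte header carries the magic string "ustar" at byte offset 257, and the class and method names (TarSniffer, looksLikeTar) are made up for illustration. Older pre-POSIX tar archives lack this magic and would not be detected.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the JDK or of the original post.
public class TarSniffer {

    // Returns true if the gunzipped stream starts with a ustar header block.
    static boolean looksLikeTar(InputStream gzipped) throws IOException {
        try (InputStream in = new GZIPInputStream(gzipped)) {
            byte[] header = new byte[512];
            int filled = 0;
            // read() may return fewer bytes than requested, so loop until
            // the block is full or the stream ends.
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            if (filled < 262) {
                return false; // too short to contain the magic field
            }
            // Assumption: POSIX ustar layout, magic "ustar" at offset 257.
            String magic = new String(header, 257, 5, StandardCharsets.US_ASCII);
            return magic.equals("ustar");
        }
    }
}
```

Calling TarSniffer.looksLikeTar(new FileInputStream(file)) before building the BufferedReader would flag a .tar.gz that has been mistaken for a plain .gz. The header block is also where the filename, user name and group name seen on the first line come from, and tar pads the end of the archive with zero-filled blocks, which matches the trailing run of NUL characters.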
Comments (1)
I think I found your problem.
I tried to reproduce it with your source in the question, and got this output:
I think this is not what you are seeing. Why? You are using a tar.gz file. This is the tar archive format with gzip compression applied on top. GZIPInputStream undoes the gzip compression, but knows nothing about the tar archive format. tar is normally used to pack multiple files together, in an uncompressed format but along with some metadata, which is what you observe:
If you have a tar file, you need to use a tar decoder. "How do I extract a tar file in Java?" gives some links (like using the Tar task from Ant), and there is also JTar. If you want to send only one file, better use the gzip format directly (this is what I did in my test). But there is no bug anywhere, apart from you expecting the gzip stream to read the tar format.
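To illustrate what a tar decoder has to do on top of GZIPInputStream, here is a minimal sketch using only the JDK. It assumes the archive's first entry is a regular file that fits in memory, and that its size is stored as octal ASCII digits in the 12-byte field at offset 124 of the 512-byte header (the usual tar layout; GNU extensions such as long names or very large files are not handled). The class and method names are hypothetical. For real archives, one of the libraries mentioned above is the safer choice.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of any library.
public class FirstTarEntryLines {

    private static final int BLOCK_SIZE = 512; // tar is organised in 512-byte blocks

    // Returns the text lines of the first entry in a .tar.gz stream.
    static List<String> read(InputStream tarGz) throws IOException {
        List<String> lines = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(tarGz))) {
            // Skip over the metadata block (file name, owner, group, size, ...).
            byte[] header = new byte[BLOCK_SIZE];
            in.readFully(header); // throws EOFException if there is no full block

            // Size field: octal digits padded with NUL or space; trim() strips both.
            String sizeField = new String(header, 124, 12, StandardCharsets.US_ASCII).trim();
            if (sizeField.isEmpty()) {
                return lines; // zero-filled block: the archive has no entries
            }
            int size = Integer.parseInt(sizeField, 8);

            // Read exactly the entry's bytes, so the zero padding that follows
            // never reaches readLine().
            byte[] content = new byte[size];
            in.readFully(content);

            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new ByteArrayInputStream(content), StandardCharsets.US_ASCII));
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

With the file from the question, FirstTarEntryLines.read(new FileInputStream(file)).size() should then count only the real CSV lines, since the header block and the trailing padding are never handed to the reader.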