InputStreamReader buffering issue
Unfortunately, I am reading data from a file that uses two character encodings.
There is a header and a body. The header is always in ASCII and defines the character set that the body is encoded in.
The header is not fixed length and must be run through a parser to determine its content/length.
The file may also be quite large, so I need to avoid bringing the entire content into memory.
So I started off with a single InputStream. I wrap it initially with an InputStreamReader using ASCII, decode the header, and extract the character set for the body. All good.
Then I create a new InputStreamReader with the correct character set, drop it over the same InputStream, and start trying to read the body.
Unfortunately it appears, and the javadoc confirms this, that InputStreamReader may choose to read ahead for efficiency purposes. So the reading of the header chews some/all of the body.
Does anyone have any suggestions for working round this issue? Would creating a CharsetDecoder manually and feeding in one byte at a time be a good idea (possibly wrapped in a custom Reader implementation)?
Thanks in advance.
EDIT: My final solution was to write an InputStreamReader that has no buffering, to ensure I can parse the header without chewing part of the body. Although this is not terribly efficient, I wrap the raw InputStream with a BufferedInputStream so it won't be an issue.
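import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// An InputStreamReader that only consumes as many bytes as is necessary.
// It does not do any read-ahead.
public class InputStreamReaderUnbuffered extends Reader
{
    private final CharsetDecoder charsetDecoder;
    private final InputStream inputStream;
    // Holds the bytes of the character currently being decoded.
    private final ByteBuffer byteBuffer = ByteBuffer.allocate( 16 );
    // Room for a surrogate pair; kept in "read" mode between calls.
    private final CharBuffer charBuffer = CharBuffer.allocate( 2 );

    public InputStreamReaderUnbuffered( InputStream inputStream, Charset charset )
    {
        this.inputStream = inputStream;
        charsetDecoder = charset.newDecoder();
        charBuffer.flip(); // start with no decoded characters pending
    }

    @Override
    public int read() throws IOException
    {
        // A single byte can complete a surrogate pair; hand out the second
        // half here. No extra bytes were consumed to produce it.
        if ( charBuffer.hasRemaining() )
            return charBuffer.get();
        while ( true )
        {
            int b = inputStream.read();
            if ( b == -1 )
            {
                if ( byteBuffer.position() > 0 )
                    throw new IOException( "Unexpected end of stream, character truncated" );
                return -1;
            }
            byteBuffer.put( (byte)b );
            byteBuffer.flip();
            charBuffer.clear();
            // Incremental decode (endOfInput = false): an incomplete multi-byte
            // sequence is left in byteBuffer instead of being reported as malformed.
            CoderResult result = charsetDecoder.decode( byteBuffer, charBuffer, false );
            byteBuffer.compact();
            charBuffer.flip();
            if ( result.isError() )
                result.throwException();
            if ( charBuffer.hasRemaining() )
                return charBuffer.get();
        }
    }

    @Override
    public int read( char[] cbuf, int off, int len ) throws IOException
    {
        for ( int i = 0; i < len; i++ )
        {
            int ch = read();
            if ( ch == -1 )
                return i == 0 ? -1 : i;
            cbuf[ off + i ] = (char)ch; // honour the caller's offset
        }
        return len;
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}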
Comments (6)
Why don't you use two InputStreams? One for reading the header and another for the body. The second InputStream should skip the header bytes.
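A minimal sketch of the two-stream idea, assuming the data lives in a regular file that can be opened twice and that the header is a single ASCII line ending in "\n". The class name TwoStreams, both method names, and the header layout are made up for illustration; they are not from the thread.

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
final class TwoStreams {
    // First stream: read the ASCII header line. Read-ahead is harmless here
    // because this stream is thrown away afterwards.
    static String readHeader(Path file) throws IOException {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.US_ASCII))) {
            String line = r.readLine();
            if (line == null) throw new EOFException("no header found");
            return line;
        }
    }

    // Second stream: skip the header bytes, then decode the rest as the body.
    static Reader openBody(Path file, long headerBytes, Charset bodyCharset) throws IOException {
        InputStream in = Files.newInputStream(file);
        long remaining = headerBytes;
        while (remaining > 0) {
            long n = in.skip(remaining); // skip() may skip fewer bytes than asked
            if (n <= 0) {
                in.close();
                throw new EOFException("file is shorter than the header");
            }
            remaining -= n;
        }
        return new InputStreamReader(in, bodyCharset);
    }
}
```

Because the header is ASCII, its length in bytes equals its length in characters, so under the assumed layout the caller would pass header.length() + 1 (the extra byte being the "\n") as headerBytes.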
Here is the pseudo code.
1. Use an InputStream, but do not wrap a Reader around it.
2. Read the header bytes and store them into a ByteArrayOutputStream.
3. Create a ByteArrayInputStream from the ByteArrayOutputStream and decode the header, this time wrapping the ByteArrayInputStream in a Reader with the ASCII charset.
4. Determine the length of the body input, and read that number of bytes into another ByteArrayOutputStream.
5. Create a ByteArrayInputStream from the second ByteArrayOutputStream and wrap it in a Reader with the charset from the header.
I suggest rereading the stream from the start with a new InputStreamReader. Perhaps assume that InputStream.mark is supported.
My first thought is to close the stream and reopen it, using InputStream#skip to skip past the header before giving the stream to the new InputStreamReader.
If you really, really don't want to reopen the file, you could use file descriptors to get more than one stream to the file, although you may have to use channels to have multiple positions within the file (since you can't assume you can reset the position with reset, it may not be supported).
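A related variation, sketched under assumptions: instead of opening several streams, a single FileInputStream can be moved back to the end of the header through its channel, since changing the position of the channel returned by getChannel() also changes the stream's file position. The RepositionViaChannel class and the "charset=NAME\n" header layout are invented for the example, and this only applies when the source really is a FileInputStream.

```java
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class RepositionViaChannel {
    // Expects a stream positioned at the start of the file.
    static Reader openBody(FileInputStream fis) throws IOException {
        // This reader may read well past the header; that is fine because
        // the file position is set explicitly afterwards.
        Reader headerReader = new InputStreamReader(fis, StandardCharsets.US_ASCII);
        StringBuilder header = new StringBuilder();
        int c;
        while ((c = headerReader.read()) != -1 && c != '\n') {
            header.append((char) c);
        }
        if (c == -1) throw new EOFException("no header terminator found");
        long headerBytes = header.length() + 1; // ASCII: one byte per char, plus '\n'

        // Move the underlying file position to the first byte of the body.
        fis.getChannel().position(headerBytes);

        // Assumed header layout: "charset=NAME"
        String name = header.substring("charset=".length()).trim();
        return new InputStreamReader(fis, Charset.forName(name));
    }
}
```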
It's even easier:
As you said, your header is always in ASCII. So read the header directly from the InputStream, and when you're done with it, create the Reader with the correct encoding and read from it.
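A minimal sketch of reading the header as raw bytes, assuming (purely for illustration) that the header is a single line of the form "charset=NAME" ending in "\n". DirectHeaderRead and readHeaderLine are hypothetical names; a real header parser would take the place of the loop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class DirectHeaderRead {
    // Reads bytes one at a time up to and including the first '\n' and
    // returns them decoded as ASCII, without the line terminator.
    // No byte after the '\n' is consumed.
    static String readHeaderLine(InputStream in) throws IOException {
        ByteArrayOutputStream header = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            header.write(b);
        }
        return new String(header.toByteArray(), StandardCharsets.US_ASCII).trim();
    }

    public static void main(String[] args) throws IOException {
        byte[] file = "charset=UTF-8\nh\u00e9llo".getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(file);

        String header = readHeaderLine(in);
        Charset bodyCharset = Charset.forName(header.substring("charset=".length()));

        // Only now is a Reader attached, so nothing has been read ahead.
        Reader body = new InputStreamReader(in, bodyCharset);
        StringBuilder sb = new StringBuilder();
        for (int c; (c = body.read()) != -1; ) sb.append((char) c);
        System.out.println(sb);
    }
}
```

Single-byte reads on a raw file stream are slow, so wrapping the source in a BufferedInputStream first (as the question's edit does) is probably worthwhile; buffering at the byte level does not lose data because both phases read from the same buffered stream.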
If you wrap the InputStream and limit all reads to just 1 byte at a time, it seems to disable the buffering inside of InputStreamReader.
This way we don't have to rewrite the InputStreamReader logic.
To construct:
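The snippet that originally followed is not included here. As a rough sketch of the idea (not the answerer's code), a wrapper could cap every bulk read at one byte and report nothing as available. OneByteReadInputStream is a hypothetical name, and the approach leans on how the JDK's InputStreamReader currently happens to behave, which the javadoc does not guarantee.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: hands out at most one byte per bulk read.
final class OneByteReadInputStream extends InputStream {
    private final InputStream delegate;

    OneByteReadInputStream(InputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public int read() throws IOException {
        return delegate.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) return 0;
        // Never hand out more than one byte, whatever the caller asked for.
        return delegate.read(b, off, 1);
    }

    @Override
    public int available() {
        // Report nothing as ready so the reader has no reason to fetch more.
        return 0;
    }

    // close() is deliberately not overridden: closing the header reader
    // then leaves the underlying stream open for the body reader.
}
```

The header reader would then be built as new InputStreamReader(new OneByteReadInputStream(in), StandardCharsets.US_ASCII) and read one character at a time, after which a second InputStreamReader with the body charset can be placed directly over in.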