InputStreamReader buffering issue

Posted on 2024-08-28 08:28:49

I am reading data from a file that has, unfortunately, two types of character encoding.

There is a header and a body. The header is always in ASCII and defines the character set that the body is encoded in.

The header is not fixed length and must be run through a parser to determine its content/length.

The file may also be quite large, so I need to avoid bringing the entire content into memory.

So I started off with a single InputStream. I wrap it initially with an InputStreamReader with ASCII and decode the header and extract the character set for the body. All good.

Then I create a new InputStreamReader with the correct character set, drop it over the same InputStream and start trying to read the body.

Unfortunately it appears, and the javadoc confirms this, that InputStreamReader may choose to read ahead for efficiency purposes. So reading the header chews up some or all of the body.
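
The effect can be observed with a small experiment. This is only a sketch: the class name `ReadAheadDemo`, the method `bytesLeftAfterOneChar`, and the `charset=UTF-8` header line are made up for illustration, and how much the reader pulls in per read is an implementation detail of the JDK in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ReadAheadDemo {
    // Returns how many bytes are still available in the underlying stream
    // after reading exactly one character through an InputStreamReader.
    static int bytesLeftAfterOneChar(byte[] data) throws IOException {
        InputStream in = new ByteArrayInputStream(data);
        Reader header = new InputStreamReader(in, StandardCharsets.US_ASCII);
        header.read();          // ask for a single character only
        return in.available();  // without read-ahead this would be data.length - 1
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file layout: one ASCII header line, then the body
        byte[] data = "charset=UTF-8\nbody text".getBytes(StandardCharsets.US_ASCII);
        System.out.println("bytes left: " + bytesLeftAfterOneChar(data)
                + " of " + data.length);
    }
}
```

If fewer than `data.length - 1` bytes are left, the first reader has already taken bytes that belong to the body, and a second reader created on the same stream will never see them.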

Does anyone have any suggestions for working around this issue? Would creating a CharsetDecoder manually and feeding in one byte at a time be a good idea (possibly wrapped in a custom Reader implementation)?

Thanks in advance.

EDIT: My final solution was to write an InputStreamReader that has no buffering, to ensure I can parse the header without chewing part of the body. Although this is not terribly efficient, I wrap the raw InputStream with a BufferedInputStream so it won't be an issue.

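import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// An InputStreamReader that only consumes as many bytes as is necessary
// It does not do any read-ahead.
public class InputStreamReaderUnbuffered extends Reader
{
    private final CharsetDecoder charsetDecoder;
    private final InputStream inputStream;
    // Bytes of the character currently being assembled; 16 is assumed to be
    // more than any single character needs in common charsets
    private final ByteBuffer byteBuffer = ByteBuffer.allocate( 16 );
    // Room for one character, or for a surrogate pair
    private final CharBuffer charBuffer = CharBuffer.allocate( 2 );
    // Second half of a surrogate pair waiting to be returned, or -1
    private int pendingChar = -1;

    public InputStreamReaderUnbuffered( InputStream inputStream, Charset charset )
    {
        this.inputStream = inputStream;
        charsetDecoder = charset.newDecoder();
    }

    @Override
    public int read() throws IOException
    {
        if ( pendingChar != -1 )
        {
            int ch = pendingChar;
            pendingChar = -1;
            return ch;
        }

        while ( true )
        {
            int b = inputStream.read();

            if ( b == -1 )
            {
                // Leftover bytes mean the stream ended in the middle of a character
                if ( byteBuffer.position() > 0 )
                    throw new IOException( "Unexpected end of stream, character truncated" );

                return -1;
            }

            byteBuffer.put( (byte)b );
            byteBuffer.flip();
            charBuffer.clear();

            // endOfInput = false: an incomplete multi-byte sequence is left in
            // byteBuffer instead of being reported as malformed input
            CoderResult result = charsetDecoder.decode( byteBuffer, charBuffer, false );
            byteBuffer.compact();

            if ( result.isError() )
                result.throwException();

            charBuffer.flip();

            if ( charBuffer.hasRemaining() )
            {
                char ch = charBuffer.get();

                if ( charBuffer.hasRemaining() )
                    pendingChar = charBuffer.get();

                return ch;
            }
        }
    }

    @Override
    public int read( char[] cbuf, int off, int len ) throws IOException
    {
        for ( int i = 0; i < len; i++ )
        {
            int ch = read();

            if ( ch == -1 )
                return i == 0 ? -1 : i;

            // Honour the caller's offset into the target array
            cbuf[ off + i ] = (char)ch;
        }

        return len;
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}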


Comments (6)

梦途 2024-09-04 08:28:50

Why don't you use 2 InputStreams? One for reading the header and another for the body.

The second InputStream should skip the header bytes.
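
A minimal sketch of the second stream, assuming the source is a regular file and the header length in bytes is already known from the first pass (for an ASCII header that is the number of characters the parser consumed). The names `TwoStreamRead`, `skipFully`, and `openBody` are hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class TwoStreamRead {
    // Skips exactly n bytes; InputStream.skip may skip fewer than requested.
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped > 0) {
                n -= skipped;
            } else if (in.read() == -1) {
                throw new EOFException("stream ended inside the header");
            } else {
                n--; // skip made no progress, but one byte was read
            }
        }
    }

    // Opens the file a second time and positions the new stream after the header.
    static Reader openBody(Path file, long headerLength, Charset bodyCharset)
            throws IOException {
        InputStream in = Files.newInputStream(file);
        try {
            skipFully(in, headerLength);
        } catch (IOException e) {
            in.close();
            throw e;
        }
        // Closing the returned reader also closes the underlying stream
        return new InputStreamReader(in, bodyCharset);
    }
}
```

The first stream can be closed as soon as the header has been parsed, so any read-ahead it did no longer matters.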

小…红帽 2024-09-04 08:28:50

Here is the pseudo code.

  1. Use InputStream, but do not wrap a Reader around it.
  2. Read the bytes containing the header and store them in a ByteArrayOutputStream.
  3. Create a ByteArrayInputStream from the ByteArrayOutputStream and decode the header, this time wrapping the ByteArrayInputStream in a Reader with the ASCII charset.
  4. Compute the length of the non-ASCII input, and read that number of bytes into another ByteArrayOutputStream.
  5. Create another ByteArrayInputStream from the second ByteArrayOutputStream and wrap it in a Reader with the charset from the header.

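
The steps above can be sketched in Java roughly as follows. The header format is an assumption made only for this example (a single ASCII line `charset=NAME` terminated by `\n`), the body is assumed to run to the end of the stream, and `HeaderThenBody.decode` is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HeaderThenBody {
    static String decode(InputStream in) throws IOException {
        // Steps 1-2: read raw bytes (no Reader) up to the end of the header
        ByteArrayOutputStream headerBytes = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            headerBytes.write(b);
        }

        // Step 3: decode the buffered header bytes as ASCII
        BufferedReader headerReader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(headerBytes.toByteArray()),
                StandardCharsets.US_ASCII));
        String headerLine = headerReader.readLine();
        if (headerLine == null || !headerLine.startsWith("charset=")) {
            throw new IOException("missing charset header");
        }
        Charset bodyCharset =
                Charset.forName(headerLine.substring("charset=".length()).trim());

        // Step 4: copy the remaining bytes into a second buffer
        ByteArrayOutputStream bodyBytes = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            bodyBytes.write(chunk, 0, n);
        }

        // Step 5: decode the body with the charset named in the header
        Reader bodyReader = new InputStreamReader(
                new ByteArrayInputStream(bodyBytes.toByteArray()), bodyCharset);
        StringBuilder body = new StringBuilder();
        int c;
        while ((c = bodyReader.read()) != -1) {
            body.append((char) c);
        }
        return body.toString();
    }
}
```

Note that step 4 holds the whole body in memory, which conflicts with the question's requirement for very large files; wrapping the original stream directly in the step 5 Reader avoids that, since no Reader has touched it before that point.
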
ペ泪落弦音 2024-09-04 08:28:50

I suggest rereading the stream from the start with a new InputStreamReader. This assumes the stream supports InputStream.mark and reset; check markSupported(), or wrap the stream in a BufferedInputStream.
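
A rough sketch of that approach, under some loudly stated assumptions: the header is a single ASCII line `charset=NAME` ending in `\n` (invented for this example), `MarkResetRead.openBody` is a hypothetical name, and `readLimit` must be at least the header length plus however many bytes the first reader reads ahead, which depends on the JDK.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MarkResetRead {
    static Reader openBody(InputStream raw, int readLimit) throws IOException {
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        in.mark(readLimit);

        // First pass: a throwaway ASCII reader. It may read ahead into the body,
        // which is harmless because the stream is rewound afterwards.
        // It is deliberately not closed, since that would close the stream too.
        Reader headerReader = new InputStreamReader(in, StandardCharsets.US_ASCII);
        StringBuilder header = new StringBuilder();
        int c;
        while ((c = headerReader.read()) != -1 && c != '\n') {
            header.append((char) c);
        }
        String headerLine = header.toString();
        if (c == -1 || !headerLine.startsWith("charset=")) {
            throw new IOException("missing or incomplete charset header");
        }
        // ASCII is one byte per character; add one for the '\n' terminator
        long headerBytes = headerLine.length() + 1L;

        // Second pass: rewind, skip the header bytes, decode the rest properly
        in.reset();
        for (long i = 0; i < headerBytes; i++) {
            if (in.read() == -1) {
                throw new EOFException("stream ended inside the header");
            }
        }
        Charset bodyCharset =
                Charset.forName(headerLine.substring("charset=".length()).trim());
        return new InputStreamReader(in, bodyCharset);
    }
}
```

If the first reader pulls in more than `readLimit` bytes, `reset()` may fail with an IOException, so the limit should be chosen generously.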

素罗衫 2024-09-04 08:28:50

My first thought is to close the stream and reopen it, using InputStream#skip to skip past the header before giving the stream to the new InputStreamReader.

If you really, really don't want to reopen the file, you could use file descriptors to get more than one stream to the file, although you may have to use channels to have multiple positions within the file (you can't assume the position can be rewound with reset, since it may not be supported).
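
One way the channel idea might look, as a sketch only: it opens its own `FileChannel`, moves it past the header, and decodes from there. The header length in bytes is assumed to be known already, and `ChannelBodyReader.openBody` is a hypothetical name.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelBodyReader {
    static Reader openBody(Path file, long headerLength, Charset bodyCharset)
            throws IOException {
        FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
        try {
            // A file channel has an explicit position, so no skip loop is needed
            channel.position(headerLength);
        } catch (IOException e) {
            channel.close();
            throw e;
        }
        // -1 asks for the implementation's default buffer capacity;
        // closing the returned reader closes the channel as well
        return Channels.newReader(channel, bodyCharset.newDecoder(), -1);
    }
}
```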

反目相谮 2024-09-04 08:28:50

It's even easier:

As you said, your header is always in ASCII. So read the header directly from the InputStream, and when you're done with it, create the Reader with the correct encoding and read the body from it.

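private Reader reader;
private InputStream stream;
private String encoding;          // charset name, filled in while parsing the header
private boolean headerFullyRead;  // set once the parser has seen the end of the header

public void read() throws IOException {
    int c;
    while (!headerFullyRead && (c = stream.read()) != -1) {
        // The header is ASCII, so each byte is one character:
        // feed c to the header parser, which sets encoding and headerFullyRead
    }
    if (!headerFullyRead) {
        throw new EOFException("Stream ended before the header was complete");
    }
    // No Reader has touched the stream yet, so nothing past the header was consumed
    reader = new InputStreamReader(stream, encoding);
    while ((c = reader.read()) != -1) {
        // Handle rest of file
    }
}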
扮仙女 2024-09-04 08:28:50

If you wrap the InputStream and limit all reads to just 1 byte at a time, it seems to disable the buffering inside of InputStreamReader.

This way we don't have to rewrite the InputStreamReader logic.

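import java.io.IOException;
import java.io.InputStream;

public class OneByteReadInputStream extends InputStream
{
    private final InputStream inputStream;

    public OneByteReadInputStream(InputStream inputStream)
    {
        this.inputStream = inputStream;
    }

    @Override
    public int read() throws IOException
    {
        return inputStream.read();
    }

    // Hand back at most one byte per call, however many were requested
    @Override
    public int read(byte[] b, int off, int len) throws IOException
    {
        return super.read(b, off, Math.min(len, 1));
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}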

To construct:

new InputStreamReader(new OneByteReadInputStream(inputStream));