InputStreamReader buffering issue

Posted on 2024-08-28 08:28:49

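I am reading data from a file that, unfortunately, uses two character encodings.

There is a header and a body. The header is always in ASCII and defines the character set that the body is encoded in.

The header is not fixed length and must be run through a parser to determine its content and length.

The file may also be quite large, so I need to avoid bringing the entire content into memory.

So I started off with a single InputStream. I initially wrap it in an InputStreamReader using ASCII, decode the header, and extract the character set for the body. All good.

Then I create a new InputStreamReader with the correct character set, drop it over the same InputStream, and start trying to read the body.

Unfortunately it appears, and the Javadoc confirms this, that InputStreamReader may choose to read ahead for efficiency. So reading the header chews up some or all of the body.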
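To make the read-ahead visible, here is a minimal sketch. It assumes an in-memory file and a made-up one-line header of the form `charset=NAME` (the real header format is not given in the question). It reads only the header characters through an InputStreamReader and then asks the underlying stream how many bytes it still holds.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ReadAheadDemo {
    /** Reads only the header chars through a Reader, then reports how many raw bytes are left. */
    static int bytesLeftAfterHeader(String header, String body) throws IOException {
        byte[] file = (header + body).getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(file);
        Reader headerReader = new InputStreamReader(in, StandardCharsets.US_ASCII);
        for (int i = 0; i < header.length(); i++) {
            headerReader.read(); // hand out exactly the header characters
        }
        // If the reader had pulled only what it returned, the whole body would still be here.
        return in.available();
    }

    public static void main(String[] args) throws IOException {
        String header = "charset=UTF-8\n"; // hypothetical header format
        String body = "body text";         // ASCII, so chars == bytes
        System.out.println("body bytes: " + body.length()
                + ", still in stream: " + bytesLeftAfterHeader(header, body));
    }
}
```

If the second number is smaller than the first, the header reader has already swallowed part of the body, which is exactly the problem described above.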

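Does anyone have any suggestions for working around this issue? Would creating a CharsetDecoder manually and feeding in one byte at a time be a good idea (possibly wrapped in a custom Reader implementation)?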
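For what it's worth, feeding a CharsetDecoder one byte at a time could look roughly like the sketch below. The class and method names are made up for illustration, and UTF-8 is only an example charset. The key point is the three-argument `decode` with `endOfInput` set to `false`, which leaves an incomplete multi-byte sequence in the input buffer instead of reporting it as malformed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

class ByteAtATimeDecoding {
    /** Feeds bytes to a decoder one at a time and returns the decoded text. */
    static String decode(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer pending = ByteBuffer.allocate(8); // bytes of a not yet complete character
        CharBuffer out = CharBuffer.allocate(2);     // room for a surrogate pair
        StringBuilder text = new StringBuilder();
        for (byte b : bytes) {
            pending.put(b);
            pending.flip();
            // endOfInput = false: an incomplete sequence stays in `pending`
            CoderResult result = decoder.decode(pending, out, false);
            if (result.isError()) {
                result.throwException();
            }
            pending.compact(); // keep undecoded bytes for the next round
            out.flip();
            text.append(out);
            out.clear();
        }
        if (pending.position() > 0) {
            // input ended in the middle of a character
            throw new MalformedInputException(pending.position());
        }
        return text.toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] utf8 = "h\u00e9llo \u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8));
    }
}
```

Because the caller decides when the next byte is pulled, nothing beyond the current character is ever consumed from the source.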

Thanks in advance.

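EDIT: My final solution was to write an InputStreamReader that has no buffering, to ensure I can parse the header without chewing up part of the body. Although this is not terribly efficient, I wrap the raw InputStream in a BufferedInputStream, so it won't be an issue.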

import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// An InputStreamReader that only consumes as many bytes as is necessary.
// It does not do any read-ahead.
public class InputStreamReaderUnbuffered extends Reader
{
    private final CharsetDecoder charsetDecoder;
    private final InputStream inputStream;
    // Bytes of the character currently being decoded (multi-byte charsets need more than one)
    private final ByteBuffer byteBuffer = ByteBuffer.allocate( 8 );
    // Decoded output; two chars so that a surrogate pair fits
    private final CharBuffer charBuffer = CharBuffer.allocate( 2 );

    public InputStreamReaderUnbuffered( InputStream inputStream, Charset charset )
    {
        this.inputStream = inputStream;
        charsetDecoder = charset.newDecoder();
        charBuffer.limit( 0 ); // nothing decoded yet
    }

    @Override
    public int read() throws IOException
    {
        // Second half of a surrogate pair left over from the previous call
        if ( charBuffer.hasRemaining() )
            return charBuffer.get();

        while ( true )
        {
            int b = inputStream.read();

            if ( b == -1 )
            {
                if ( byteBuffer.position() > 0 )
                    throw new IOException( "Unexpected end of stream, character truncated" );

                return -1;
            }

            byteBuffer.put( (byte)b );
            byteBuffer.flip();
            charBuffer.clear();

            // endOfInput = false keeps the decoder state between calls and leaves an
            // incomplete multi-byte sequence in byteBuffer instead of failing on it
            CoderResult result = charsetDecoder.decode( byteBuffer, charBuffer, false );

            if ( result.isError() )
                result.throwException();

            byteBuffer.compact();
            charBuffer.flip();

            if ( charBuffer.hasRemaining() )
                return charBuffer.get();
        }
    }

    @Override
    public int read( char[] cbuf, int off, int len ) throws IOException
    {
        for ( int i = 0; i < len; i++ )
        {
            int ch = read();

            if ( ch == -1 )
                return i == 0 ? -1 : i;

            cbuf[ off + i ] = (char)ch;
        }

        return len;
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}

梦途 2024-09-04 08:28:50

Why not use 2 InputStreams? One for reading the header and another for the body.

The second InputStream should skip the header bytes.
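A rough sketch of the two-stream idea, assuming the data lives in a regular file that can be opened twice and that the header is a single line of the hypothetical form `charset=NAME` ending in `\n`. The names `TwoStreams`, `readBody` and `writeSample` are made up for this example, and the body is collected into a String only to keep the demo short.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TwoStreams {
    static String readBody(Path file) throws IOException {
        StringBuilder header = new StringBuilder();
        long headerBytes = 0;
        // First stream: read the ASCII header byte by byte and count its length
        try (InputStream first = Files.newInputStream(file)) {
            int b;
            while ((b = first.read()) != -1) {
                headerBytes++;
                if (b == '\n') break;
                header.append((char) b); // ASCII: one byte is one char
            }
        }
        Charset charset = Charset.forName(header.substring("charset=".length()).trim());

        // Second stream: skip the header bytes, then decode the rest
        try (InputStream second = Files.newInputStream(file)) {
            long toSkip = headerBytes;
            while (toSkip > 0) {
                long skipped = second.skip(toSkip); // skip may skip fewer bytes than asked
                if (skipped <= 0) throw new IOException("Could not skip header");
                toSkip -= skipped;
            }
            Reader body = new InputStreamReader(second, charset);
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = body.read()) != -1) text.append((char) c);
            return text.toString();
        }
    }

    /** Writes a small sample file: ASCII header line followed by the encoded body. */
    static Path writeSample(String charsetName, String body) throws IOException {
        Path file = Files.createTempFile("two-streams", ".txt");
        try (OutputStream out = Files.newOutputStream(file)) {
            out.write(("charset=" + charsetName + "\n").getBytes(StandardCharsets.US_ASCII));
            out.write(body.getBytes(Charset.forName(charsetName)));
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path file = writeSample("UTF-16BE", "caf\u00e9");
        System.out.println(readBody(file));
        Files.delete(file);
    }
}
```

This only works when the source can be opened more than once, so it does not help with a socket or a pipe.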

小…红帽 2024-09-04 08:28:50

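Here is the pseudo code.

1. Use an InputStream, but do not wrap a Reader around it.
2. Read the bytes that contain the header and store them in a ByteArrayOutputStream.
3. Create a ByteArrayInputStream from the ByteArrayOutputStream and decode the header, this time wrapping the ByteArrayInputStream in a Reader with the ASCII charset.
4. Compute the length of the non-ASCII input and read that number of bytes into another ByteArrayOutputStream.
5. Create another ByteArrayInputStream from the second ByteArrayOutputStream and wrap it in a Reader with the charset from the header.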
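The steps above might translate into something like the following sketch. It assumes a one-line header of the hypothetical form `charset=NAME` terminated by `\n`, and it treats "the length of the non-ASCII input" as everything up to the end of the stream. All names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BufferedSections {
    static String readBody(InputStream in) throws IOException {
        // Steps 1-2: no Reader around the stream; copy the raw header bytes aside
        ByteArrayOutputStream headerBytes = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            headerBytes.write(b);
        }
        // Step 3: decode the captured header as ASCII
        String header = readAll(new InputStreamReader(
                new ByteArrayInputStream(headerBytes.toByteArray()), StandardCharsets.US_ASCII));
        Charset charset = Charset.forName(header.substring("charset=".length()).trim());
        // Step 4: copy the body bytes (here: everything up to the end of the stream)
        ByteArrayOutputStream bodyBytes = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            bodyBytes.write(chunk, 0, n);
        }
        // Step 5: decode the body with the charset named in the header
        return readAll(new InputStreamReader(
                new ByteArrayInputStream(bodyBytes.toByteArray()), charset));
    }

    private static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            text.append((char) c);
        }
        return text.toString();
    }

    /** Builds a small in-memory sample: ASCII header line followed by the encoded body. */
    static byte[] sample(String charsetName, String body) {
        byte[] header = ("charset=" + charsetName + "\n").getBytes(StandardCharsets.US_ASCII);
        byte[] payload = body.getBytes(Charset.forName(charsetName));
        byte[] file = new byte[header.length + payload.length];
        System.arraycopy(header, 0, file, 0, header.length);
        System.arraycopy(payload, 0, file, header.length, payload.length);
        return file;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readBody(new ByteArrayInputStream(sample("UTF-16LE", "caf\u00e9"))));
    }
}
```

Note that step 4 holds the whole body in memory, which sits badly with the "file may be quite large" constraint in the question.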

ペ泪落弦音 2024-09-04 08:28:50

I suggest rereading the stream from the start with a new InputStreamReader, perhaps on the assumption that InputStream.mark is supported.
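A possible shape for the mark/reset variant is sketched below. The 64 KiB read limit is an assumption: it has to cover the header plus whatever the first reader reads ahead, and the right value depends on the real header size. The `charset=NAME` header line and all names are hypothetical as well.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MarkResetReread {
    static String readBody(InputStream raw) throws IOException {
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        // Assumed bound: header size plus the first reader's read-ahead
        in.mark(64 * 1024);
        BufferedReader headerReader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.US_ASCII));
        String header = headerReader.readLine();
        if (header == null) throw new IOException("Missing header");
        Charset charset = Charset.forName(header.substring("charset=".length()).trim());

        in.reset(); // back to the start; the first reader's read-ahead no longer matters
        int headerBytes = header.length() + 1; // ASCII: one byte per char, plus the '\n'
        for (int i = 0; i < headerBytes; i++) {
            in.read(); // step over the header again, this time as raw bytes
        }
        Reader body = new InputStreamReader(in, charset);
        StringBuilder text = new StringBuilder();
        int c;
        while ((c = body.read()) != -1) text.append((char) c);
        return text.toString();
    }

    /** Builds a small in-memory sample: ASCII header line followed by the encoded body. */
    static byte[] sample(String charsetName, String body) {
        byte[] header = ("charset=" + charsetName + "\n").getBytes(StandardCharsets.US_ASCII);
        byte[] payload = body.getBytes(Charset.forName(charsetName));
        byte[] file = new byte[header.length + payload.length];
        System.arraycopy(header, 0, file, 0, header.length);
        System.arraycopy(payload, 0, file, header.length, payload.length);
        return file;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readBody(new ByteArrayInputStream(sample("UTF-16BE", "caf\u00e9"))));
    }
}
```

The first reader is simply abandoned after the header has been parsed, so its buffering does no harm.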

素罗衫 2024-09-04 08:28:50

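My first thought is to close the stream and reopen it, using InputStream#skip to skip past the header before giving the stream to the new InputStreamReader.

If you really, really don't want to reopen the file, you could use file descriptors to get more than one stream to the file, although you may have to use channels to have multiple positions within the file (since you can't assume you can reset the position with reset; it may not be supported).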
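A related variation on the channel idea, sketched under the assumption that the source is a FileInputStream: a FileInputStream and the channel returned by `getChannel()` share one file position, so after the header has been parsed the channel can be moved back to the first body byte without reopening the file. The `charset=NAME` header line and all names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ChannelReposition {
    static String readBody(Path file) throws IOException {
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            BufferedReader headerReader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.US_ASCII));
            String header = headerReader.readLine();
            if (header == null) throw new IOException("Missing header");
            Charset charset = Charset.forName(header.substring("charset=".length()).trim());

            // Moving the channel also moves the stream's position, which undoes
            // the read-ahead done by the header reader.
            in.getChannel().position(header.length() + 1L); // ASCII header plus '\n'

            Reader body = new InputStreamReader(in, charset);
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = body.read()) != -1) text.append((char) c);
            return text.toString();
        }
    }

    /** Writes a small sample file: ASCII header line followed by the encoded body. */
    static Path writeSample(String charsetName, String body) throws IOException {
        Path file = Files.createTempFile("channel-reposition", ".txt");
        try (OutputStream out = Files.newOutputStream(file)) {
            out.write(("charset=" + charsetName + "\n").getBytes(StandardCharsets.US_ASCII));
            out.write(body.getBytes(Charset.forName(charsetName)));
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path file = writeSample("UTF-16BE", "caf\u00e9");
        System.out.println(readBody(file));
        Files.delete(file);
    }
}
```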

反目相谮 2024-09-04 08:28:50

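It's even easier:

As you said, your header is always in ASCII. So read the header directly from the InputStream and, when you're done with it, create the Reader with the correct encoding and read from it.

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

private Reader reader;
private InputStream stream;
// Placeholders: the header parsing in the first loop is expected to set these
private boolean headerFullyRead;
private String encoding;

public void read() throws IOException {
    int c;
    while ((c = stream.read()) != -1) {
        // Read encoding: parse the ASCII header one byte at a time
        if ( headerFullyRead ) {
            reader = new InputStreamReader( stream, encoding );
            break;
        }
    }
    if ( reader == null ) {
        return; // stream ended before the header was complete
    }
    while ((c = reader.read()) != -1) {
        // Handle rest of file
    }
}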

扮仙女 2024-09-04 08:28:50

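If you wrap the InputStream and limit all reads to just 1 byte at a time, it seems to disable the buffering inside InputStreamReader.

This way we don't have to rewrite the InputStreamReader logic.

import java.io.IOException;
import java.io.InputStream;

public class OneByteReadInputStream extends InputStream
{
    private final InputStream inputStream;

    public OneByteReadInputStream(InputStream inputStream)
    {
        this.inputStream = inputStream;
    }

    @Override
    public int read() throws IOException
    {
        return inputStream.read();
    }

    // Hand back at most one byte per call, whatever length the caller asks for
    @Override
    public int read(byte[] b, int off, int len) throws IOException
    {
        if (len == 0)
            return 0;

        return super.read(b, off, 1);
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}

To construct:

new InputStreamReader(new OneByteReadInputStream(inputStream));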
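One way to check the "seems to" part is a small experiment like the sketch below. It uses a compact copy of the wrapper, reads a hypothetical `charset=NAME` header line through it, then puts a second reader directly on the raw stream and looks at what comes out. Whether InputStreamReader really stops after one byte is implementation behaviour rather than something the Javadoc promises, so treat this as a test to run on your own JDK, not as a guarantee.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class OneByteReadDemo {
    /** Same idea as the wrapper above: never return more than one byte per read call. */
    static final class OneByteAtATime extends InputStream {
        private final InputStream in;

        OneByteAtATime(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return len == 0 ? 0 : super.read(b, off, 1);
        }
    }

    static String readBody(InputStream raw) throws IOException {
        // The header reader only ever sees the stream through the one-byte wrapper
        Reader headerReader =
                new InputStreamReader(new OneByteAtATime(raw), StandardCharsets.US_ASCII);
        StringBuilder header = new StringBuilder();
        int c;
        while ((c = headerReader.read()) != -1 && c != '\n') {
            header.append((char) c);
        }
        Charset charset = Charset.forName(header.substring("charset=".length()).trim());

        // The body reader goes straight onto the raw stream
        Reader body = new InputStreamReader(raw, charset);
        StringBuilder text = new StringBuilder();
        while ((c = body.read()) != -1) {
            text.append((char) c);
        }
        return text.toString();
    }

    /** Builds a small in-memory sample: ASCII header line followed by the encoded body. */
    static byte[] sample(String charsetName, String body) {
        byte[] header = ("charset=" + charsetName + "\n").getBytes(StandardCharsets.US_ASCII);
        byte[] payload = body.getBytes(Charset.forName(charsetName));
        byte[] file = new byte[header.length + payload.length];
        System.arraycopy(header, 0, file, 0, header.length);
        System.arraycopy(payload, 0, file, header.length, payload.length);
        return file;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readBody(new ByteArrayInputStream(sample("UTF-16BE", "caf\u00e9"))));
    }
}
```

If the header reader had read ahead, the printed body would be truncated or garbled. The wrapper also leaves `available()` at the InputStream default of 0, which may be part of why the reader does not ask for more bytes.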
