为什么 US-ASCII 编码接受非 US-ASCII 字符？

发布于 2024-10-15 17:16:13 字数 831 浏览 2 评论 0原文

考虑以下代码：

public class ReadingTest {

    public void readAndPrint(String usingEncoding) throws Exception {
        ByteArrayInputStream bais = new ByteArrayInputStream(new byte[]{(byte) 0xC2, (byte) 0xB5}); // 'micro' sign UTF-8 representation
        InputStreamReader isr = new InputStreamReader(bais, usingEncoding);
        char[] cbuf = new char[2];
        isr.read(cbuf);
        System.out.println(cbuf[0]+" "+(int) cbuf[0]);
    }

    public static void main(String[] argv) throws Exception {
        ReadingTest w = new ReadingTest();
        w.readAndPrint("UTF-8");
        w.readAndPrint("US-ASCII");
    }
}

观察到的输出：

µ 181
? 65533

为什么第二次调用 readAndPrint()（使用 US-ASCII 的调用）成功？我希望它会抛出错误，因为输入不是此编码中的正确字符。 Java API 或 JLS 中的哪个位置强制执行此行为？

原文

Consider the following code:

public class ReadingTest {

    public void readAndPrint(String usingEncoding) throws Exception {
        ByteArrayInputStream bais = new ByteArrayInputStream(new byte[]{(byte) 0xC2, (byte) 0xB5}); // 'micro' sign UTF-8 representation
        InputStreamReader isr = new InputStreamReader(bais, usingEncoding);
        char[] cbuf = new char[2];
        isr.read(cbuf);
        System.out.println(cbuf[0]+" "+(int) cbuf[0]);
    }

    public static void main(String[] argv) throws Exception {
        ReadingTest w = new ReadingTest();
        w.readAndPrint("UTF-8");
        w.readAndPrint("US-ASCII");
    }
}

Observed output:

µ 181
? 65533

Why does the second call of readAndPrint() (the one using US-ASCII) succeed? I would expect it to throw an error, since the input is not a proper character in this encoding. What is the place in the Java API or JLS which mandates this behavior?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

只为守护你 2024-10-22 17:16:13

在输入流中查找不可解码字节时的默认操作是将其替换为 Unicode 字符 U+FFFD 替换字符。

如果您想更改它，可以传递 CharacterDecoder 到具有不同 CodingErrorAction 配置：

CharsetDecoder decoder = Charset.forName(usingEncoding).newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
InputStreamReader isr = new InputStreamReader(bais, decoder);

The default operation when finding un-decodable bytes in the input-stream is to replace them with the Unicode Character U+FFFD REPLACEMENT CHARACTER.

If you want to change that, you can pass a CharacterDecoder to the InputStreamReader which has a different CodingErrorAction configured:

CharsetDecoder decoder = Charset.forName(usingEncoding).newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
InputStreamReader isr = new InputStreamReader(bais, decoder);

回复收藏 0 原文