Octal escape in Java results in wrong byte value, encoding problem?
According to this documentation ( http://java.sun.com/docs/books/jls/third_edition/html/lexical.html , 3.10.6) an OctalEscape is converted to a Unicode character. My problem is that the following code produces a 2-byte sequence containing the wrong information.
for (byte b : "\222".getBytes()) {
    System.out.format("%02x ", b);
}
The result is "c2 92". I was expecting only "92", because that is the value of octal 222 converted to hex (92).
If I test this with a character, the byte information is correct.
System.out.format("%02x ", (byte)'\222');
The result is "92", a single byte.
My default encoding is "UTF-8" on Linux with Java/c 1.6.0_18.
The background of my question is that I'm looking for a way to convert an octal-escaped string from the input encoding Cp1252 to UTF-8. This fails because the octal-escaped string is converted to 2 bytes.
Does somebody know why an extra byte "c2" is always added to the char array? A simple count shows that there is only one character in the array.
System.out.println("\222".toCharArray().length); // will result in "1"
Thank you for your hints.
Update:
As BalusC mentioned, the octal-escaped value is interpreted as a UTF-8 value, which causes the problem. As long as this value is saved in the source code (UTF-8), I have no way to read this string in with another encoding. Am I right? If I read a Cp1252-encoded file, I have to declare the charset of the InputReader with the correct charset and encode to UTF-8 in order to process the content and save it as UTF-8.
Comments (2)
The String#getBytes() call without a specified encoding uses the platform default encoding to convert characters to bytes. Since c2 is a typical first byte of a two-byte UTF-8 sequence, you're apparently using UTF-8 as the platform default encoding. If you want to get CP1252 bytes, you need to specify that explicitly in the String#getBytes(String charsetName) method.

Update as per your update:

That's correct. You need to read the file using the same encoding it was saved in, otherwise you risk ending up with mojibake.

Just read the file as CP1252 using InputStreamReader. When read as characters (strings), Java stores the data implicitly as Unicode (UTF-16), so you can treat it as Unicode. There's no need to introduce an intermediate UTF-8 file step. If you want to save the file, use OutputStreamWriter with the desired charset, which can be different from CP1252. Just keep in mind that any character not covered by that charset will end up as ?.
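As a rough illustration of the advice above, here is a minimal sketch that first encodes "\222" with explicit charsets and then transcodes CP1252 bytes to UTF-8 through InputStreamReader and OutputStreamWriter. The class name TranscodeDemo, the helper names, and the in-memory sample bytes are made up for this example. It also assumes Java 8 or later (for StandardCharsets and UncheckedIOException) and a runtime that ships the windows-1252 charset, which is not one of the charsets Java guarantees.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original question or answer.
class TranscodeDemo {

    // Formats bytes like the loop in the question: two hex digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Decodes the input with charset 'from' and re-encodes it with charset 'to'.
    // In between, the data only exists as Java chars (UTF-16).
    static void transcode(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        Reader reader = new InputStreamReader(in, from);
        Writer writer = new OutputStreamWriter(out, to);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush();
    }

    // Convenience wrapper for in-memory data.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            transcode(new ByteArrayInputStream(input), from, out, to);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Same string, explicit charsets instead of the platform default.
        System.out.println(hex("\222".getBytes(StandardCharsets.UTF_8)));      // c2 92
        System.out.println(hex("\222".getBytes(StandardCharsets.ISO_8859_1))); // 92

        // The bytes of "It's" with a curly apostrophe, as a CP1252 file would store them.
        byte[] cp1252Bytes = {0x49, 0x74, (byte) 0x92, 0x73};
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] utf8 = transcode(cp1252Bytes, cp1252, StandardCharsets.UTF_8);
        System.out.println(hex(utf8)); // 49 74 e2 80 99 73
    }
}
```

For real files, the same transcode method could be given a FileInputStream and a FileOutputStream in place of the in-memory streams.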
All chars and strings in Java are UTF-16. So you have entered the control character U+0092 PRIVATE USE TWO and encoded it to UTF-8 (this character takes two bytes when encoded as UTF-8). Characters encoded as anything other than UTF-16 must be represented by byte arrays.

U+2019: ’

I'm guessing you intend to transcode the character U+2019 RIGHT SINGLE QUOTATION MARK. In windows-1252, this has a byte value of 92. I hate to disappoint, but when encoded as UTF-8 this ends up as the multi-byte sequence E2 80 99.

Also note that U+2019 can't be represented by an octal escape sequence in Java, as its value is over U+00FF. You'd have to use the Unicode escape sequence \u2019.

I wrote a blog post about transcoding in different languages here and encoding in Java source files here.
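The difference between the two characters can be checked with a small sketch like the one below. The class name QuoteMarkDemo and its helpers are invented for illustration, and the code assumes Java 7 or later and a runtime that provides the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original answer.
class QuoteMarkDemo {

    // Assumed to be available; windows-1252 is not a charset Java guarantees.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Two hex digits per byte, separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Interprets raw bytes as windows-1252 text.
    static String decodeCp1252(byte... bytes) {
        return new String(bytes, CP1252);
    }

    public static void main(String[] args) {
        // The octal escape names the code point U+0092, a control character.
        System.out.println(Integer.toHexString("\222".charAt(0))); // 92

        // The byte 0x92, decoded as windows-1252, is U+2019 instead.
        String quote = decodeCp1252((byte) 0x92);
        System.out.println(Integer.toHexString(quote.charAt(0))); // 2019
        System.out.println(quote.equals("\u2019"));               // true

        // U+2019 takes three bytes in UTF-8 and one byte in windows-1252.
        System.out.println(hex(quote.getBytes(StandardCharsets.UTF_8))); // e2 80 99
        System.out.println(hex(quote.getBytes(CP1252)));                 // 92

        // ISO-8859-1 has no mapping for U+2019, so getBytes substitutes '?'.
        System.out.println(hex(quote.getBytes(StandardCharsets.ISO_8859_1))); // 3f
    }
}
```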