Java 1.6 Windows-1252 encoding fails on 3 characters
EDIT: I've been convinced that this question is somewhat nonsensical. Thanks to those who responded. I may post a follow-up question that is more specific.
Today I was investigating some encoding problems and wrote this unit test to isolate a base repro case:
int badCount = 0;
for (int i = 1; i < 255; i++) {
    String str = "Hi " + new String(new char[] { (char) i });
    String toLatin1 = new String(str.getBytes("UTF-8"), "latin1");
    assertEquals(str, new String(toLatin1.getBytes("latin1"), "UTF-8"));
    String toWin1252 = new String(str.getBytes("UTF-8"), "Windows-1252");
    String fromWin1252 = new String(toWin1252.getBytes("Windows-1252"), "UTF-8");
    if (!str.equals(fromWin1252)) {
        System.out.println("Can't encode: " + i + " - " + str +
                " - encodes as: " + fromWin1252);
        badCount++;
    }
}
System.out.println("Bad count: " + badCount);
The output:
Can't encode: 129 - Hi ? - encodes as: Hi ??
Can't encode: 141 - Hi ? - encodes as: Hi ??
Can't encode: 143 - Hi ? - encodes as: Hi ??
Can't encode: 144 - Hi ? - encodes as: Hi ??
Can't encode: 157 - Hi ? - encodes as: Hi ??
Can't encode: 193 - Hi Á - encodes as: Hi ??
Can't encode: 205 - Hi Í - encodes as: Hi ??
Can't encode: 207 - Hi Ï - encodes as: Hi ??
Can't encode: 208 - Hi ? - encodes as: Hi ??
Can't encode: 221 - Hi ? - encodes as: Hi ??
Bad count: 10
JDK 1.6.0_07 on Mac OS 10.6.2
My observation:
Latin1 symmetrically encodes all 254 characters. Windows-1252 does not. The three printable characters (193, 205, 207) are the same codes in Latin1 and Windows-1252, so I wouldn't expect any issues.
Can anyone explain this behavior? Is this a JDK bug?
-- James
Comments (2)
In my opinion the testing program is deeply flawed, because it makes effectively useless transformations between Strings with no semantic meaning.
If you want to check if all byte values are valid values for a given encoding, then something like this might be more like it:
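A minimal sketch of such a per-byte check is shown here. It is an illustration rather than a definitive listing: the class name ByteRoundTrip, the helper failingBytes, and the default argument are assumptions. It decodes each single byte from 1 to 255 with the given charset, encodes the result again, and reports every byte that does not come back unchanged.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for illustration only.
class ByteRoundTrip {

    // Returns the unsigned values (1..255) of all bytes that do not survive
    // byte[] -> String -> byte[] in the given charset.
    static List<Integer> failingBytes(String charsetName) {
        Charset charset = Charset.forName(charsetName);
        List<Integer> bad = new ArrayList<Integer>();
        for (int i = 1; i <= 255; i++) {
            byte[] in = { (byte) i };
            // Invalid bytes are silently decoded to a replacement character...
            String decoded = new String(in, charset);
            // ...which cannot be encoded back to the original byte.
            byte[] out = decoded.getBytes(charset);
            if (out.length != 1 || out[0] != in[0]) {
                bad.add(i);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Assumed default so the sketch also runs without arguments.
        String charsetName = args.length > 0 ? args[0] : "Windows-1252";
        for (int value : failingBytes(charsetName)) {
            // Casting to byte prints the signed value Java uses internally.
            System.out.println("Can't round-trip byte: " + (byte) value
                    + " (unsigned " + value + ")");
        }
    }
}
```

It takes the charset name as its only argument, for example: java ByteRoundTrip Windows-1252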
Note that this testing program tests inputs using the (unsigned) byte values from 1 to 255. The code in the question uses the char values (equivalent to Unicode code points in this range) from 1 to 255.
Try printing the actual byte arrays handled by the program in the question and you'll see that you're not actually checking all byte values and that some of your "bad" matches are duplicates of others.
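One way to print those byte arrays is sketched below. The class RoundTripTrace and its methods are hypothetical names chosen for illustration. For a given char it shows the UTF-8 bytes the question's code starts from, and the bytes that remain after those are misread as Windows-1252 and encoded again. Each of the ten "bad" characters from the question has a UTF-8 encoding that ends in one of the bytes 0x81, 0x8D, 0x8F, 0x90 or 0x9D, which is why they pair up as duplicates (129 and 193, 141 and 205, and so on).

```java
import java.nio.charset.Charset;

// Hypothetical helper class, named for illustration only.
class RoundTripTrace {
    static final Charset UTF8 = Charset.forName("UTF-8");
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Returns { UTF-8 bytes of c, bytes after the Windows-1252 detour }.
    static String[] trace(char c) {
        byte[] original = String.valueOf(c).getBytes(UTF8);
        // Same step as the question: treat UTF-8 bytes as Windows-1252 text.
        String misread = new String(original, CP1252);
        byte[] reencoded = misread.getBytes(CP1252);
        return new String[] { hex(original), hex(reencoded) };
    }

    public static void main(String[] args) {
        int[] reported = { 129, 141, 143, 144, 157, 193, 205, 207, 208, 221 };
        for (int codePoint : reported) {
            String[] t = trace((char) codePoint);
            // For 193 this prints: 193: C3 81 -> C3 3F
            System.out.println(codePoint + ": " + t[0] + " -> " + t[1]);
        }
    }
}
```

In the damaged result the trailing byte has become 0x3F ('?'), so the final UTF-8 decoding sees an incomplete sequence followed by a question mark, which matches the "Hi ??" in the question's output.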
Running the test program with "Windows-1252" as the argument produces output that tells us Windows-1252 doesn't accept the byte values 129, 141, 143, 144 and 157 as valid values. (Note: I'm talking about unsigned byte values here. The program prints -127, -115, ... because Java only knows signed bytes.) The Wikipedia article on Windows-1252 seems to verify this observation.
What your code does (String->byte[]->String, twice) is pretty much the opposite of transcoding, and makes no sense at all (it's virtually guaranteed to lose data). Transcoding means byte[]->String->byte[]. And of course, it will lose data when the input contains characters that the target encoding does not support.
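The byte[]->String->byte[] direction can be sketched as follows. This is a minimal illustration, and the class Transcode and its transcode method are assumed names. The bytes are decoded with the charset they were actually written in, and the resulting String is then encoded with the target charset.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, named for illustration only.
class Transcode {

    // Decodes input using its real source charset, then encodes to the target.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset cp1252 = Charset.forName("windows-1252");

        // "Hi Á" as UTF-8 is 48 69 20 C3 81; as Windows-1252 it is 48 69 20 C1.
        byte[] source = "Hi \u00C1".getBytes(utf8);
        byte[] target = transcode(source, utf8, cp1252);
        System.out.println(target.length + " bytes, last = "
                + Integer.toHexString(target[target.length - 1] & 0xFF));

        // U+4E2D has no Windows-1252 mapping, so getBytes substitutes '?'.
        byte[] lossy = transcode("\u4E2D".getBytes(utf8), utf8, cp1252);
        System.out.println("lossy byte = " + (char) lossy[0]);
    }
}
```

The second call in main shows the data loss mentioned above: a character the target charset cannot represent is replaced by '?' during encoding.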