Java 1.6 Windows-1252 encoding fails on 3 characters

Posted 2024-08-19 05:00:20

EDIT: I've been convinced that this question is somewhat nonsensical. Thanks to those who responded. I may post a follow-up question that is more specific.

Today I was investigating some encoding problems and wrote this unit test to isolate a base repro case:

int badCount = 0;
for (int i = 1; i < 255; i++) {
    String str = "Hi " + new String(new char[] { (char) i });

    String toLatin1  = new String(str.getBytes("UTF-8"), "latin1");
    assertEquals(str, new String(toLatin1.getBytes("latin1"), "UTF-8"));

    String toWin1252 = new String(str.getBytes("UTF-8"), "Windows-1252");
    String fromWin1252 = new String(toWin1252.getBytes("Windows-1252"), "UTF-8");

    if (!str.equals(fromWin1252)) {
        System.out.println("Can't encode: " + i + " - " + str + 
                           " - encodes as: " + fromWin1252);
        badCount++;
    }
}

System.out.println("Bad count: " + badCount);

The output:

    Can't encode: 129 - Hi ? - encodes as: Hi ??
    Can't encode: 141 - Hi ? - encodes as: Hi ??
    Can't encode: 143 - Hi ? - encodes as: Hi ??
    Can't encode: 144 - Hi ? - encodes as: Hi ??
    Can't encode: 157 - Hi ? - encodes as: Hi ??
    Can't encode: 193 - Hi Á - encodes as: Hi ??
    Can't encode: 205 - Hi Í - encodes as: Hi ??
    Can't encode: 207 - Hi Ï - encodes as: Hi ??
    Can't encode: 208 - Hi ? - encodes as: Hi ??
    Can't encode: 221 - Hi ? - encodes as: Hi ??
    Bad count: 10

JDK 1.6.0_07 on Mac OS 10.6.2

My observation:

Latin1 symmetrically encodes all 254 characters. Windows-1252 does not. The three printable characters (193, 205, 207) are the same codes in Latin1 and Windows-1252, so I wouldn't expect any issues.

Can anyone explain this behavior? Is this a JDK bug?

-- James

Comments (2)

生寂 2024-08-26 05:00:20

In my opinion the testing program is deeply flawed, because it makes effectively useless transformations between Strings with no semantic meaning.

If you want to check whether all byte values are valid for a given encoding, then something like this is probably closer to what you want:

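import java.io.UnsupportedEncodingException;
import java.util.Arrays;

// Round-trips each single byte value through the given encoding and reports mismatches.
public static void tryEncoding(final String encoding) throws UnsupportedEncodingException {
    int badCount = 0;
    for (int i = 1; i < 255; i++) {
        byte[] bytes = new byte[] { (byte) i };

        // Decode the byte to a String, then encode that String back to bytes.
        String toString = new String(bytes, encoding);
        byte[] fromString = toString.getBytes(encoding);

        // A mismatch means this byte value does not survive the round trip.
        if (!Arrays.equals(bytes, fromString)) {
            System.out.println("Can't encode: " + i + " - in: " + Arrays.toString(bytes) + "/ out: "
                    + Arrays.toString(fromString) + " - result: " + toString);
            badCount++;
        }
    }

    System.out.println("Bad count: " + badCount);
}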

Note that this testing program tests inputs using the (unsigned) byte values from 1 to 254. The code in the question uses the char values (equivalent to Unicode code points in this range) from 1 to 254.

Try printing the actual byte arrays handled by the program in the question and you'll see that you're not actually checking all byte values, and that some of your "bad" matches are duplicates of others.
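To follow that suggestion, here is a minimal sketch that prints the UTF-8 bytes for the ten code points the question reported as failing. The class and method names are illustrative only, and it uses StandardCharsets, which needs Java 7 or later rather than the 1.6 in the question. Each of these code points encodes to two UTF-8 bytes, so you can compare the second byte of each pair against the byte values reported by the round-trip test.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteDump {
    // Hypothetical helper: the UTF-8 bytes of one code point, formatted as hex pairs.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative byte values print as 00..FF.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The ten values listed in the question's output.
        int[] reported = {129, 141, 143, 144, 157, 193, 205, 207, 208, 221};
        for (int cp : reported) {
            System.out.println(cp + " -> " + utf8Hex(cp));
        }
    }
}
```

For example, 193 (Á) comes out as C3 81 and 129 as C2 81: both end in the byte 0x81, which is why they show up as two separate "failures" in the question.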

Running this with "Windows-1252" as the argument produces this output:

Can't encode: 129 - in: [-127]/ out: [63] - result: �
Can't encode: 141 - in: [-115]/ out: [63] - result: �
Can't encode: 143 - in: [-113]/ out: [63] - result: �
Can't encode: 144 - in: [-112]/ out: [63] - result: �
Can't encode: 157 - in: [-99]/ out: [63] - result: �
Bad count: 5

This tells us that Windows-1252 doesn't accept the byte values 129, 141, 143, 144 and 157 as valid values. (Note: I'm talking about unsigned byte values here. The code above shows -127, -115, ... because Java only has signed bytes.)
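As a small illustration of the signed/unsigned point (the class and method names are hypothetical), masking a Java byte with 0xFF recovers the unsigned value:

```java
public class UnsignedByteDemo {
    // Java's byte type ranges from -128 to 127; masking with 0xFF yields 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 129;               // stored as -127
        System.out.println(b);             // signed view of the byte
        System.out.println(toUnsigned(b)); // unsigned view of the same byte
    }
}
```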

The Wikipedia article on Windows-1252 seems to verify this observation by stating this:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused
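One way to check those positions directly, instead of comparing round-trip results, is to decode with a CharsetDecoder configured to report errors rather than substitute a replacement character. This is only a sketch: the class and method names are made up for illustration, and it assumes the JDK's Windows-1252 decoder treats the unused positions as unmappable, which is consistent with the output above but may differ between JDK versions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictDecodeCheck {
    // Hypothetical helper: true if the single byte value (0..255) decodes without error.
    static boolean isMapped(String charsetName, int unsignedByte) {
        try {
            Charset.forName(charsetName)
                   .newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(new byte[] { (byte) unsignedByte }));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown when the decoder is told to report instead of replace.
            return false;
        }
    }

    public static void main(String[] args) {
        // 0x80..0x9F is the range where Windows-1252 differs from Latin1.
        for (int i = 0x80; i <= 0x9F; i++) {
            if (!isMapped("windows-1252", i)) {
                System.out.println(String.format("0x%02X is not mapped", i));
            }
        }
    }
}
```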

蓝天 2024-08-26 05:00:20

What your code does (String->byte[]->String, twice) is pretty much the opposite of transcoding, and makes no sense at all (it's virtually guaranteed to lose data). Transcoding means byte[]->String->byte[]:

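// Requires: import java.io.UnsupportedEncodingException;
public byte[] transcode(byte[] input, String inputEnc, String targetEnc)
        throws UnsupportedEncodingException
{
    // Decode with the source encoding, then re-encode with the target encoding.
    return new String(input, inputEnc).getBytes(targetEnc);
}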

And of course, it will lose data when the input contains characters that the target encoding does not support.
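A minimal sketch of that data loss, using a Charset-based variant of the method above (the class name and this variant are illustrative, and StandardCharsets needs Java 7 or later): the euro sign exists in Windows-1252 and UTF-8 but not in Latin1, so transcoding it to Latin1 substitutes a question mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyTranscodeDemo {
    // Same idea as transcode above, taking Charset objects instead of names.
    static byte[] transcode(byte[] input, Charset inputEnc, Charset targetEnc) {
        return new String(input, inputEnc).getBytes(targetEnc);
    }

    public static void main(String[] args) {
        Charset win1252 = Charset.forName("windows-1252");
        byte[] euro = { (byte) 0x80 }; // the euro sign in Windows-1252

        // UTF-8 can represent the euro sign, so no information is lost.
        System.out.println(Arrays.toString(transcode(euro, win1252, StandardCharsets.UTF_8)));

        // Latin1 has no euro sign, so getBytes substitutes '?' (byte 63).
        System.out.println(Arrays.toString(transcode(euro, win1252, StandardCharsets.ISO_8859_1)));
    }
}
```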
