Conversion between UTF-8 and ISO 8859-1

Posted on 2025-01-06 10:27:45

I found the following code in SO. Does this really work?

String xml = new String("áéíóúñ");
byte[] latin1 = xml.getBytes("UTF-8");
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");

I mean, latin1 is UTF-8-encoded in the second line, but read as ISO-8859-1-encoded in the third? Can this ever work?

It is not that I want to criticize the cited code. I am just confused, because I ran into some very similar legacy code that seems to work, and I cannot explain why.

EDIT: I guess that in the original post, "UTF-8" in line 2 was just a typo. But I am not sure ...

EDIT2: After my initial posting, someone edited the code above and changed the 2nd line to byte[] latin1 = xml.getBytes("ISO-8859-1");. I don't know who did that or why, but it clearly caused a lot of confusion. Sorry to everyone who saw the wrong version of the code. The code above is correct now.


Comments (2)

来日方长 2025-01-13 10:27:45

getBytes(Charset charset) results in a byte array encoded using the charset, so latin1 is UTF-8 encoded.

Put System.out.println(latin1.length); as the third line and it will tell you that the byte array length is 12: six characters at two bytes each. ISO-8859-1 would need only 6 bytes, so the array really is UTF-8 encoded.

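new String(latin1, "ISO-8859-1") is incorrect because latin1 is UTF-8 encoded and you are telling Java to parse it as ISO-8859-1. That's why it produces a String made of 12 garbage characters: Ã¡Ã©Ã­Ã³ÃºÃ± (the character after the third Ã is an invisible soft hyphen, U+00AD).

When you then get the bytes of Ã¡Ã©Ã­Ã³ÃºÃ± using UTF-8 encoding, the result is a 24-byte array, because each of those 12 characters takes two bytes in UTF-8.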

I hope everything is clear now.
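
The lengths quoted above can be checked with a small program. This is only a sketch: the class name MojibakeDemo and its helper methods are made up for illustration, and the text is written with Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // "áéíóúñ" as Unicode escapes, independent of the source file encoding.
    static final String TEXT = "\u00e1\u00e9\u00ed\u00f3\u00fa\u00f1";

    // Line 2 of the quoted code: the bytes are UTF-8, despite the variable name latin1.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // First half of line 3: the UTF-8 bytes are decoded as if they were ISO-8859-1.
    static String misdecode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] latin1 = encodeUtf8(TEXT);
        String garbage = misdecode(latin1);
        byte[] utf8 = garbage.getBytes(StandardCharsets.UTF_8);

        System.out.println(latin1.length);        // 12: six characters, two bytes each
        System.out.println(garbage.length());     // 12: one garbage character per byte
        System.out.println(utf8.length);          // 24: each garbage character needs two bytes
        System.out.println(garbage.equals(TEXT)); // false: the text has been corrupted
    }
}
```

Using the StandardCharsets constants instead of charset names as strings also avoids the checked UnsupportedEncodingException that getBytes("UTF-8") declares.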

耳钉梦 2025-01-13 10:27:45

Those characters are present in both character encodings. It's just that UTF-8 and ISO-8859-1 use different byte representations for every character beyond the ASCII range.

If you use a character that is present in UTF-8 but not in ISO-8859-1, then the conversion will of course fail.
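
Both points can be illustrated with a short sketch. The class CharsetCoverageDemo and its helpers are hypothetical names chosen for this example. Note that with String.getBytes(Charset), "fail" does not mean an exception: a character that the charset cannot represent is silently replaced, which for ISO-8859-1 produces the byte for '?'.

```java
import java.nio.charset.StandardCharsets;

public class CharsetCoverageDemo {
    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    static String latin1Hex(String s) {
        return hex(s.getBytes(StandardCharsets.ISO_8859_1));
    }

    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String enye = "\u00f1"; // ñ, present in both encodings
        String euro = "\u20ac"; // €, not present in ISO-8859-1

        System.out.println(latin1Hex(enye)); // F1
        System.out.println(utf8Hex(enye));   // C3 B1
        System.out.println(utf8Hex(euro));   // E2 82 AC
        System.out.println(latin1Hex(euro)); // 3F, i.e. '?': the character is lost
    }
}
```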
