"Fixing" a String encoding in Java

Posted 2024-08-28 04:53:04

I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).

Is there a way to convert this String back to the right encoding?

I know it's easy to do if you have access to the original byte array, but in my case it's too late because the String is given by a closed-source library.

Comments (4)

Hello爱情风 2024-09-04 04:53:04

As there seems to be some confusion on whether this is possible or not I think I'll need to provide an extensive example.

The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").

For this example I'll choose the German word "Bär" (meaning bear) as the input:

byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.

(If your JVM doesn't support that encoding, then you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in those two encodings).

The question goes on to state that some other code (that is outside of our influence) already converted that byte[] to a String using the UTF-8 encoding (I'll call that String is for "input String"). That String is the only input that is available to achieve our goal (if ib were available, it would be trivial):

String is = new String(ib, "UTF-8");
System.out.println(is);

This obviously produces the incorrect output "B�".

The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.

Now some people claim that getting the UTF-8 encoded bytes from that is will return an array with the same values as the initial array:

byte[] utf8Again = is.getBytes("UTF-8");

But that returns the UTF-8 encoding of the two characters B and � (U+FFFD, the replacement character) and definitely returns the wrong result when re-interpreted as Windows-1252:

System.out.println(new String(utf8Again, "Windows-1252"));

This line produces the output "Bï¿½", which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).

So in this case you can't undo the operation, because some information was lost.

There are in fact cases where such mis-encodings can be undone. It's more likely to work when all possible (or at least all occurring) byte sequences are valid in the encoding that was wrongly used for decoding. Since UTF-8 has several byte sequences that are simply not valid, you will have problems.
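
To make the information loss concrete, here is a minimal sketch that plays the role of the closed-source library and then tries the "just re-encode it" idea. The class and method names (LossyDecodeDemo, misdecode, survivesRoundTrip) are made up for illustration, and the exact replacement output may differ between JDK versions, so the code compares values instead of relying on a particular printed string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyDecodeDemo {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Stands in for the library: decodes Windows-1252 bytes as if they were UTF-8. */
    public static String misdecode(byte[] cp1252Bytes) {
        // new String(byte[], Charset) never throws; malformed input becomes U+FFFD.
        return new String(cp1252Bytes, StandardCharsets.UTF_8);
    }

    /** True if re-encoding the mis-decoded String as UTF-8 gives the original bytes back. */
    public static boolean survivesRoundTrip(byte[] cp1252Bytes) {
        byte[] again = misdecode(cp1252Bytes).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(cp1252Bytes, again);
    }

    public static void main(String[] args) {
        byte[] baer = "B\u00E4r".getBytes(CP1252); // 42 E4 72
        byte[] buer = "B\u00FCr".getBytes(CP1252); // 42 FC 72

        System.out.println("Bär survives round trip: " + survivesRoundTrip(baer));
        System.out.println("Bür survives round trip: " + survivesRoundTrip(buer));
        // If both words collapse to the same String, no later step can tell them apart.
        System.out.println("Same mis-decoded String: " + misdecode(baer).equals(misdecode(buer)));
    }
}
```

Pure ASCII input does survive the round trip, which is why this kind of damage often goes unnoticed until the first accented character shows up.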

懒猫 2024-09-04 04:53:04

I tried this and it worked for some reason

Code to repair encoding problem (it doesn't work perfectly, which we will see shortly):

 final Charset fromCharset = Charset.forName("windows-1252");
 final Charset toCharset = Charset.forName("UTF-8");
 String fixed = new String(input.getBytes(fromCharset), toCharset);
 System.out.println(input);
 System.out.println(fixed);

The results are:

 input: â€¦Und ich beweg mich (aber heut nur langsam)
 fixed: …Und ich beweg mich (aber heut nur langsam)

Here's another example:

 input: Waun da wuan ned wa (feat. Wolfgang KÃ¼hn)
 fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)

Here's what is happening and why the trick above seems to work:

  1. The original file was a UTF-8 encoded text file (comma delimited)
  2. That file was imported with Excel BUT the user mistakenly entered Windows 1252 for the encoding (which was probably the default encoding on his or her computer)
  3. The user thought the import was successful because all of the characters in the ASCII range looked okay.

Now, when we try to "reverse" the process, here is what happens:

 // we start with this garbage, two characters we don't want!
 String input = "Ã¼";

 final Charset cp1252 = Charset.forName("windows-1252");
 final Charset utf8 = Charset.forName("UTF-8");

 // let's convert it to bytes in windows-1252:
 // this gives you 2 bytes: c3 bc
 // "Ã" ==> c3
 // "¼" ==> bc
 byte[] windows1252Bytes = input.getBytes(cp1252);

 // but in utf-8, c3 bc is "ü"
 String fixed = new String(windows1252Bytes, utf8);

 System.out.println(input);
 System.out.println(fixed);
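
Because the trick silently produces "?" or "�" when it cannot work, it may help to make it strict. The following is a sketch, not part of the original answer: the class MojibakeRepair and its method tryRepair are hypothetical names, and the approach assumes the damage was exactly "UTF-8 bytes decoded as Windows-1252". It asks the encoder and decoder to report problems instead of replacing characters, and returns an empty Optional when the repair cannot be trusted.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class MojibakeRepair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Re-encodes the garbled text as Windows-1252 and decodes those bytes as UTF-8.
     * Returns Optional.empty() if a character has no Windows-1252 byte or if the
     * resulting bytes are not valid UTF-8.
     */
    public static Optional<String> tryRepair(String garbled) {
        try {
            ByteBuffer bytes = CP1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(garbled));
            String fixed = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
            return Optional.of(fixed);
        } catch (CharacterCodingException e) {
            // Either step failed, so the input was not cleanly repairable.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRepair("K\u00C3\u00BChn")); // "KÃ¼hn", the garbled form
        System.out.println(tryRepair("K\u00FChn"));       // "Kühn", already correct
        System.out.println(tryRepair("\u00C3\uFFFD"));    // a character was already lost
    }
}
```

A side effect of the strict settings is that text which is already correct but contains non-ASCII characters is usually rejected instead of being garbled a second time.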

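The encoding fixing code above kind of works but fails for the following characters:

(Assuming the text only contained characters that are single bytes in Windows-1252):

char    utf-8 bytes     |   string decoded as cp1252 -->   as cp1252 bytes
”       e2 80 9d        |       â€�                        e2 80 3f
Á       c3 81           |       Ã�                         c3 3f
Í       c3 8d           |       Ã�                         c3 3f
Ï       c3 8f           |       Ã�                         c3 3f
Ð       c3 90           |       Ã�                         c3 3f
Ý       c3 9d           |       Ã�                         c3 3f

In every failing row the last UTF-8 byte (81, 8d, 8f, 90 or 9d) has no character assigned in Windows-1252, so it was decoded to � and comes back as 3f ("?").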

It does work for some of the characters, e.g. these:

char    utf-8 bytes     |   string decoded as cp1252 -->   as cp1252 bytes   fixed
Þ       c3 9e           |       Ãž                         c3 9e             Þ
ß       c3 9f           |       ÃŸ                         c3 9f             ß
à       c3 a0           |       Ã(NBSP)                    c3 a0             à
á       c3 a1           |       Ã¡                         c3 a1             á
â       c3 a2           |       Ã¢                         c3 a2             â
ã       c3 a3           |       Ã£                         c3 a3             ã
ä       c3 a4           |       Ã¤                         c3 a4             ä
å       c3 a5           |       Ã¥                         c3 a5             å
æ       c3 a6           |       Ã¦                         c3 a6             æ
ç       c3 a7           |       Ã§                         c3 a7             ç
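
The difference between the two tables can be checked directly. The sketch below (the class Cp1252Gaps and its methods are illustrative names, not from the original answer) decodes every single byte value as Windows-1252, encodes it again, and collects the byte values that do not come back unchanged. It assumes the JVM's windows-1252 charset treats the unassigned positions as undecodable, which is what the 3f bytes in the failing rows suggest.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class Cp1252Gaps {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** True if the byte value survives decode-then-encode in Windows-1252. */
    public static boolean roundTrips(int value) {
        byte[] original = { (byte) value };
        byte[] again = new String(original, CP1252).getBytes(CP1252);
        return again.length == 1 && again[0] == original[0];
    }

    /** All byte values 0x00..0xFF that are changed by the round trip. */
    public static List<Integer> lossyBytes() {
        List<Integer> lossy = new ArrayList<>();
        for (int value = 0x00; value <= 0xFF; value++) {
            if (!roundTrips(value)) {
                lossy.add(value);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        for (int value : lossyBytes()) {
            System.out.printf("byte %02x does not survive the round trip%n", value);
        }
    }
}
```

Any UTF-8 sequence that contains one of the reported bytes cannot be recovered by the trick, which matches the failing rows in the first table.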

NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.

浪菊怪哟 2024-09-04 04:53:04

What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.

Edit:
As a commenter said, this won't work. When you decode a Windows-1252 byte array as if it were UTF-8, you are bound to get encoding errors. (See here and here).
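
Whether such an error shows up as an exception depends on how the bytes are decoded. As far as the JDK documentation goes, new String(byte[], Charset) replaces malformed input silently, while a CharsetDecoder obtained from newDecoder() reports it by default. The sketch below (StrictDecodeDemo and isValidUtf8 are illustrative names) shows a check that code with access to the raw bytes could run before trusting a UTF-8 decode. It does not help once only the String is left.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    /** True if the bytes form well-formed UTF-8; a fresh decoder reports errors by default. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] cp1252Baer = { 0x42, (byte) 0xE4, 0x72 };               // "Bär" in Windows-1252
        byte[] utf8Baer = "B\u00E4r".getBytes(StandardCharsets.UTF_8); // "Bär" in UTF-8

        System.out.println("cp1252 bytes valid as UTF-8: " + isValidUtf8(cp1252Baer));
        System.out.println("UTF-8 bytes valid as UTF-8:  " + isValidUtf8(utf8Baer));
    }
}
```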

删除→记忆 2024-09-04 04:53:04

You can use this tutorial

The charset you need should be defined in rt.jar (according to this)
