配置更改后 MySQL 数据库中的 UTF-8 字符串变得混乱

发布于 2024-12-11 03:54:13 字数 1115 浏览 0 评论 0原文

我有一个带有字符串的 MySQL,我将其休眠了一段时间。现在我再次拿起它,我发现所有的特殊字符都搞砸了。我的 ISP 已将服务器移植到另一台计算机,我怀疑这可能是发生的时间。

该数据库由 PHP 脚本填充。所有内容都应该采用 UTF-8 格式,这就是数据库的设置。

然而,现在的字符串是这样的:

fàªte

这四个特殊字符应该是一个字符,ê,字符串应该是 fête< /代码>。

现在看起来这只是重新编码了两次,但这似乎不对。这四个十六进制字符是:

C3 83 C6 92 C3 82 C2 AA

这看起来非常像 UTF-8,所以如果我们解码它,我们会得到

C3 3F C2 AA

这不完全是 UTF-8(因为 3F),但让我们再次解码它:

FF AA

这不是 UTF-8。

ê 字符是 EA,在 UTF-8 中,即 C3 AA

另一个例子:西班牙语倒置问号 (¿) 为 C8 83 E2 80 9A C3 82 C2,解码为 C3 3F 82 BF< /code>,这又不是正确的 UTF-8(转换为 FF 82 BF)。 ¿ 的预期字符是 BF,即正确的 UTF-8 中的 C2 BF

这里发生了什么?人物是怎么乱七八糟的?更重要的是,我该如何解决它?

(旁注 - 新服务器要求我编写 mysql_set_charset("utf8"); ,否则字符串也会变得混乱,尽管是“UTF-8 as latin1”方式,而不是这种奇怪的方式如上所示。)

TL;DR:

  • MySQL 数据库通过 PHP 脚本以 UTF-8 填充
  • 休眠多年,服务器已迁移。
  • 现在字符混乱了,见上文。

I have a MySQL with strings that I left dormant for a while. Now that I picked it up again, I noticed that all the special characters are screwed up. My ISP has ported the server to a different machine, I suspect that this might be when it happened.

The database was populated by a PHP script. Everything was supposed to be in UTF-8, that's what the database is set to.

However, this is what a string looks like now:

fête

Those four special characters are supposed to be one character, ê, the string is meant to be fête.

Now it looks like this is just re-encoded twice, but that doesn't seem right. Those four characters in hex are:

C3 83 C6 92 C3 82 C2 AA

This looks very much like UTF-8, so if we decode it, we get

C3 3F C2 AA

This isn't quite UTF-8 (because of the 3F), but let's decode it again:

FF AA

This is not UTF-8.

The ê character is EA, in UTF-8, that would be C3 AA.

Another example: The Spanish upside-down question mark (¿) is there as C8 83 E2 80 9A C3 82 C2, which decodes to C3 3F 82 BF, which isn't proper UTF-8 again (translates to FF 82 BF). The expected character for ¿ is BF, i.e. C2 BF in proper UTF-8.

What happened here? How did the characters get messed up? More importantly, how do I fix it?

(Side note - the new server requires me to write mysql_set_charset("utf8"); or else strings get messed up too, although in the "UTF-8 as latin1" fashion, not in this weird fashion as seen above.)

TL;DR:

  • MySQL database was populated in UTF-8 through PHP script
  • Lay dormant for years, server got migrated.
  • Now characters are messed up, see above.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

千仐 2024-12-18 03:54:13
C3 83 C6 92 C3 82 C2 AA

这看起来非常像 UTF-8,所以如果我们解码它,我们会得到

C3 3F C2 AA

如果将字节序列视为 UTF-8,然后将其编码为 ISO-8859-1,就会得到这样的结果。 3F?,它已作为替换字符包含在内,因为 UTF-8 C6 92 是 U+0192 f。但它确实存在于 Windows 代码页 1252 西欧中,该编码与 ISO-8859-1 非常相似;在那里,它是字节 0x83。

C3 83 C2 AA

经过另一轮 treat-as-UTF-8-bytes-and-encode-to-cp1252 ,你会得到:

C3 AA

最后,ê 的 UTF-8。

请注意,即使您将非 XML HTML 页面显式提供为 ISO-8859-1,由于令人讨厌的历史原因,浏览器实际上也会使用 cp1252 编码。

不幸的是 MySQL 没有 cp1252 编码; latin1(正确地)是 ISO-8859-1。因此,您将无法通过转储为 latin1 然后重新加载为 utf8(两次)来修复数据。您必须使用文本编辑器处理脚本,该文本编辑器可以另存为(或例如在Python file(path, 'rb').read().decode('utf-8').encode(' cp1252').decode('utf-8').encode('cp1252'))。

C3 83 C6 92 C3 82 C2 AA

This looks very much like UTF-8, so if we decode it, we get

C3 3F C2 AA

That's what you get if you treat the sequence of bytes as UTF-8, then encode it as ISO-8859-1. 3F is ?, which has been included as a replacement character, because UTF-8 C6 92 is U+0192 ƒ which does not exist in ISO-8859-1. But it does exist in Windows code page 1252 Western European, an encoding very similar to ISO-8859-1; there, it's byte 0x83.

C3 83 C2 AA

Go through another round of treat-as-UTF-8-bytes-and-encode-to-cp1252 and you get:

C3 AA

which is, finally, UTF-8 for ê.

Note that even if you serve a non-XML HTML page explicitly as ISO-8859-1, browsers will actually use the cp1252 encoding, due to nasty historical reasons.

Unfortunately MySQL doesn't have a cp1252 encoding; latin1 is (correctly) ISO-8859-1. So you won't be able to fix up the data by dumping as latin1 then reloading as utf8 (twice). You'd have to process the script with a text editor that can save as either (or eg in Python file(path, 'rb').read().decode('utf-8').encode('cp1252').decode('utf-8').encode('cp1252')).

一抹淡然 2024-12-18 03:54:13

我怀疑您可能将字符作为 UTF8 字符串存储在 latin1 (或类似)数据库中。这就是为什么你会遇到“双重编码”问题。将数据库的 CHARSET 设置为 UTF8 应该可以解决这个问题。转储/导入数据可能也是必要的,沿着这些思路:

$ mysqldump --default-character-set=latin1  --skip-set-charset --databases xxx > xxx.sql
$ mysql --default-character-set=utf8 < xxx.sql

但这只是建议,可能有效,但在您的具体情况下不必这样做。

I suspect you might have your characters stored as UTF8 strings in a latin1 (or similar) database. That's why you've got 'double encoding' problem. Making the database's CHARSET UTF8 should fix it. Dumping/importing the data might be necessary as well, something along those lines:

$ mysqldump --default-character-set=latin1  --skip-set-charset --databases xxx > xxx.sql
$ mysql --default-character-set=utf8 < xxx.sql

But that's just suggestion, might work but don't have to in your specific case.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文