如何更正 MySQL utf8_general_ci 字段中的双编码 UTF-8 字符串?

发布于 2024-11-06 23:00:58 字数 530 浏览 0 评论 0原文

我必须重新设计一个类,其中(除其他外)UTF-8 字符串被错误地双重编码:

$string = iconv('ISO-8859-1', 'UTF-8', $string);
:
$string = utf8_encode($string);

这些错误的字符串已被保存到 MySQL 数据库中的多个表字段中。所有受影响的字段都使用排序规则 utf8_general_ci

通常我会设置一个小的 PHP 补丁脚本,循环遍历受影响的表,选择记录,通过在双编码字段上使用 utf8_decode() 来纠正错误记录并更新它们。

由于这次我有很多巨大的表,并且该错误仅影响德语变音符号(äöüßäÖÜ),我想知道是否有比这更智能/更快的解决方案。

像下面这样的纯 MySQL 解决方案安全且值得推荐吗?

 UPDATE `table` SET `col` = REPLACE(`col`, 'ä', 'ä');

还有其他解决方案/最佳实践吗?

I have to redesign a class where (amongst other things) UTF-8 strings are double-encoded wrongly:

$string = iconv('ISO-8859-1', 'UTF-8', $string);
:
$string = utf8_encode($string);

These faulty strings have been saved into multiple table fields all over a MySQL database. All fields being affected use collation utf8_general_ci.

Usually I'd setup a little PHP patch script, looping thru the affected tables, SELECTing the records, correct the faulty records by using utf8_decode() on the double-encoded fields and UPDATE them.

As I got many and huge tables this time, and the error only affects german umlauts (äöüßÄÖÜ), I'm wondering if there's a solution smarter/faster than that.

Are pure MySQL solutions like the following safe and recommendable?

 UPDATE `table` SET `col` = REPLACE(`col`, 'ä', 'ä');

Any other solutions/best practices?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

尹雨沫 2024-11-13 23:00:58

更改表以将列字符集更改为 Latin-1。您现在将拥有单独编码的 UTF-8 字符串,但位于排序规则应为 Latin-1 的字段中。

然后你要做的就是,通过二进制字符集将列字符集更改回 UTF-8 - 这样 MySQL 就不会在任何时候转换字符。

ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET latin1
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET binary
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET utf8

(这是正确的语法 iirc;将适当的列类型放在 ... 所在的位置)

Alter the table to change the column character set to Latin-1. You will now have singly-encoded UTF-8 strings, but sitting in a field whose collation is supposed to be Latin-1.

What you do then is, change the column character set back to UTF-8 via the binary character set - that way MySQL doesn't convert the characters at any point.

ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET latin1
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET binary
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET utf8

(is the correct syntax iirc; put the appropriate column type in where ... is)

再见回来 2024-11-13 23:00:58

我尝试了发布的解决方案,但我的数据库不断出现错误。最终我偶然发现了以下解决方案(我相信是在一个论坛上,但我不记得在哪里):

UPDATE table_name SET col_name = CONVERT(CONVERT(CONVERT(col_name USING latin1) USING binary) USING utf8);

并且它很有效。希望这可以帮助那些像我一样从绝望的谷歌搜索中偶然发现这里的人。

注意:当然,这是假设您的双编码字符问题源于从latin1到utf8的过度有用的MySQL转换,但我相信这就是大多数“损坏的字符”发生的地方。这基本上执行与上面提到的相同的转换,返回到 latin1,然后是二进制,然后是 utf8(使用二进制步骤作为防止重新编码已编码的 latin1 实体的方法)

I tried the posted solutions, but my DB kept spitting up errors. Eventually I stumbled upon the following solution (in a forum I believe, but I can't remember where):

UPDATE table_name SET col_name = CONVERT(CONVERT(CONVERT(col_name USING latin1) USING binary) USING utf8);

and it worked a treat. Hope this helps anyone who stumbled here from desperate google searching like me.

NOTE: This is of course assuming your double encoded character issues originate from an overly helpful MySQL conversion from latin1 to utf8, but I believe that's where most of these "corrupted characters" happen. This basically does the same conversion as mentioned above back to latin1, then binary, then to utf8 (using the binary step as a way to prevent the re-encoding of the already encoded latin1 entities)

知足的幸福 2024-11-13 23:00:58

我发现以下方法更简单:

mysqldump -h DB_HOST -u DB_USER -p --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

然后删除所有表并使用以下命令重新导入:

mysql -h DB_HOST -u DB_USER -p --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

在此 URL 中找到提示:
http://blog. hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

I found the following approach simpler:

mysqldump -h DB_HOST -u DB_USER -p --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

Then drop all tables and re-import with following command:

mysql -h DB_HOST -u DB_USER -p --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

Tip was found at this URL:
http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

孤千羽 2024-11-13 23:00:58

MySql 能够识别字符集,因此您可以在 SQL 中进行转换。但对于这种情况,我可能更愿意只用 PHP 编写脚本,因为无论如何它都是一次性任务。

请记住,MySql 中的列具有字符集属性。排序规则(理论上)与字符集正交。虽然 utf8_general_ci 排序规则暗示字符集是 utf8,但这并不是给定的。理论上,您可以将 utf8 排序规则与 latin1 编码混合使用(结果会产生垃圾)。

如果您决定在 SQL 中执行此操作,请查看此处:

http:// /dev.mysql.com/doc/refman/5.0/en/charset-convert.html

MySql is charset aware, so you can convert in SQL. But for this case, I would probably prefer to just script it in PHP, as it's a one-off task anyway.

Keep in mind that columns in MySql have a charset property. The collation is (in theory) orthogonal to the charset. While a utf8_general_ci collation would imply that the charset is utf8, it's not a given. You could in theory mix a utf8 collation with a latin1 encoding (And get garbage as a result).

If you decide to do this in SQL, look here:

http://dev.mysql.com/doc/refman/5.0/en/charset-convert.html

好菇凉咱不稀罕他 2024-11-13 23:00:58

MySQL 提供正则表达式匹配,但没有正则表达式替换,因此您通常最好迭代 php 中的每一行,根据需要进行转换,并在行发生更改时更新该行。

MySQL provides a regexp match but no regexp replace, so you're usually better off iterating through each row in php, converting as needed, and updating the row if it has been changed.

触ぅ动初心 2024-11-13 23:00:58

使用 mysqldump 生成转储,更改编码声明(位于第一个命令中),然后重新加载到另一个数据库中。

您还可以在转储上使用 iconv 对其进行转码。

您可以选择 INTO OUTFILE,使用 php 或 iconv 处理文件,然后 LOAD DATA INFILE。

Generate a dump using mysqldump, change the encoding declaration (it's in the first commands), and reload in another database.

You can also use iconv on your dump to transcode it.

You can SELECT INTO OUTFILE, massage the file using php or iconv, then LOAD DATA INFILE.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文