How to detect UTF-8 characters in a Latin1-encoded column - MySQL

Posted on 2025-01-05 20:56:59

I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.

At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.

Specifically, I want to check whether I have UTF-8 characters in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, then I can just fix this manually.

Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?

Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters?
e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);
Is this enough?

At the moment I have switched my MySQL client encoding to UTF-8.

Comments (4)

ι不睡觉的鱼゛ 2025-01-12 20:56:59

Character encoding, like time zones, is a constant source of problems.

What you can do is look for any "high-ASCII" characters (bytes in the range 0x80 to 0xFF), as these are either LATIN1 accented characters or symbols, or part of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.

To figure out what encoding is correct, you just SELECT two different versions and compare visually. Here's an example:

SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1,
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8
FROM users
-- match any row containing at least one byte in the range 0x80-0xFF
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']');

This is made unusually complicated because the MySQL regexp engine seems to ignore things like \x80 and makes it necessary to use the UNHEX() method instead.

This produces results like this:

latin1                utf8
----------------------------------------
BjÃ¶rn                Björn
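
The same side-by-side comparison can be sketched outside the database. This is a minimal Python sketch, assuming you already have a column's raw bytes in hand (for example from a binary-safe dump). The function names `decode_both_ways` and `has_high_bytes` are made up for illustration, and Python's `latin-1` codec (ISO-8859-1) is only a stand-in for MySQL's latin1, which is closer to Windows-1252, so bytes 0x80 to 0x9F may render differently.

```python
from typing import Optional, Tuple


def decode_both_ways(raw: bytes) -> Tuple[str, Optional[str]]:
    """Return raw decoded as Latin-1 and as UTF-8 (None if not valid UTF-8)."""
    # Decoding as Latin-1 never fails: every byte maps to one character.
    as_latin1 = raw.decode("latin-1")
    try:
        as_utf8 = raw.decode("utf-8")
    except UnicodeDecodeError:
        as_utf8 = None
    return as_latin1, as_utf8


def has_high_bytes(raw: bytes) -> bool:
    """True if any byte is in the range 0x80-0xFF, like the RLIKE filter above."""
    return any(b >= 0x80 for b in raw)


if __name__ == "__main__":
    # Hypothetical sample values: UTF-8 bytes, Latin-1 bytes, plain ASCII.
    samples = [b"Bj\xc3\xb6rn", b"Bj\xf6rn", b"Bjorn"]
    for raw in samples:
        if has_high_bytes(raw):
            print(raw, decode_both_ways(raw))
```

A row whose UTF-8 decoding comes back as `None` cannot be UTF-8, so only rows where both decodings succeed need the visual comparison.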

逆蝶 2025-01-12 20:56:59

Since your question is not completely clear, let's assume some scenarios:

  1. Hitherto wrong connection: You've been connecting to your database using the latin1 encoding, but have been sending it UTF-8 data (the encoding of the column is irrelevant in this case). This is the case I described here. In this case, it's easy to fix: Dump the database contents to a file through a latin1 connection. This reverses the mistake: the incorrectly stored data comes back out as the original bytes, which are valid UTF-8, the same way it has worked so far (read the aforelinked article for the gory details). You can then reimport the data into the database through a correctly set utf8 connection, and it will be stored as it should be.
  2. Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case forget it, the data is gone. Any character with no latin1 equivalent will already have been replaced by a ?.
  3. Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
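
The byte-level effect of the first two scenarios can be reproduced in a few lines of Python. This is only an illustration under stated assumptions, not a model of MySQL itself: Python's `latin-1` codec stands in for MySQL's latin1, and the sample strings and variable names are made up.

```python
original = "Bj\u00f6rn"  # "Björn"

# Scenario 1: the client sent UTF-8 bytes over a connection declared as latin1,
# so each byte was taken to be one Latin-1 character.
stored = original.encode("utf-8").decode("latin-1")

# Dumping through the same latin1 connection undoes that step and yields
# the original bytes, which are valid UTF-8 again.
recovered = stored.encode("latin-1").decode("utf-8")

# Scenario 2: a real conversion to latin1 replaces every character that
# latin1 cannot represent with "?", and that loss cannot be undone.
lossy = "Bj\u00f6rn \u65e5\u672c".encode("latin-1", errors="replace").decode("latin-1")

print(repr(stored), repr(recovered), repr(lossy))
```

Note that in the second case the ö survives because it exists in Latin-1; only the characters outside Latin-1 are lost.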

一江春梦 2025-01-12 20:56:59

There is a script on GitHub to help with this sort of thing.

写给空气的情书 2025-01-12 20:56:59

I would create a dump of the database and grep for all valid multi-byte UTF-8 sequences. Where to take it from there depends on what you get. There are multiple questions on SO about identifying invalid UTF-8; you can basically just reverse the logic.

Edit: So basically, any field consisting entirely of 7-bit ASCII is safe, and any field containing an invalid UTF-8 sequence can be assumed to be Latin-1. The remaining data should be inspected - if you are lucky, a handful of obvious substitutions will fix the absolute majority (replace Ã¶ with Latin-1 ö, etc).
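
The three-way split described in the edit can be sketched as a small Python function. This is a hedged sketch rather than a finished tool: the name `classify` and its return labels are made up, and it assumes you feed it the raw bytes of each field from the dump.

```python
def classify(raw: bytes) -> str:
    """Sort a field's raw bytes into 'ascii', 'latin1', or 'inspect'."""
    if all(b < 0x80 for b in raw):
        # 7-bit ASCII only: the bytes are identical in Latin-1 and UTF-8.
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Contains an invalid UTF-8 sequence, so assume Latin-1.
        return "latin1"
    # Valid UTF-8 with at least one multi-byte sequence: check by hand,
    # since Latin-1 text can occasionally also be valid UTF-8.
    return "inspect"


if __name__ == "__main__":
    for raw in (b"Bjorn", b"Bj\xf6rn", b"Bj\xc3\xb6rn"):
        print(raw, classify(raw))
```

Only the fields labelled `inspect` need a human look, which keeps the manual work proportional to the number of suspicious rows.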
