How to detect UTF-8 characters in a Latin1 encoded column - MySQL
I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.
At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.
Specifically, I want to check if I have UTF-8 characters in the Latin1 columns, what would be the best way to do this? If only a few rows are affected, then I can just fix this manually.
Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?
Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters?
e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);
Is this enough?
At the moment I have switched my MySQL client encoding to UTF-8.
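For the manual-fix case, Option 2 can also pull out the raw bytes alongside the length mismatch; a minimal sketch based on the clients/name example above (how much it catches depends on the character sets the column and the connection actually use):

-- Candidate rows where byte length and character count disagree,
-- with the raw bytes shown for manual inspection.
SELECT name,
       HEX(name)         AS raw_bytes,
       LENGTH(name)      AS byte_count,
       CHAR_LENGTH(name) AS char_count
FROM clients
WHERE LENGTH(name) <> CHAR_LENGTH(name);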
4 Answers
Character encoding, like time zones, is a constant source of problems.
What you can do is look for any "high-ASCII" characters, as these are either LATIN1 accented characters or symbols, or the first byte of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.

To figure out which encoding is correct, you just SELECT two different versions and compare visually. This is made unusually complicated because the MySQL regexp engine seems to ignore things like \x80, which makes it necessary to use the UNHEX() method instead.
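A query along those lines might look something like the following; a minimal sketch assuming the clients table and name column from the question, and the byte-oriented regexp engine of MySQL versions before 8.0 (the newer ICU-based engine is stricter about binary operands):

-- Rows containing any byte >= 0x80, decoded two ways so the two
-- interpretations can be compared by eye.
SELECT name,
       CONVERT(CONVERT(name USING BINARY) USING latin1)  AS as_latin1,
       CONVERT(CONVERT(name USING BINARY) USING utf8mb4) AS as_utf8
FROM clients
WHERE CONVERT(name USING BINARY)
      RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']');

If as_utf8 shows sensible accented text while as_latin1 shows pairs such as Ã© in the same places, the stored bytes are almost certainly already UTF-8; if as_latin1 looks right and as_utf8 comes out mangled, they are Latin-1.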
Since your question is not completely clear, let's assume some scenarios:
There is a script on github to help with this sort of a thing.
I would create a dump of the database and grep for all valid UTF8 sequences. Where to take it from there depends on what you get. There are multiple questions on SO about identifying invalid UTF8; you can basically just reverse the logic.
Edit: So basically, any field consisting entirely of 7-bit ASCII is safe, and any field containing an invalid UTF-8 sequence can be assumed to be Latin-1. The remaining data should be inspected - if you are lucky, a handful of obvious substitutions will fix the absolute majority (replace the UTF-8 ö with the Latin-1 ö, etc.).
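If it is more convenient to stay inside MySQL than to grep a dump, the same classification can be sketched in SQL. This is a rough sketch using the clients/name example from the question; it again relies on the pre-8.0 byte-based regexp engine, and the multi-byte pattern is only a heuristic rather than a full UTF-8 validator:

-- Fields that are pure 7-bit ASCII: safe under either encoding.
SELECT COUNT(*) AS ascii_only
FROM clients
WHERE CONVERT(name USING BINARY)
      NOT RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']');

-- Fields containing a plausible UTF-8 lead byte followed by a continuation
-- byte: likely already UTF-8. Fields with high bytes that never match this
-- pattern are probably Latin-1; anything ambiguous still needs eyeballing.
SELECT name, HEX(name) AS raw_bytes
FROM clients
WHERE CONVERT(name USING BINARY)
      RLIKE CONCAT('[', UNHEX('C2'), '-', UNHEX('F4'), ']',
                   '[', UNHEX('80'), '-', UNHEX('BF'), ']');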