Trouble with UTF-8 characters; what I see is not what I stored
I tried to use UTF-8 and ran into trouble.

I have tried so many things; here are the results I have gotten:

- ???? instead of Asian characters. Even for European text, I got Se?or for Señor.
- Strange gibberish (Mojibake?), such as SeÃ±or for Señor, or a similar run of accented Latin letters and symbols in place of 新浪新闻.
- Black diamonds, such as Se�or.
- Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
- Even when I got text to look right, it did not sort correctly.

What am I doing wrong? How can I fix the code? Can I recover the data, and if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.

Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)

utf8mb4 is a superset of utf8 in that it handles 4-byte UTF-8 codes, which are needed by Emoji and some Chinese characters.

Outside of MySQL, "UTF-8" refers to encodings of all sizes, hence effectively the same as MySQL's utf8mb4, not utf8.

I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
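To make the utf8 versus utf8mb4 distinction concrete, here is a minimal Python sketch. It runs outside MySQL, and the helper names utf8_byte_length and fits_mysql_utf8 are made up for this illustration. It relies only on the fact that MySQL's utf8 stores at most 3 bytes per character:

```python
def utf8_byte_length(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes (the limit of MySQL's utf8)."""
    return all(utf8_byte_length(ch) <= 3 for ch in text)


# Sample strings: European text, Chinese text, and an Emoji.
samples = ["Señor", "新浪新闻", "👽"]
for s in samples:
    sizes = [utf8_byte_length(ch) for ch in s]
    verdict = "utf8 is enough" if fits_mysql_utf8(s) else "needs utf8mb4"
    # ascii() keeps the output printable even on a terminal that is not UTF-8.
    print(ascii(s), sizes, verdict)
```

Only the Emoji needs 4 bytes here, so it is the one sample that MySQL's utf8 cannot hold.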
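Overview of what you should do

- HTML forms should start like <form accept-charset="UTF-8">.
- Have the bytes you send encoded as UTF-8, and tell MySQL so on the connection (connection parameters or SET NAMES).
- Declare the column/table CHARACTER SET utf8mb4. (Check with SHOW CREATE TABLE.)
- Put <meta charset=UTF-8> at the beginning of the HTML.

UTF-8 all the way through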
More details for computer languages (and its following sections)
Test the data
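Viewing the data with a tool or with SELECT cannot be trusted. Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.

So, pick a table and column that has some non-English text and look at the stored bytes by wrapping the column in HEX(), for example SELECT col, HEX(col) FROM tbl WHERE ... (col and tbl are placeholders).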
The HEX for correctly stored UTF-8 will be
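- For a blank space (in any language): 20
- For English: 4x, 5x, 6x, or 7x
- For accented letters in most of Western Europe: Cxyy
- For Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
- For most of Asia: Exyyzz
- For Emoji and some Chinese characters: F0yyzzww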
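The same patterns can be checked outside MySQL. Here is a small Python sketch that prints the hex you should expect for a given piece of text, one group per character, so it can be compared against what HEX(col) returns. utf8_hex_groups is a hypothetical helper for this illustration, not a MySQL or library function:

```python
from typing import List


def utf8_hex_groups(text: str) -> List[str]:
    """Return the UTF-8 bytes of each character as an uppercase hex string."""
    return [ch.encode("utf-8").hex().upper() for ch in text]


# A space, an English letter, an accented letter, a Chinese character, an Emoji.
for sample in [" ", "S", "ñ", "新", "👽"]:
    print(ascii(sample), utf8_hex_groups(sample))

# Joined together, this is what HEX(col) should show for a correctly stored value.
print("".join(utf8_hex_groups("Señor")))
```

If the hex in the table differs from this for the same text, the stored bytes are not clean UTF-8, whatever the client displays.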
Specific causes and fixes of the problems seen
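Truncated text (Se for Señor):
- The bytes to be stored are not encoded as utf8/utf8mb4, even though the connection says they are. Fix this.
- Also, check that the connection during reading is UTF-8.

Black Diamonds with question marks (Se�or for Señor); one of these cases exists:

Case 1 (original bytes were not UTF-8):
- The bytes to be stored are not encoded as utf8. Fix this.
- The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
- Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Case 2 (original bytes were UTF-8):
- The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
- Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Black diamonds occur only when the browser is set to <meta charset=UTF-8>.

Question Marks (regular ones, not black diamonds) (Se?or for Señor):
- The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
- The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
- Also, check that the connection during reading is UTF-8.

Mojibake (SeÃ±or for Señor): (This discussion also applies to Double Encoding, which is not necessarily visible.)
- The bytes to be stored need to be UTF-8-encoded. Fix this.
- The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
- The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
- HTML should start with <meta charset=UTF-8>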
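The four visible symptoms can be reproduced without a database, which may help in recognizing which mismatch you have. This is only a Python sketch of the byte-level mechanics: it uses Python's latin-1 codec as a stand-in for MySQL's latin1, and the truncation step merely mimics cutting the value at the first invalid byte.

```python
original = "Señor"

# Mojibake: UTF-8 bytes are misread as latin1, so the 2 bytes of ñ become 2 characters.
mojibake = original.encode("utf-8").decode("latin-1")

# Black diamond: latin1 bytes are shown as UTF-8; the lone byte F1 is invalid
# and is displayed as the replacement character U+FFFD.
black_diamond = original.encode("latin-1").decode("utf-8", errors="replace")

# Question mark: the text is converted to a character set that has no ñ.
question_mark = original.encode("ascii", errors="replace").decode("ascii")

# Truncation: latin1 bytes are declared to be UTF-8 and everything from the
# first invalid byte onward is dropped.
raw = original.encode("latin-1")
try:
    truncated = raw.decode("utf-8")
except UnicodeDecodeError as exc:
    truncated = raw[:exc.start].decode("utf-8")

for label, value in [
    ("mojibake", mojibake),
    ("black diamond", black_diamond),
    ("question mark", question_mark),
    ("truncated", truncated),
]:
    # ascii() keeps the output printable on any terminal.
    print(label, ascii(value))
```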
If the data looks correct, but won't sort correctly, then
- either you have picked the wrong collation,
- or there is no collation that suits your need,
- or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above. For example, é should come back as C3A9, but a double-encoded é shows C383C2A9. That is, the hex is about twice as long as it should be. This is caused by converting from latin1 (or whatever) to utf8, then treating those bytes as if they were latin1 and repeating the conversion.
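The "convert twice" mechanism is easy to reproduce. The Python sketch below builds a double-encoded value and applies a rough heuristic to spot one. Both function names are invented for this illustration, and Python's latin-1 codec is used as a simplification: MySQL's latin1 behaves like cp1252 for bytes 80 to 9F, so the hex of real data can differ in that range.

```python
def double_encode(text: str) -> bytes:
    """Encode to UTF-8, misread those bytes as latin1, then encode to UTF-8 again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def looks_double_encoded(raw: bytes) -> bool:
    """Heuristic: undoing one latin1 round trip still yields valid, different UTF-8."""
    try:
        undone = raw.decode("utf-8").encode("latin-1")
        undone.decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return undone != raw


correct = "é".encode("utf-8")
doubled = double_encode("é")
print(correct.hex().upper(), looks_double_encoded(correct))
print(doubled.hex().upper(), looks_double_encoded(doubled))
```

The heuristic can give a false positive for text that legitimately contains sequences such as Ã followed by ©, so treat it as a hint and confirm with the hex.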
The sorting (and comparing) does not work correctly because it is, for example, sorting as if the string were SeÃ±or.

Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here: 5 different fixes for 5 different situations; pick carefully
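To illustrate why Mojibake is recoverable while Truncation and Question Marks are not, here is a client-side Python sketch of the reverse round trip. It is not one of the five fixes referred to above, which operate inside MySQL. repair_mojibake is a hypothetical helper, and it again assumes Python's latin-1 codec is close enough to MySQL's latin1 for the text at hand:

```python
from typing import Optional


def repair_mojibake(text: str) -> Optional[str]:
    """Undo one round of 'UTF-8 bytes misread as latin1'; None if not reversible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# Mojibake still contains every original byte, so it can be reversed.
print(ascii(repair_mojibake("SeÃ±or")))
# A question mark replaced the ñ; nothing is left to recover, the text is unchanged.
print(ascii(repair_mojibake("Se?or")))
# Correct text is not valid input for the repair, so the helper returns None.
print(ascii(repair_mojibake("Señor")))
```

Always try a repair on a copy of the data first and compare the hex before and after.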
Related: Illegal mix of collations
I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one: setting the connection character set with mysqli's set_charset().
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi (PHP mysqli set_charset() Function) when I was looking to solve an insert from an HTML query.
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update your database so that every CHARACTER SET and COLLATION is utf8mb4, or at least one that supports UTF-8 data.
For Java:
While making a JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL and it will work.
For Python:
Before querying the database, try enforcing the connection character set over the cursor (for example with SET NAMES utf8mb4).
If it does not work, happy hunting for the right solution.
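For the Python case, the idea could look roughly like the sketch below. It assumes a DB-API style cursor with an execute() method; force_utf8mb4 and RecordingCursor are names invented here, and RecordingCursor only stands in for a real cursor so that the example runs without a database. Drivers such as PyMySQL also accept charset="utf8mb4" when connecting, which is usually the cleaner place to set it.

```python
from typing import List


def force_utf8mb4(cursor) -> None:
    """Tell MySQL that this connection sends and expects utf8mb4."""
    cursor.execute("SET NAMES utf8mb4")


class RecordingCursor:
    """Stand-in for a real DB-API cursor; it only records the statements it receives."""

    def __init__(self) -> None:
        self.executed: List[str] = []

    def execute(self, operation: str) -> None:
        self.executed.append(operation)


# With a real driver you would pass connection.cursor() instead.
demo_cursor = RecordingCursor()
force_utf8mb4(demo_cursor)
print(demo_cursor.executed)
```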
Set your code IDE language to UTF-8.
Add <meta charset="utf-8"> to the header of the webpage whose form collects the data.
Check that your MySQL table definition declares a UTF-8 character set (utf8 or utf8mb4).
If you are using PDO, make sure the connection is opened with a UTF-8 charset as well.
If you already have a large database with the above problem, you can try SIDU to export with the correct charset and import back with UTF-8.
Depending on how the server is set up, you have to change the encoding accordingly. From what you said, utf8 should work best. However, if you're getting weird characters, it might help to change the webpage encoding to ANSI.
This helped me when I was setting up PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++