Trouble with UTF-8 characters; what I see is not what I stored

Posted on 2025-01-09 04:40:46 · 461 words · 0 views · 0 comments

I tried to use UTF-8 and ran into trouble.

I have tried so many things; here are the results I have gotten:

  • ???? instead of Asian characters. Even for European text, I got Se?or for Señor.
  • Strange gibberish (Mojibake?) such as SeÃ±or or æ–°æµªæ–°é—» for 新浪新闻.
  • Black diamonds, such as Se�or.
  • Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
  • Even when I got text to look right, it did not sort correctly.

What am I doing wrong? How can I fix the code? Can I recover the data and, if so, how?

Comments (5)

还不是爱你 2025-01-16 04:40:46

This problem plagues the participants of this site, and many others.

You have listed the five main cases of CHARACTER SET troubles.

Best Practice

Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (MySQL 8.0 adds newer Unicode collations, such as utf8mb4_0900_ai_ci.)

utf8mb4 is a superset of utf8 in that it also handles 4-byte UTF-8 codes, which are needed for Emoji and some Chinese characters.

Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.

I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.

Overview of what you should do

  • Have your editor, etc. set to UTF-8.
  • HTML forms should start like <form accept-charset="UTF-8">.
  • Have your bytes encoded as UTF-8.
  • Establish UTF-8 as the encoding being used in the client.
  • Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
  • <meta charset=UTF-8> at the beginning of HTML
  • Stored Routines acquire the current charset/collation. They may need rebuilding.

UTF-8 all the way through

More details for computer languages (and its following sections)

Test the data

Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do

SELECT col, HEX(col) FROM tbl WHERE ...

The HEX for correctly stored UTF-8 will be

  • For a blank space (in any language): 20
  • For English: 4x, 5x, 6x, or 7x
  • For most of Western Europe, accented letters should be Cxyy
  • Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
  • Most of Asia: Exyyzz
  • Emoji and some of Chinese: F0yyzzww
  • More details
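
To see where these hex patterns come from without touching a database, here is a minimal Python sketch. The helper name utf8_hex and the sample characters are chosen only for illustration; the output is meant to mimic what HEX(col) shows for correctly stored text.

```python
def utf8_hex(text: str) -> str:
    """Return the uppercase hex of the UTF-8 bytes, similar to MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()

# One sample character per row of the list above (illustrative choices).
samples = {
    "space": " ",
    "English letter": "A",
    "Western European accent": "é",
    "Cyrillic": "Ж",
    "Chinese": "新",
    "Emoji": "👽",
}

for label, ch in samples.items():
    print(f"{label:25} {utf8_hex(ch)}")
```

The first hex digit of each character tells you how many bytes it takes: 2x through 7x is one byte, Cx or Dx is two, Ex is three, and F0 is four.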

Specific causes and fixes of the problems seen

Truncated text (Se for Señor):

  • The bytes to be stored are not encoded as utf8mb4. Fix this.
  • Also, check that the connection during reading is UTF-8.
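
A rough Python model of how truncation can arise, assuming the client sent latin1 bytes while the connection claimed they were UTF-8 and the server quietly kept only the valid prefix. The function store_as_utf8 is hypothetical, and what MySQL actually does (truncate with a warning, or reject the row) depends on the server's sql_mode.

```python
def store_as_utf8(raw: bytes) -> str:
    """Keep the longest valid UTF-8 prefix and silently drop the rest."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that is not valid UTF-8.
        return raw[:err.start].decode("utf-8")

latin1_bytes = "Señor".encode("latin-1")   # b'Se\xf1or': F1 is not valid here
stored = store_as_utf8(latin1_bytes)
print(stored)
```

Everything from the first invalid byte onward is gone, which is why this case cannot be repaired after the fact.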

Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:

Case 1 (original bytes were not UTF-8):

  • The bytes to be stored are not encoded as utf8. Fix this.
  • The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
  • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Case 2 (original bytes were UTF-8):

  • The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
  • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
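
The black diamond is the Unicode replacement character U+FFFD, which a UTF-8 decoder substitutes for bytes it cannot interpret. A small Python sketch of the same effect, assuming the bytes reaching the browser are latin1 while the page is declared UTF-8:

```python
# latin1 encodes ñ as the single byte F1, which is not valid UTF-8 on its own.
latin1_bytes = "Señor".encode("latin-1")

# A UTF-8 decoder in "replace" mode behaves roughly like a browser that was
# told the page is UTF-8: each undecodable sequence becomes U+FFFD.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(shown)
```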

Question Marks (regular ones, not black diamonds) (Se?or for Señor):

  • The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
  • The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
  • Also, check that the connection during reading is UTF-8.
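
Regular question marks typically appear when text is converted into a character set that has no code for the character. The sketch below imitates that conversion in Python; to_charset is a hypothetical helper, and the target charsets are assumptions picked to reproduce the symptoms from the question.

```python
def to_charset(text: str, charset: str) -> bytes:
    """Convert text to a charset, substituting '?' for unrepresentable characters."""
    return text.encode(charset, errors="replace")

print(to_charset("Señor", "ascii"))       # ñ has no ASCII code
print(to_charset("新浪新闻", "latin-1"))    # no Chinese characters in latin1
```

Once the question marks are stored, the original characters are not recoverable; the byte 3F carries no trace of what it replaced.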

Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)

  • The bytes to be stored need to be UTF-8-encoded. Fix this.
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
  • HTML should start with <meta charset=UTF-8>.
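
Mojibake is what correct UTF-8 bytes look like when something reads them as a single-byte charset. A minimal Python sketch, assuming the misreading charset is cp1252 (which is essentially what MySQL calls latin1); the function name mojibake is illustrative only:

```python
def mojibake(text: str, wrong_charset: str = "cp1252") -> str:
    """Encode as UTF-8, then misread the bytes as a single-byte charset."""
    return text.encode("utf-8").decode(wrong_charset)

print(mojibake("Señor"))
print(mojibake("新浪新闻"))
```

Each non-ASCII character turns into two or three odd-looking characters, one per UTF-8 byte.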

If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.

Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.

é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD

That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were SeÃ±or.
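
The two hex values above can be reproduced in Python, which also shows that the round trip is reversible in principle. This is only a sketch: the function names are hypothetical, cp1252 stands in for MySQL's latin1, and it is not a substitute for the SQL fixes that repair the table itself.

```python
def double_encode(text: str, wrong: str = "cp1252") -> bytes:
    """Encode to UTF-8, misread those bytes as `wrong`, then encode to UTF-8 again."""
    once = text.encode("utf-8")
    return once.decode(wrong).encode("utf-8")

def undo_double_encode(stored: bytes, wrong: str = "cp1252") -> str:
    """Reverse one round of double encoding."""
    return stored.decode("utf-8").encode(wrong).decode("utf-8")

for ch in ("é", "👽"):
    stored = double_encode(ch)
    print(ch, stored.hex().upper(), undo_double_encode(stored))
```

Comparing the length of the double-encoded hex with the correct hex is a quick way to tell double encoding apart from the other four cases.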

Fixing the Data, where possible

For Truncation and Question Marks, the data is lost.

For Mojibake / Double Encoding, ...

For Black Diamonds, ...

The Fixes are listed here: 5 different fixes for 5 different situations; pick carefully

Related: Illegal mix of collations

红尘作伴 2025-01-16 04:40:46

I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:

mysqli_set_charset($con,"utf8mb4");

After adding this line to my configuration file, everything works fine!

I found this solution for MySQLi (PHP mysqli set_charset() Function) when I was looking to solve an insert from an HTML query.

吃兔兔 2025-01-16 04:40:46

I was also searching for the same issue. It took me nearly one month to find the appropriate solution.

First of all, you will have to update your database so that every CHARACTER SET and COLLATION is utf8mb4, or at least one that supports UTF-8 data.

For Java:

When making a JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL.

For Python:

Before querying the database, try enforcing this on the cursor:

cursor.execute("SET NAMES utf8mb4")
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")

If it does not work, happy hunting for the right solution.

打小就很酷 2025-01-16 04:40:46
  1. Set your code IDE's file encoding to UTF-8.

  2. Add <meta charset="utf-8"> to the header of the webpage containing the form that collects the data.

  3. Check that your MySQL table definition looks like this:

     CREATE TABLE your_table (
       ...
     ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    
  4. If you are using PDO, make sure the connection sets the charset:

    $options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
    $dbL = new PDO($pdo, $user, $pass, $options);
    

If you already have a large database with the above problem, you can try using SIDU to export it with the correct charset and import it back as UTF-8.

泅人 2025-01-16 04:40:46

Depending on how the server is set up, you have to change the encoding accordingly. From what you said, utf8 should work best. However, if you're getting weird characters, it might help to change the webpage encoding to ANSI.

This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++
