Trouble with UTF-8 characters; what I see is not what I stored

Posted 2025-02-04 21:19:24

I tried to use UTF-8 and ran into trouble.

I have tried so many things; here are the results I have gotten:

  • ???? instead of Asian characters. Even for European text, I got Se?or for Señor.
  • Strange gibberish (Mojibake?) such as Señor or æ–°æµªæ–°é—» for 新浪新闻.
  • Black diamonds, such as Se�or.
  • Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
  • Even when I got text to look right, it did not sort correctly.

What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?

Comments (5)

天涯离梦残月幽梦 2025-02-11 21:19:24

This problem plagues the participants of this site, and many others.

You have listed the five main cases of CHARACTER SET troubles.

Best Practice

Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)

utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some Chinese characters.

Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
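
To make the utf8 versus utf8mb4 distinction concrete, here is a minimal Python sketch that checks whether a string contains any character whose UTF-8 encoding is 4 bytes long. The helper name needs_utf8mb4 is made up for this illustration and is not part of MySQL or any library.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character in text takes 4 bytes in UTF-8.

    Such characters lie outside the Basic Multilingual Plane and
    do not fit in MySQL's 3-byte utf8 character set.
    """
    return any(len(ch.encode("utf-8")) == 4 for ch in text)


# "\U0001F47D" is an Emoji; "\U00020000" is a rare Chinese character.
for sample in ["Señor", "新浪新闻", "\U0001F47D", "\U00020000"]:
    print(repr(sample), needs_utf8mb4(sample))
```

Ordinary accented letters and common Chinese characters take 2 or 3 bytes, so only the last two samples need utf8mb4.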

I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.

Overview of what you should do

  • Have your editor, etc. set to UTF-8.
  • HTML forms should start like <form accept-charset="UTF-8">.
  • Have your bytes encoded as UTF-8.
  • Establish UTF-8 as the encoding being used in the client.
  • Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
  • <meta charset=UTF-8> at the beginning of HTML
  • Stored Routines acquire the current charset/collation. They may need rebuilding.

UTF-8 all the way through

More details for computer languages (and its following sections)

Test the data

Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do

SELECT col, HEX(col) FROM tbl WHERE ...

The HEX for correctly stored UTF-8 will be

  • For a blank space (in any language): 20
  • For English: 4x, 5x, 6x, or 7x
  • For most of Western Europe, accented letters should be Cxyy
  • Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
  • Most of Asia: Exyyzz
  • Emoji and some of Chinese: F0yyzzww
  • More details
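
The hex patterns listed above can be reproduced outside the database. This is a small Python sketch, assuming you just want to see what correctly encoded UTF-8 looks like so you can compare it with the HEX(col) output; the helper name utf8_hex and the sample characters are chosen only for illustration.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as upper-case hex, like MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()


samples = [
    ("space", " "),
    ("English", "A"),
    ("Western Europe", "ñ"),
    ("Cyrillic", "я"),
    ("Asia", "新"),
    ("Emoji", "\U0001F47D"),
]
for label, ch in samples:
    print(f"{label:15} {utf8_hex(ch)}")
```

If the hex coming back from the table does not match what this prints for the same text, the stored bytes are not plain UTF-8.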

Specific causes and fixes of the problems seen

Truncated text (Se for Señor):

  • The bytes to be stored are not encoded as utf8mb4. Fix this.
  • Also, check that the connection during reading is UTF-8.
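
The truncation case can be imitated in Python. This is only a simulation of the effect, not MySQL's actual code path: it assumes the client sends latin1 bytes while claiming they are UTF-8, and that the receiver keeps everything before the first invalid byte. The function name store_truncating is hypothetical.

```python
def store_truncating(raw: bytes) -> str:
    """Decode raw as UTF-8, keeping only the text before the first bad byte."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that is not valid UTF-8.
        return raw[:err.start].decode("utf-8")


latin1_bytes = "Señor".encode("latin-1")  # b'Se\xf1or': 0xF1 alone is not valid UTF-8
print(store_truncating(latin1_bytes))
```

The lone byte F1 cannot be interpreted as UTF-8, so everything from that point on is dropped and only "Se" survives.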

Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:

Case 1 (original bytes were not UTF-8):

  • The bytes to be stored are not encoded as utf8. Fix this.
  • The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
  • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Case 2 (original bytes were UTF-8):

  • The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
  • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
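
A black diamond is the Unicode replacement character U+FFFD, which a UTF-8 decoder shows in place of bytes it cannot interpret. The following Python sketch imitates what a browser set to UTF-8 does with latin1 bytes; it is an illustration of the effect, not of any browser's internals.

```python
# Bytes as a latin1 client would produce them: ñ is the single byte 0xF1.
latin1_bytes = "Señor".encode("latin-1")

# A UTF-8 decoder cannot interpret the lone 0xF1, so it substitutes U+FFFD.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(shown)
```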

Question Marks (regular ones, not black diamonds) (Se?or for Señor):

  • The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
  • The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
  • Also, check that the connection during reading is UTF-8.
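
Regular question marks appear when text is converted into a character set that has no code for a given character. The Python sketch below imitates that conversion; the function name to_charset is made up for the example, and Python's "replace" error handler stands in for the substitution the database performs.

```python
def to_charset(text: str, charset: str) -> bytes:
    """Encode text into charset, replacing unrepresentable characters with '?'."""
    return text.encode(charset, errors="replace")


print(to_charset("Señor", "ascii"))       # ñ does not exist in ASCII
print(to_charset("新浪新闻", "latin-1"))   # no Chinese characters in latin1
```

Once the question marks have been stored, the original characters are gone, which is why this case cannot be repaired afterwards.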

Mojibake (Señor for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)

  • The bytes to be stored need to be UTF-8-encoded. Fix this.
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
  • HTML should start with <meta charset=UTF-8>.
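
Mojibake is what you get when correct UTF-8 bytes are read as if each byte were one latin1 character. Here is a minimal Python sketch of that misreading; cp1252 is used as the stand-in for latin1 on the assumption that the client treats the two as equivalent, and the function name mojibake is invented for the example.

```python
def mojibake(text: str) -> str:
    """Encode text as UTF-8, then misread the bytes one by one as cp1252."""
    return text.encode("utf-8").decode("cp1252")


print(mojibake("Señor"))      # the 2 bytes of ñ become 2 separate characters
print(mojibake("新浪新闻"))    # each 3-byte character becomes 3 characters
```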

If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.

Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.

é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD

That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were Señor.
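
The two hex values quoted above can be reproduced with a short Python sketch. It assumes the "latin1" step behaves like Windows-1252 (cp1252); the function names double_encode and looks_double_encoded are hypothetical, and the detection is only a heuristic that can misjudge unusual but legitimate text.

```python
def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread those bytes as cp1252, then encode as UTF-8 again."""
    return text.encode("utf-8").decode("cp1252").encode("utf-8")


def looks_double_encoded(raw: bytes) -> bool:
    """Heuristic: True if undoing one cp1252 round trip yields different, valid UTF-8."""
    try:
        text = raw.decode("utf-8")
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return repaired != text


print(double_encode("é").hex().upper())
print(double_encode("\U0001F47D").hex().upper())
print(looks_double_encoded(bytes.fromhex("C383C2A9")))
```

Running looks_double_encoded over bytes taken from HEX(col) is one way to spot suspicious rows before deciding on a fix.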

Fixing the Data, where possible

For Truncation and Question Marks, the data is lost.

For Mojibake / Double Encoding, ...

For Black Diamonds, ...

The Fixes are listed here: 5 different fixes for 5 different situations; pick carefully

Related: Illegal mix of collations

孤单情人 2025-02-11 21:19:24

I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:

mysqli_set_charset($con,"utf8mb4");

After adding this line to my configuration file, everything works fine!

I found this solution for MySQLi (PHP mysqli set_charset() Function) when I was looking to solve an insert from an HTML query.

金橙橙 2025-02-11 21:19:24

I was also searching for the same issue. It took me nearly one month to find the appropriate solution.

First of all, you will have to update your database so that every CHARACTER SET and COLLATION is utf8mb4, or at least one that supports UTF-8 data.

For Java:

While making a JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL, and it will work.

For Python:

Before querying into the database, try enforcing this over the cursor

cursor.execute("SET NAMES utf8mb4")
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")

If it does not work, happy hunting for the right solution.

诗笺 2025-02-11 21:19:24

  1. Set your code IDE language to UTF-8

  2. Add <meta charset="utf-8"> to the header of the web page that contains your data collection form.

  3. Check your MySQL table definition looks like this:

     CREATE TABLE your_table (
       ...
     ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    
  4. If you are using PDO, make sure

    $options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
    $dbL = new PDO($pdo, $user, $pass, $options);
    

If you already have a large database with the above problem, you can try SIDU to export it with the correct charset and import it back as UTF-8.

花间憩 2025-02-11 21:19:24

Depending on how the server is set up, you have to change the encoding accordingly. utf8, from what you said, should work best. However, if you're getting weird characters, it might help to change the webpage encoding to ANSI.

This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++
