How to predict errors when saving UTF-8 characters to a latin1 MySQL table via Django?

Posted 2025-01-17 12:33:36

My setup is Python 3, Django 3.2 and MySQL 5.7 on an AWS Amazon Linux instance. When I originally created my database and tables, I did not specify a particular charset/encoding. So I read the following post and determined that my tables and columns are currently latin1: How do I see what character set a MySQL database / table / column is?

I have also read this post to try to understand the difference between the encoding the client uses and the one the table/database uses -- this is what allows the client to save non-latin1 chars in a MySQL table with the latin1 charset:
MySQL 'set names latin1' seems to cause data to be stored as utf8

Here is some code to show what I am trying to do:

# make a new object (Dataset is a Django model with a text field "description")
mydata = Dataset()
# set the description. This has a few different non-latin1 characters:
#    smart quotes, long dash, dots over the i
mydata.description = "“naïve—T-cells”"

# this raises an error, proving to myself that there are non-latin1 chars in the string
mydata.description.encode("latin-1")
# Traceback (most recent call last):
#  File "<console>", line 1, in <module>
# UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 0:
#      ordinal not in range(256)

# this works though (i.e. this string can be encoded using cp1252)
mydata.description.encode("cp1252")
# >>>    b'\x93na\xefve\x97T-cells\x94'

# And it is fine to save it to the mysql table (which has latin1 charset, but I
# believe this works since the client can handle non-latin1, as I read in the above link)
# no error for this:
mydata.save()

# now I try again but with a different non-latin1 character (greater than or equal sign)
mydata.description = "≥4"

# both of these raise UnicodeEncodeError as expected, since the >= character isn't in either charset
mydata.description.encode("latin-1")
mydata.description.encode("cp1252")

# I can't save this non-latin1 char to the database:
mydata.save()
# django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xE2\\x89\\xA54' for column 'description' at row 1")

My question is: why do some non-latin1 chars get saved without a problem, while other non-latin1 chars cause an "OperationalError: Incorrect string value" when I try to insert them?

I could probably solve the problem by changing the charset on the mysql tables (Django charset and encoding), but I have my app deployed with several different customers and so this is kind of a challenge (understatement). Instead, I would like to create a step in the data loading process which checks for invalid characters rather than throwing an error so that the user can make the change to the document before loading.

So, my practical question is: how do I know which non-latin1 characters will cause a problem and which are ok? Are all cp1252 characters allowed to be saved but anything beyond cp1252 not allowed?

How can I check what encoding my Django client is using? (I don't have anything related to charset or SET NAMES in the OPTIONS of my DATABASES setting in settings.py.)

Note: I don't want anything to alter the tables or require a migration. I want to prevent the errors by informing the users about bad chars.


翻了热茶 2025-01-24 12:33:36

Go to the "mysql" command-line tool and run SHOW CREATE TABLE tablename; That will tell you the charsets (and collations) of that table's columns.

SET NAMES latin1; declares that the client encoding is latin1, not cp1252, not UTF-8, etc.
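To see which encoding the Django connection has actually declared, one option is to read the session's character_set_* variables through the connection itself. This is a minimal sketch: the helper name `session_charsets` is made up for illustration, and it only assumes a DB-API cursor such as the one returned by Django's `connection.cursor()`.

```python
def session_charsets(cursor):
    """Return MySQL's character_set_* session variables as a dict."""
    # character_set_client / _connection / _results describe what the
    # client declared (the effect of SET NAMES); _database and _server
    # describe the defaults on the server side.
    cursor.execute("SHOW VARIABLES LIKE 'character_set%'")
    return {name: value for name, value in cursor.fetchall()}


# Hypothetical usage from `python manage.py shell`:
#   from django.db import connection
#   with connection.cursor() as cursor:
#       for name, value in sorted(session_charsets(cursor).items()):
#           print(name, value)
```

If character_set_client reports a UTF-8 charset while the column is latin1, the server converts each value on insert, which would fit the behaviour in the question: characters that exist in the column's charset go through, and the rest are rejected.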

\x93na\xefve\x97T-cells\x94 is the cp1252 encoding (which is what MySQL calls latin1) of “naïve—T-cells”. Hence, SET NAMES latin1 should have helped.

latin1 hex:       936E61EF766597542D63656C6C7394
utf8 hex:         E2809C6E61C3AF7665E28094542D63656C6C73E2809D
'double-encoded': C3A2E282ACC5936E61C383C2AF7665C3A2E282ACE2809D542D63656C6C73C3A2E282ACC29D

(My answer in the link was referring to "double encoding" in item 7.)
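The three hex strings above can be reproduced in Python. This is a sketch under one assumption: MySQL's latin1 behaves like cp1252, except that the five bytes cp1252 leaves undefined map to the code points with the same number. The helper `decode_like_mysql_latin1` is a hypothetical name used only to model that behaviour.

```python
def decode_like_mysql_latin1(raw: bytes) -> str:
    """Decode bytes as cp1252, falling back to ISO-8859-1 for the five
    bytes that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D)."""
    chars = []
    for value in raw:
        single = bytes([value])
        try:
            chars.append(single.decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(single.decode("latin-1"))
    return "".join(chars)


text = "“naïve—T-cells”"

# What a latin1 column stores when the conversion is done correctly.
latin1_hex = text.encode("cp1252").hex().upper()

# What the client sends when its connection charset is UTF-8.
utf8_hex = text.encode("utf-8").hex().upper()

# "Double encoding": UTF-8 bytes misread as latin1, then encoded as UTF-8 again.
misread = decode_like_mysql_latin1(text.encode("utf-8"))
double_hex = misread.encode("utf-8").hex().upper()

print("latin1 hex:      ", latin1_hex)
print("utf8 hex:        ", utf8_hex)
print("'double-encoded':", double_hex)
```

The fallback matters for the closing quote: its UTF-8 bytes end in 0x9D, which Python's cp1252 codec refuses to decode.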

E289A5 is utf8 for `≥`, which _cannot_ be properly encoded in latin1.

So, if you are seeing ≥ in the client, then it is not latin1, and some of the things in your Question need further investigation. Here are the encodings in which it will work.

                    binary, utf8mb4, utf8  E289A5
                                    euckr  A1C3
                     gb18030, gb2312, gbk  A1DD
                                  keybcs2  F2
                             koi8r, koi8u  99
                          macce, macroman  B3

The bottom line is that you should use UTF-8 (MySQL's "utf8mb4") for everything.
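Where the tables have to stay latin1, a check before loading could flag the characters the column cannot hold. The sketch below assumes the target columns are MySQL latin1 and that Python's cp1252 codec is a close enough stand-in for it. The function name `find_unencodable` is hypothetical.

```python
def find_unencodable(text, encoding="cp1252"):
    """Return (index, character) pairs for every character in `text`
    that cannot be represented in `encoding`."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, char))
    return problems


for sample in ["“naïve—T-cells”", "≥4"]:
    bad = find_unencodable(sample)
    if bad:
        details = ", ".join(f"{char!r} at position {index}" for index, char in bad)
        print(f"{sample!r}: cannot be stored in latin1: {details}")
    else:
        print(f"{sample!r}: OK")
```

This check errs slightly on the strict side. Python's cp1252 codec rejects the control characters U+0081, U+008D, U+008F, U+0090 and U+009D, which MySQL's latin1 is understood to accept. That rarely matters for text loaded from documents.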
