How can I predict errors when saving UTF-8 characters to a latin1 MySQL table via Django?
My setup is Python 3+ and Django 3.2 with MySQL 5.7 on an AWS Amazon Linux instance. When I originally created my database and tables, I did not specify a particular charset/encoding. So I read the following post and determined that my tables and columns are currently latin1: How do I see what character set a MySQL database / table / column is?
I have also read this post to try to understand the difference between the encoding the client uses and the encoding the table/database uses, which is what allows the client to save non-latin1 chars in a MySQL table with the latin1 charset: MySQL 'set names latin1' seems to cause data to be stored as utf8
Here is some code to show what I am trying to do:
# make a new object
mydata = Dataset()
# set the description. This has a few different non-latin1 characters:
# smart quotes, long dash, dots over the i
mydata.description = "“naïve—T-cells”"
# this returns an error to prove to myself that there are non-latin1 chars in the string
mydata.description.encode("latin-1")
# Traceback (most recent call last):
# File "<console>", line 1, in <module>
# UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 0:
# ordinal not in range(256)
# this works though (i.e. this string can be encoded using cp1252)
mydata.description.encode("cp1252")
# >>> b'\x93na\xefve\x97T-cells\x94'
# And, it is fine to save it to the mysql table (which has latin1 charset, but I
# believe this works since the client can handle non-latin1 as I read from above link)
# no error for this:
mydata.save()
# now I try again but with a different non-latin1 character (greater than or equal sign)
mydata.description = "≥4"
# both of these give an error as expected, since the >= character isn't in either charset
mydata.description.encode("latin-1")
mydata.description.encode("cp1252")
# I can't save this non-latin1 char to the database:
mydata.save()
# django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xE2\\x89\\xA54' for column 'description' at row 1")
My question is: why do some non-latin1 chars get saved without a problem, but other non-latin1 chars cause an "OperationalError Incorrect string value" when I try to insert them?
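To make the difference between the two strings concrete, here is a small sketch that runs without a database. It checks each character against Python's `latin-1` and `cp1252` codecs and shows that the bytes quoted in error 1366 are the UTF-8 encoding of "≥4". The helper name `can_encode` is made up for this illustration, and Python's codecs are only an approximation of what the MySQL server does.

```python
def can_encode(char, codec):
    """Return True if Python's codec can represent this single character."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Characters taken from the two descriptions in the question.
for char in "“ï—≥":
    print(
        f"U+{ord(char):04X}",
        "latin-1:", can_encode(char, "latin-1"),
        "cp1252:", can_encode(char, "cp1252"),
    )

# The bytes in the 1366 message are the UTF-8 form of the rejected string,
# which suggests the client connection is sending UTF-8 to the server.
utf8_bytes = "≥4".encode("utf-8")
print(utf8_bytes)  # b'\xe2\x89\xa54'
```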
I could probably solve the problem by changing the charset on the MySQL tables (Django charset and encoding), but I have my app deployed with several different customers, so this is kind of a challenge (understatement). Instead, I would like to add a step to the data loading process that checks for invalid characters rather than throwing an error, so that the user can fix the document before loading.
So, my practical question is: how do I know which non-latin1 characters will cause a problem and which are ok? Are all cp1252 characters allowed to be saved but anything beyond cp1252 not allowed?
How can I check what encoding my Django client is using? (I don't have anything related to charset or SET NAMES in the OPTIONS of my DATABASES setting in settings.py.)
Note: I don't want anything to alter the tables or require a migration. I want to prevent the errors by informing the users about bad chars.
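One way to inspect the encoding in use, sketched here under assumptions, is to ask the server for its `character_set_*` session variables over the same connection Django uses. The function name `connection_charsets` is hypothetical, and `FakeCursor` with its values is a made-up stand-in so the snippet runs without a database; it is not output from a real server.

```python
def connection_charsets(cursor):
    """Return the session's character_set_* variables as a dict.

    `cursor` is any DB-API style cursor connected to MySQL.
    """
    cursor.execute("SHOW VARIABLES LIKE 'character_set%'")
    return {name: value for name, value in cursor.fetchall()}


class FakeCursor:
    """Stand-in cursor for demonstration only; the rows are illustrative."""

    def execute(self, sql):
        self.last_sql = sql

    def fetchall(self):
        return [
            ("character_set_client", "utf8"),
            ("character_set_connection", "utf8"),
            ("character_set_database", "latin1"),
        ]


charsets = connection_charsets(FakeCursor())
print(charsets)

# With Django, the same function could be called on a real cursor:
#   from django.db import connection
#   with connection.cursor() as cursor:
#       print(connection_charsets(cursor))
```

If `character_set_client` differs from the column charset, the server converts text between the two on insert, which is where a character missing from the column charset would be rejected.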
Comments (1)
Go to the "mysql" command-line tool. Use it to run
SHOW CREATE TABLE tablename;
That will tell you the charsets (and collations) for the columns of that table.
SET NAMES latin1; declares that the client encoding is latin1, not cp1252, not UTF-8, etc.
\x93na\xefve\x97T-cells\x94 is the cp1256 or latin1 for “naïve—T-cells”. Hence, the SET should have helped.
(My answer in the link was referring to "double encoding" in item 7.)
So, if you are seeing ≥ in the client, then it is not latin1, and some of the things in your Question need further investigation. Here are the encodings where it will work. The bottom line is that you should use UTF-8 (MySQL's "utf8mb4") for everything.
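Where changing the tables is not an option, a pre-load check along the lines the question asks for could look like the sketch below. It assumes that the column charset is MySQL's latin1, that MySQL's latin1 corresponds to Windows-1252 (the MySQL manual describes it as "cp1252 West European"), and that the server rejects unconvertible characters rather than substituting them. The function name `find_unstorable_chars` is hypothetical. Python's `cp1252` codec may be slightly stricter than MySQL for a handful of rarely used control code points, so treat the result as a conservative approximation and verify it against a real table.

```python
def find_unstorable_chars(text, codec="cp1252"):
    """List characters in `text` that `codec` cannot encode.

    Returns (index, character, code point label) tuples so the user can
    be told exactly what to change before loading the document.
    """
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(codec)
        except UnicodeEncodeError:
            problems.append((index, char, f"U+{ord(char):04X}"))
    return problems


print(find_unstorable_chars("“naïve—T-cells”"))  # []
print(find_unstorable_chars("≥4"))  # [(0, '≥', 'U+2265')]
```

Reporting every offending position at once, rather than stopping at the first failure, lets the user fix the whole document in one pass.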