Python UTF-8 accent issues
I am having some problems with accents.
I wrote a Python script that gets the word "refeição" from some input (an IMAP fetch). The word is Portuguese, and I need to convert it into something human-readable. After decoding, it should appear as "refeição", but I am not getting this result...
>>> print a
refeição
>>> ENCODING = locale.getpreferredencoding()
>>> print ENCODING
UTF-8
>>> print a.encode(ENCODING)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
>>> a.decode('utf-8')
u'refei\xe7\xe3o'
>>> print a.decode('utf-8')
refeição
Updated:
root@ticuna:/etc/scripts# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Also, these words are inserted into a MySQL database, and the "unreadable" characters show up there the same way they do in the terminal.
The table collation is utf8_general_ci
Comments (2)
It looks like your terminal window displays text in the single-byte ISO-8859-1 charset ("latin-1"), but your Python interpreter thinks the terminal is speaking UTF-8. We can see from
u'refei\xe7\xe3o'
that Python has the correct internal representation of the Portuguese letters. Apparently, the print command then converts the internal representation to UTF-8 and sends it to your terminal, which produces gibberish when the terminal interprets that UTF-8 as ISO-8859-1. The fix is to make your locale match what your terminal is doing, either by changing the locale or by making sure your terminal is set to UTF-8.
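The mismatch described here can be reproduced without a terminal. This is a small Python 3 sketch, not the asker's setup: it encodes the word as UTF-8 and then reads those bytes back as Latin-1, which is what a Latin-1 terminal would effectively do. The variable names are illustrative only.

```python
import sys

# The correctly decoded word (same code points as u'refei\xe7\xe3o').
text = "refei\u00e7\u00e3o"

# What print sends to a terminal that Python believes is UTF-8.
utf8_bytes = text.encode("utf-8")

# What a terminal using ISO-8859-1 would show for those same bytes.
mojibake = utf8_bytes.decode("latin-1")

# The garbling is reversible as long as no bytes were lost.
repaired = mojibake.encode("latin-1").decode("utf-8")

# The encoding Python assumes for standard output in this process.
print(sys.stdout.encoding)
print(mojibake)
print(repaired)
```

If `sys.stdout.encoding` reports UTF-8 while the terminal emulator is configured for another charset, the two-character sequences in `mojibake` are the expected symptom.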
As a workaround, I am removing all accents.
Here is the code that I used:
Based on this answer:
What is the best way to remove accents in a Python unicode string?
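The code from this comment did not survive in the text above. As a hedged sketch only, and not necessarily what the commenter used, one common way to strip accents relies on the standard `unicodedata` module: decompose each character into a base letter plus combining marks, then drop the marks. The function name `strip_accents` is hypothetical, and the example is written for Python 3.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = "refei\u00e7\u00e3o"
plain = strip_accents(word)
print(plain)
```

Note that this changes the data (the stored word is no longer spelled correctly in Portuguese), so it sidesteps the encoding mismatch rather than fixing it.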