Python UTF-8 accent problem

Posted on 2024-11-28 09:28:03

I am having some problems with accents.

I wrote a Python script that gets the word "refeição" from some input (an IMAP fetch). The word is Portuguese, and I need to convert it into something human-readable. After decoding, it should appear as "refeição", but I am not getting that result:

>>> print a 
refeição
>>> ENCODING = locale.getpreferredencoding()
>>> print ENCODING
UTF-8
>>> print a.encode(ENCODING)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
>>> a.decode('utf-8')
u'refei\xe7\xe3o'
>>> print a.decode('utf-8')
refeição

Update:

root@ticuna:/etc/scripts# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Also, these words are inserted into a MySQL database, and the "unreadable" characters show up there the same way they do in the terminal.
The table collation is utf8_general_ci.

Comments (2)

冰葑 2024-12-05 09:28:03

It looks like your terminal window displays text in the single-byte ISO-8859-1 charset ("latin-1"), but your Python interpreter thinks the terminal is speaking UTF-8. We can see from u'refei\xe7\xe3o' that Python has the correct internal representation of the Portuguese letters. The print statement then converts that internal representation to UTF-8 and sends it to your terminal, which produces gibberish when the terminal interprets those UTF-8 bytes as ISO-8859-1.
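To make that mismatch concrete, here is a minimal sketch of the round trip described above. It is written in Python 3 syntax (the session in the question is Python 2), and the variable names are only illustrative:

```python
# The correct text, with the same code points as u'refei\xe7\xe3o' in the question.
word = "refei\u00e7\u00e3o"

# The bytes that get sent to the terminal when the locale says UTF-8.
utf8_bytes = word.encode("utf-8")

# What a terminal produces if it decodes those bytes as ISO-8859-1 instead:
# each two-byte UTF-8 sequence turns into two separate Latin-1 characters.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible as long as the bytes themselves were not altered.
repaired = garbled.encode("latin-1").decode("utf-8")
```

Here garbled is the familiar "refeiÃ§Ã£o", and byte 5 of utf8_bytes is the 0xc3 that the traceback in the question complains about.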

The fix is to make your locale match what your terminal is actually doing, either by changing the locale or by making sure your terminal is set to UTF-8.
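A quick way to see what Python itself assumes is sketched below (Python 3 syntax; report_encodings is a hypothetical helper name, not part of any library). Note that this only reports Python's side of the story. What the terminal emulator really does has to be checked in the terminal's own settings.

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python assumes for the locale and for stdout."""
    return {
        # Encoding derived from the LANG / LC_* environment variables.
        "locale": locale.getpreferredencoding(),
        # Encoding used when print writes text to stdout (may be None
        # if stdout has been replaced by an object without an encoding).
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(name, "=", value)
```

If both values say UTF-8 but accented characters still come out as two-character garbage, the terminal is the likely culprit.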

太阳男子 2024-12-05 09:28:03

As a workaround, I am removing all accents.

Here is the code that I used:

import unicodedata

def remove_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s.decode('utf-8')) if unicodedata.category(c) != 'Mn')

Based on this answer:
What is the best way to remove accents in a Python unicode string?
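For anyone on Python 3, a rough equivalent might look like the sketch below. Accepting either bytes or str is my own assumption for illustration, not something from the linked answer. Keep in mind that this is lossy: it hides the display problem rather than fixing the encoding mismatch.

```python
import unicodedata


def remove_accents(text):
    """Strip combining accent marks from text (lossy)."""
    # Assumption: raw input such as an IMAP payload arrives as UTF-8 bytes.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category 'Mn' (nonspacing mark) covers the combining accents; drop them.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
```

With this, remove_accents("refeição") gives "refeicao".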
