Reading text with accents - Python

Posted on 2024-09-18 07:49:45

I wrote a Python script that connects to Gmail and prints the text of an email. However, my emails often contain words with accented characters, and that is where my problem lies.

For example, a text that I received, "PLANO DE S=C3=9ADE", should be printed as "PLANO DE SAÚDE".

How can I make my email text legible? What can I use to convert these accented letters?

Thanks,


The code suggested by Andrey works fine on Windows, but on Linux I still get the wrong output:

>>> b = 'PLANO DE S=C3=9ADE'
>>> s = b.decode('quopri').decode('utf-8')
>>> print s
PLANO DE SÃDE

Rafael,

Thanks, you are correct about the word; it was misspelled.
But the problem is still the same here. Another example:
Correct word: observação

>>> b = 'Observa=C3=A7=C3=B5es'
>>> s = b.decode('quopri').decode('utf-8')
>>> print s
Observações

I am using Debian with a UTF-8 locale:

$ locale
LANG=en_US.UTF-8

Andrey,

Thanks for your time. I agree with your explanation, but I still have the same problem here. Take a look at my test:

   >>> s = 'Observa=C3=A7=C3=B5es'
   >>> s2 = s.decode('quopri').decode('utf-8')
   >>> print s
   Observa=C3=A7=C3=B5es
   >>> print s2
   Observações
   >>> import locale
   >>> ENCODING = locale.getpreferredencoding()
   >>> print s.encode(ENCODING)
   Observa=C3=A7=C3=B5es
   >>> print s2.encode(ENCODING)
   Observações
   >>> print ENCODING
   UTF-8


Comments (2)

九命猫 2024-09-25 07:49:45


This encoding is called quoted-printable. In your example, a text string (Python's unicode type) was first encoded to UTF-8 bytes (Python's str type), and those bytes were then encoded as quoted-printable. So the right way to get the string value back is to undo both steps:

>>> b = 'PLANO DE S=C3=9ADE'
>>> s = b.decode('quopri').decode('utf-8')
>>> print s
PLANO DE SÚDE
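
For readers on Python 3, where `str.decode('quopri')` is no longer available, the same two-step decode can be sketched with the standard `quopri` module. The second half assumes the text arrives as a raw message carrying a `Content-Transfer-Encoding: quoted-printable` header, which the question implies but does not show. The helper name `decode_qp` and the sample message are made up for illustration.

```python
import email
import quopri


def decode_qp(raw, charset="utf-8"):
    """Undo quoted-printable, then decode the resulting bytes as text."""
    return quopri.decodestring(raw).decode(charset)


# Same sample as above: =C3=9A is the UTF-8 encoding of U+00DA (Ú).
text = decode_qp(b"PLANO DE S=C3=9ADE")

# Hypothetical minimal message; a real one would come from the mail server.
raw_message = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Observa=C3=A7=C3=B5es"
)
msg = email.message_from_string(raw_message)

# decode=True asks the email package to undo the transfer encoding;
# the result is still bytes, so the charset decode remains our job.
payload = msg.get_payload(decode=True)
body = payload.decode(msg.get_content_charset() or "utf-8")
```

After this runs, `text` holds 'PLANO DE SÚDE' and `body` holds 'Observações' as ordinary Python 3 strings.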

Update: There might be some issues with the console encoding, though. s holds a fully correct Unicode string value (of Python type unicode). But when you use the print statement, the value must be converted to bytes (Python's str) in order to be written to OS file descriptor number 1 (the standard output pipe). So the print statement implementation checks your console encoding, makes some guesses, and prints the result. In fact, in Python 2 the results differ between printing from the interactive shell, running your process non-interactively, and running your process with the output redirected to a file.
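
One plausible explanation for output like `SÃDE` (an assumption, since the terminal settings are not shown in the question) is that correct UTF-8 bytes reach a terminal that interprets them as Latin-1. The sketch below, written for Python 3, reproduces that effect without involving a console at all.

```python
# The correctly decoded string, with U+00DA written as an escape.
text = "PLANO DE S\u00daDE"

# These are the bytes a UTF-8 stdout would emit for it.
utf8_bytes = text.encode("utf-8")

# A terminal set to Latin-1 would interpret those same bytes like this.
misread = utf8_bytes.decode("latin-1")

# U+00DA has become two characters: U+00C3 (shown as "Ã") followed by
# U+009A, a C1 control character that most terminals do not display.
extra_chars = len(misread) - len(text)
```

If this is what is happening, the string itself is fine and only the display layer needs fixing.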

There is no agreed-upon best way to output encoded strings in Python 2. The two approaches that make the most sense are:

1) Use the locale's encoding guess and encode strings manually.

import locale
ENCODING = locale.getpreferredencoding()

print s.encode(ENCODING)

2) Use an encoding option (command-line, hard-coded or whatever).

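import sys
from getopt import getopt

ENCODING = 'UTF-8'  # default, used when --encoding is not given
# Accept a long option such as --encoding=latin-1
opts, args = getopt(sys.argv[1:], '', ['encoding='])
for opt, arg in opts:
    if opt == '--encoding':
        ENCODING = arg

print s.encode(ENCODING)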
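
The two approaches can be folded into one small helper. This is only a sketch, written for Python 3 rather than the Python 2 used above, and `encode_for_output` is a hypothetical name: an explicit encoding wins, and the locale's guess is the fallback.

```python
import locale


def encode_for_output(text, encoding=None):
    """Encode text with an explicit encoding, or fall back to the locale's."""
    if encoding is None:
        # Passing False avoids calling setlocale() as a side effect.
        encoding = locale.getpreferredencoding(False)
    # errors="replace" substitutes "?" when the target encoding lacks a character,
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace")


utf8 = encode_for_output("Observações", "utf-8")
ascii_only = encode_for_output("Observações", "ascii")
```

Here `utf8` is `b'Observa\xc3\xa7\xc3\xb5es'`, while `ascii_only` degrades to `b'Observa??es'`, which shows what an unsuitable target encoding costs.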

Update 2: If nothing helps and you are still sure that your console encoding and font are set up for UTF-8, then try this:

import sys, os
ENCODING = 'UTF-8'
stdout = os.fdopen(sys.stdout.fileno(), 'wb')
s = u'привет'  # Don't forget to use a Unicode literal starting with u''
stdout.write(s.encode(ENCODING))

At this point you should see the Russian word привет in Cyrillic characters in your console :)

If this is the case, then you should use this binary stdout instead of the normal sys.stdout.
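
In Python 3 the binary layer is already exposed as `sys.stdout.buffer`, so the `os.fdopen` step is not needed there. A rough equivalent of the test above, with the hypothetical helper `write_encoded` and an in-memory stream standing in for the console so the bytes can be inspected:

```python
import io
import sys


def write_encoded(text, stream=None, encoding="utf-8"):
    """Write text as bytes in a known encoding, bypassing print's own choice."""
    if stream is None:
        stream = sys.stdout.buffer  # binary layer under sys.stdout (Python 3)
    data = text.encode(encoding) + b"\n"
    stream.write(data)
    stream.flush()
    return data


# Demonstrated against an in-memory stream rather than the real console.
buf = io.BytesIO()
written = write_encoded("привет", buf)
```

Calling `write_encoded("привет")` with no stream sends the same bytes to the real standard output.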

谁的新欢旧爱 2024-09-25 07:49:45


Your string is wrong. Look:

'PLANO DE S=C3=9ADE' == 'PLANO DE S\xc3\x9aDE'

Where is the missing "A" in SAÚDE?

If you decode 'PLANO DE S=C3=9ADE' as quoted-printable, you get only 'PLANO DE SÚDE'.

Running this code here on Linux (Ubuntu 9.10):

>>> b = 'PLANO DE S=C3=9ADE'
>>> s = b.decode('quopri').decode('utf-8')
>>> print s
PLANO DE SÚDE
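
The same check can be run in the other direction. The sketch below (Python 3, standard `quopri` module) encodes the intended phrase to show what the quoted-printable text would look like if the "A" had been present; the variable names are illustrative.

```python
import quopri

# What was actually received, decoded in two steps.
received = b"PLANO DE S=C3=9ADE"
decoded = quopri.decodestring(received).decode("utf-8")

# Encoding the intended phrase shows where the "A" would have been.
intended = "PLANO DE SA\u00daDE"
expected_wire = quopri.encodestring(intended.encode("utf-8"))

# Decoding that again gives back the intended phrase unchanged.
round_trip = quopri.decodestring(expected_wire).decode("utf-8")
```

The encoded form of the intended phrase contains `SA=C3=9ADE`, one character longer than the `S=C3=9ADE` in the question, so the letter was already missing before any decoding took place.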