Python电子邮件引用-可打印编码问题

发布于 2025-01-10 21:59:16 字数 1210 浏览 0 评论 0原文

我使用以下命令从 Gmail 中提取电子邮件:

def getMsgs():
 try:
    conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
  except:
    print 'Failed to connect'
    print 'Is your internet connection working?'
    sys.exit()
  try:
    conn.login(username, password)
  except:
    print 'Failed to login'
    print 'Is the username and password correct?'
    sys.exit()

  conn.select('Inbox')
  # typ, data = conn.search(None, '(UNSEEN SUBJECT "%s")' % subject)
  typ, data = conn.search(None, '(SUBJECT "%s")' % subject)
  for num in data[0].split():
    typ, data = conn.fetch(num, '(RFC822)')
    msg = email.message_from_string(data[0][1])
    yield walkMsg(msg)

def walkMsg(msg):
  for part in msg.walk():
    if part.get_content_type() != "text/plain":
      continue
    return part.get_payload()

但是,我收到的一些电子邮件几乎不可能从与编码相关的字符(例如“=”)中提取日期(使用正则表达式),这些字符随机落在各个文本字段的中间。这是一个出现在我想要提取的日期范围内的示例:

姓名:KIRSTI 电子邮件: [电子邮件受保护] 电话号码:+ 999 99995192 队伍总数: 4 总数, 0 儿童抵达/出发:10 月 9 日= , 2010年 - 2010年10月13日 - 2010年10月13日

有没有办法删除这些编码字符?

I am extracting emails from Gmail using the following:

def getMsgs():
 try:
    conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
  except:
    print 'Failed to connect'
    print 'Is your internet connection working?'
    sys.exit()
  try:
    conn.login(username, password)
  except:
    print 'Failed to login'
    print 'Is the username and password correct?'
    sys.exit()

  conn.select('Inbox')
  # typ, data = conn.search(None, '(UNSEEN SUBJECT "%s")' % subject)
  typ, data = conn.search(None, '(SUBJECT "%s")' % subject)
  for num in data[0].split():
    typ, data = conn.fetch(num, '(RFC822)')
    msg = email.message_from_string(data[0][1])
    yield walkMsg(msg)

def walkMsg(msg):
  for part in msg.walk():
    if part.get_content_type() != "text/plain":
      continue
    return part.get_payload()

However, some emails I get are nigh impossible for me to extract dates (using regex) from as encoding-related chars such as '=', randomly land in the middle of various text fields. Here's an example where it occurs in a date range I want to extract:

Name: KIRSTI Email:
[email protected] Phone #: + 999
99995192 Total in party: 4 total, 0
children Arrival/Departure: Oct 9=
,
2010 - Oct 13, 2010 - Oct 13, 2010

Is there a way to remove these encoding characters?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

蓦然回首 2025-01-17 21:59:16

如果您使用的是Python3.6或更高版本,则可以使用 email.message.Message.get_content() 方法自动解码文本。此方法取代了 get_payload(),但 get_payload() 仍然可用。

假设您有一个包含此电子邮件的字符串 s (基于 文档中的示例):

Subject: Ayons asperges pour le =?utf-8?q?d=C3=A9jeuner?=
From: =?utf-8?q?Pep=C3=A9?= Le Pew <[email protected]>
To: Penelope Pussycat <[email protected]>,
 Fabrette Pussycat <[email protected]>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

    Salut!

    Cela ressemble =C3=A0 un excellent recipie[1] d=C3=A9jeuner.

    [1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718

    --Pep=C3=A9
   =20

字符串中的非 ascii 字符已使用 quoted-printable 编码进行编码,如 Content-Transfer 中指定-编码标头。

创建邮件对象:

import email
from email import policy

msg = email.message_from_string(s, policy=policy.default)

这里需要设置策略;否则 policy.compat32 ,它返回没有 get_content 方法的旧 Message 实例。 policy.default 最终将成为默认策略,但从 Python3.7 开始,它仍然是 policy.compat32

get_content() 方法自动处理解码:

print(msg.get_content())

Salut!

Cela ressemble à un excellent recipie[1] déjeuner.

[1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718

--Pepé

如果您有多部分消息,则需要对各个部分调用 get_content(),如下所示:

for part in message.iter_parts():
    print(part.get_content())

If you are using Python3.6 or later, you can use the email.message.Message.get_content() method to decode the text automatically. This method supersedes get_payload(), though get_payload() is still available.

Say you have a string s containing this email message (based on the examples in the docs):

Subject: Ayons asperges pour le =?utf-8?q?d=C3=A9jeuner?=
From: =?utf-8?q?Pep=C3=A9?= Le Pew <[email protected]>
To: Penelope Pussycat <[email protected]>,
 Fabrette Pussycat <[email protected]>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

    Salut!

    Cela ressemble =C3=A0 un excellent recipie[1] d=C3=A9jeuner.

    [1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718

    --Pep=C3=A9
   =20

Non-ascii characters in the string have been encoded with the quoted-printable encoding, as specified in the Content-Transfer-Encoding header.

Create an email object:

import email
from email import policy

msg = email.message_from_string(s, policy=policy.default)

Setting the policy is required here; otherwise policy.compat32 is used, which returns a legacy Message instance that doesn't have the get_content method. policy.default will eventually become the default policy, but as of Python3.7 it's still policy.compat32.

The get_content() method handles decoding automatically:

print(msg.get_content())

Salut!

Cela ressemble à un excellent recipie[1] déjeuner.

[1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718

--Pepé

If you have a multipart message, get_content() needs to be called on the individual parts, like this:

for part in message.iter_parts():
    print(part.get_content())
养猫人 2025-01-17 21:59:16

您可以/应该使用 email.parser 模块来解码邮件消息,例如(快速而肮脏的例子!):

from email.parser import FeedParser
f = FeedParser()
f.feed("<insert mail message here, including all headers>")
rootMessage = f.close()

# Now you can access the message and its submessages (if it's multipart)
print rootMessage.is_multipart()

# Or check for errors
print rootMessage.defects

# If it's a multipart message, you can get the first submessage and then its payload
# (i.e. content) like so:
rootMessage.get_payload(0).get_payload(decode=True)

使用 Message.get_payload,该模块根据其编码自动解码内容(例如,您问题中引用的可打印内容)。

You could/should use the email.parser module to decode mail messages, for example (quick and dirty example!):

from email.parser import FeedParser
f = FeedParser()
f.feed("<insert mail message here, including all headers>")
rootMessage = f.close()

# Now you can access the message and its submessages (if it's multipart)
print rootMessage.is_multipart()

# Or check for errors
print rootMessage.defects

# If it's a multipart message, you can get the first submessage and then its payload
# (i.e. content) like so:
rootMessage.get_payload(0).get_payload(decode=True)

Using the "decode" parameter of Message.get_payload, the module automatically decodes the content, depending on its encoding (e.g. quoted printables as in your question).

木森分化 2025-01-17 21:59:16

这就是所谓的引用打印编码。您可能想使用类似 quopri.decodestring - http://docs .python.org/library/quopri.html

That's known as quoted-printable encoding. You probably want to use something like quopri.decodestring - http://docs.python.org/library/quopri.html

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文