Python email quoted-printable encoding problem
I am extracting emails from Gmail using the following:
def getMsgs():
    try:
        conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
    except:
        print 'Failed to connect'
        print 'Is your internet connection working?'
        sys.exit()
    try:
        conn.login(username, password)
    except:
        print 'Failed to login'
        print 'Is the username and password correct?'
        sys.exit()

    conn.select('Inbox')
    # typ, data = conn.search(None, '(UNSEEN SUBJECT "%s")' % subject)
    typ, data = conn.search(None, '(SUBJECT "%s")' % subject)
    for num in data[0].split():
        typ, data = conn.fetch(num, '(RFC822)')
        msg = email.message_from_string(data[0][1])
        yield walkMsg(msg)

def walkMsg(msg):
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        return part.get_payload()
However, for some of the emails it is nearly impossible to extract dates (using regex), because encoding-related characters such as '=' land seemingly at random in the middle of various text fields. Here's an example where one occurs in a date range I want to extract:
Name: KIRSTI Email:
[email protected] Phone #: + 999
99995192 Total in party: 4 total, 0
children Arrival/Departure: Oct 9=
,
2010 - Oct 13, 2010 - Oct 13, 2010
Is there a way to remove these encoding characters?
Comments (3)
If you are using Python 3.6 or later, you can use the email.message.EmailMessage.get_content() method to decode the text automatically. This method supersedes get_payload(), though get_payload() is still available.

Say you have a string s containing an email message (based on the examples in the docs) in which the non-ASCII characters have been encoded with the quoted-printable encoding, as specified in the Content-Transfer-Encoding header.

Create an email object from the string, parsing it with policy.default.
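As a minimal sketch of this step: the sample message below is made up for illustration (it is not the message from the original answer), and the names s, msg and body are just placeholders.

```python
import email
from email import policy

# Hypothetical sample message: the body is UTF-8 text that has been
# encoded as quoted-printable, so "à" appears as =C3=A0 and "é" as =C3=A9.
s = (
    "Subject: Lunch\n"
    "From: Pepe <pepe@example.com>\n"
    "To: Penelope <penelope@example.com>\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Cela ressemble =C3=A0 un excellent d=C3=A9jeuner.\n"
)

# Passing policy.default makes the parser return an EmailMessage,
# which provides get_content().
msg = email.message_from_string(s, policy=policy.default)

# get_content() undoes the transfer encoding and decodes the bytes
# using the charset declared in the Content-Type header.
body = msg.get_content()
print(body)
```

The printed body should contain the accented characters rather than the =C3=A0 style escapes.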
Setting the policy is required here; otherwise policy.compat32 is used, which returns a legacy Message instance that doesn't have the get_content method. policy.default will eventually become the default policy, but as of Python 3.7 it's still policy.compat32.

The get_content() method handles decoding automatically. If you have a multipart message, get_content() needs to be called on the individual parts.
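For the multipart case, a rough sketch could look like the following. The message in raw is invented for illustration and reuses the broken date range from the question; raw and plain_parts are hypothetical names.

```python
import email
from email import policy

# Hypothetical multipart message with a plain-text and an HTML alternative.
# The plain part contains a quoted-printable soft line break ("=" at the
# end of a line), like the one seen in the question.
raw = (
    "Subject: Booking\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Arrival/Departure: Oct 9=\n"
    ", 2010 - Oct 13, 2010\n"
    "--BOUNDARY\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>Arrival/Departure: Oct 9, 2010 - Oct 13, 2010</p>\n"
    "--BOUNDARY--\n"
)

msg = email.message_from_string(raw, policy=policy.default)

# walk() also yields the multipart container itself, so filter on the
# content type and call get_content() only on the text/plain leaves.
plain_parts = [
    part.get_content()
    for part in msg.walk()
    if part.get_content_type() == "text/plain"
]
print(plain_parts)
```

EmailMessage also offers get_body(preferencelist=("plain",)) for picking a single preferred body part, which may be more convenient than walking the tree by hand.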

You could/should use the email.parser module to decode mail messages. Using the "decode" parameter of Message.get_payload, the module automatically decodes the content according to its encoding (e.g. quoted-printable, as in your question).

That's known as quoted-printable encoding. You probably want to use something like quopri.decodestring - http://docs.python.org/library/quopri.html
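
As a small illustration, here is quopri.decodestring applied to a fragment shaped like the one in the question. The input bytes are an assumption about what the raw payload looked like, and UTF-8 is assumed as the charset.

```python
import quopri

# Assumed raw payload: "=" followed by a newline is a quoted-printable
# soft line break and carries no content of its own.
encoded = b"Arrival/Departure: Oct 9=\n, 2010 - Oct 13, 2010"

# decodestring() works on bytes and returns bytes.
decoded = quopri.decodestring(encoded)
text = decoded.decode("utf-8")
print(text)

# Escaped bytes such as =C3=A9 are turned back into the original bytes.
word = quopri.decodestring(b"d=C3=A9jeuner").decode("utf-8")
print(word)
```

This only reverses the transfer encoding of a bare payload; it does not parse headers or choose the charset, so it fits best when the payload has already been extracted from the message.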