Some links are missing when fetching email over IMAP

Posted 2025-02-12 16:05:56


I'm extracting the links from the email using imaplib and email, but the result is missing the main link, although the others are there.

# Assume that I know the ID of the email I need to parse: '599'
typ, email_data = mail.fetch('599', '(RFC822)')

msg = email.message_from_bytes(email_data[0][1])
print(msg.get_payload()[0].get_payload())

Here's my email with three links:

[Screenshot of the email in Gmail]

This is the result:

Today's highlights
Web API in=C2=A0.Net 6.0 with Auth0 with Roles and Permissions
This week, I was tutoring a student client of mine. We have been
working ou= r way through using Auth0. It=E2=80=A6
Jay (https://medium.com/@second-link) in ProjectWT
(https://medium.com/@third-link)
=C2=B73 min read

Links two and three are absolutely identical to those in the email, but as you can see the first link is missing (also in all similar cases) and I can't understand why. Any help would be appreciated.

Adding the default policy is not helping.

message = email.message_from_bytes(msg_as_bytes, policy=policy.default)


话少情深 2025-02-19 16:05:56


The immediate problem seems to be that you are probably extracting links from a MIME part which simply contains only two links. The structure of the message is apparently something like

-+ multipart/alternative
 -- text/plain
 -+ multipart/related
  -- text/html
  -- image/png
  -- image/png

where your screenshot shows the text/html part with its related images, but the text excerpt shows the first part, text/plain, and the link extraction targets that part, too.
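
To check whether a message really has this shape, it can help to print its MIME tree first. The following is a minimal sketch, not a definitive tool: RAW is a hypothetical stand-in for the bytes returned by the IMAP fetch, and describe_structure is a helper name invented for this illustration.

```python
import email
from email.policy import default

# Hypothetical stand-in for email_data[0][1] from the IMAP fetch.
RAW = b"""\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="outer"

--outer
Content-Type: text/plain; charset=utf-8

plain text
--outer
Content-Type: multipart/related; boundary="inner"

--inner
Content-Type: text/html; charset=utf-8

<p>html</p>
--inner
Content-Type: image/png
Content-Transfer-Encoding: base64

aGVsbG8=
--inner--
--outer--
"""


def describe_structure(msg, depth=0):
    """Return one indented line per MIME part, depth-first."""
    lines = ["  " * depth + msg.get_content_type()]
    if msg.is_multipart():
        for part in msg.iter_parts():
            lines.extend(describe_structure(part, depth + 1))
    return lines


structure = describe_structure(email.message_from_bytes(RAW, policy=default))
print("\n".join(structure))
```

If the printed tree shows text/plain as the first child, then get_payload()[0] in the question is reading the plain-text alternative rather than the HTML part.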

In the general case, if you are processing a collection of messages from multiple senders using multiple email clients and sending multiple types of messages (some with embedded images, others perhaps with a PDF attachment or a collection of CSV files), you will need to analyze each individual message's structure and decide which MIME part(s) you want to extract based on those results. But for the common case where the message's top-level structure is either just a single body part or a common multipart/alternative with a text/plain and a text/html rendering of the same "main" message (in any order), recent versions of Python offer a simple method which attempts to "do the right thing".
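
For the general case, one possible approach is to walk every part and filter on content type and disposition. This is only a sketch under assumptions: the sample message, its boundaries, the file name data.csv, and the helper select_parts are all made up for illustration.

```python
import email
from email.policy import default

# Hypothetical message: two renderings of the body plus a CSV attachment.
RAW = b"""\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="mix"

--mix
Content-Type: multipart/alternative; boundary="alt"

--alt
Content-Type: text/plain; charset=utf-8

plain rendering
--alt
Content-Type: text/html; charset=utf-8

<p>html rendering</p>
--alt--
--mix
Content-Type: text/csv; charset=utf-8
Content-Disposition: attachment; filename="data.csv"

a,b
1,2
--mix--
"""


def select_parts(msg, content_type):
    """Return non-attachment parts whose content type matches exactly."""
    return [
        part
        for part in msg.walk()
        if part.get_content_type() == content_type and not part.is_attachment()
    ]


msg = email.message_from_bytes(RAW, policy=default)
html_parts = select_parts(msg, "text/html")
attachments = [part.get_filename() for part in msg.walk() if part.is_attachment()]

print(len(html_parts), attachments)
```

Which parts count as "the body" is a per-application decision; the filter here is deliberately simple and would need adjusting for messages with several inline HTML parts.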

As an aside, the email module in the standard library was overhauled in Python 3.6 to be more logical, versatile, and succinct; new code should target the (no longer very) new EmailMessage API. When you supply a policy argument such as policy=default to message_from_bytes, this is what you get. Without it, you get the legacy email.message.Message API, also called "compat32" because it is compatible back to Python 3.2 and earlier. (The new API was informally introduced in Python 3.3, though it only became the preferred and official API in 3.6.)
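
The difference between the two APIs is easy to observe directly. A small sketch, with RAW as a made-up one-line message:

```python
import email
from email.message import EmailMessage, Message
from email.policy import default

RAW = b"Subject: demo\n\nbody\n"  # hypothetical minimal message

legacy = email.message_from_bytes(RAW)                  # compat32 policy
modern = email.message_from_bytes(RAW, policy=default)  # modern API

print(type(legacy).__name__, hasattr(legacy, "get_body"))  # Message False
print(type(modern).__name__, hasattr(modern, "get_body"))  # EmailMessage True
```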

With that, the following code should hopefully do what you want.

from email.policy import default

msg = email.message_from_bytes(email_data[0][1], policy=default)
print(msg.get_body().get_content())

The new API should not require you to separately request decoding of the extracted body part's content transfer encoding, which was another problem with your original attempt.
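
The quoted-printable debris in the question (=C2=A0, =E2=80=A6) comes from reading the payload without decoding it. The sketch below compares the two APIs on a hypothetical single-part message; the variable names are illustrative only.

```python
import email
from email.policy import default

# Hypothetical single-part message using quoted-printable encoding.
RAW = (
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"Web API in=C2=A0.Net 6.0=E2=80=A6\n"
)

legacy = email.message_from_bytes(RAW)
raw_text = legacy.get_payload()  # still quoted-printable encoded
# Legacy API: decode the transfer encoding, then the charset, by hand.
decoded = legacy.get_payload(decode=True).decode(
    legacy.get_content_charset("utf-8")
)

modern = email.message_from_bytes(RAW, policy=default)
content = modern.get_content()  # both decoding steps happen here

print(repr(raw_text))
print(repr(content))
```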

get_body() (which did not exist at all in the legacy API) lets you specify an ordered list of preferred MIME types, but the default preference list should do what you want in this case. It will prefer HTML if available, and otherwise fall back to plain text.
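
As a sketch of how the preference list affects link extraction, the example below selects the HTML part explicitly and collects href values with the standard library's HTMLParser. The message, the example.com URLs, and the LinkCollector class are assumptions made for illustration, not taken from the original email.

```python
import email
from email.policy import default
from html.parser import HTMLParser

# Hypothetical message: the plain-text alternative lacks the first link.
RAW = b"""\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="alt"

--alt
Content-Type: text/plain; charset=utf-8

Headline (no link here)
Jay (https://example.com/second)
--alt
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<p><a href=3D"https://example.com/first">Headline</a></p>
<p><a href=3D"https://example.com/second">Jay</a></p>
--alt--
"""


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


msg = email.message_from_bytes(RAW, policy=default)
html_part = msg.get_body(preferencelist=("html", "plain"))
plain_part = msg.get_body(preferencelist=("plain",))

collector = LinkCollector()
collector.feed(html_part.get_content())
print(collector.links)
```

Asking for ("plain",) only would return the text/plain part, which is where the first link goes missing in this made-up message.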

For testing, here is a quick and dirty example message with the assumed structure. If you need more help, consider posting a new question with a sample message (ideally pared down to just the essentials, and without the IMAP code, which isn't relevant to this particular problem).

From: tripleee <[email protected]>
To: you <[email protected]>
Subject: Simple multipart example
MIME-Version: 1.0
Content-type: multipart/alternative; boundary="snowden-risen-woodward-manning"

--snowden-risen-woodward-manning
Content-type: text/plain; charset=utf-8
Content-transfer-encoding: quoted-printable

Today's highlights

Web API in=C2=A0.Net 6.0 with Auth0 with Roles and Permissions=

--snowden-risen-woodward-manning
Content-type: multipart/related; boundary="pol-pot-stalin-trump-mao"

--pol-pot-stalin-trump-mao
Content-type: text/html; charset=utf-8
Content-transfer-encoding: quoted-printable

<h1>Today's highlights</h1>

<p><a href=3D"https://example.com/spam">=
Web API in=C2=A0.Net 6.0 with Auth0 with Roles and Permissions=
</a></p>
<img src=3D"cid:[email protected]"/>
<img src=3D"cid:[email protected]"/>

--pol-pot-stalin-trump-mao
Content-type: image/png
Content-transfer-encoding: base64
Content-id: <[email protected]>

somebase64gobbledygook=
--pol-pot-stalin-trump-mao
Content-type: image/png
Content-transfer-encoding: base64
Content-id: <[email protected]>

morebase64gobbledygook=
--pol-pot-stalin-trump-mao--
--snowden-risen-woodward-manning--
