Email body from a parsed email object in Jython

Posted on 2024-07-08 12:41:38

I have an object.

    fp = open(self.currentEmailPath, "rb")
    p = email.Parser.Parser()
    self._currentEmailParsedInstance= p.parse(fp)
    fp.close()

From this object, self._currentEmailParsedInstance, I want to get the body of the email: text only, no HTML.

How do I do it?


Something like this?

        newmsg=self._currentEmailParsedInstance.get_payload()
        body=newmsg[0].get_content....?

Then strip the HTML from body. What exactly is the method that returns the actual text? Maybe I misunderstand you.

        msg=self._currentEmailParsedInstance.get_payload()
        print type(msg)

Output: <type 'list'>


The email:

Return-Path:
Received: from xx.xx.net (example) by mxx3.xx.net (xxx)
id 485EF65F08EDX5E12 for [email protected]; Thu, 23 Oct 2008 06:07:51 +0200
Received: from xxxxx2 (ccc) by example.net (ccc) (authenticated as [email protected])
id 48798D4001146189 for [email protected]; Thu, 23 Oct 2008 06:07:51 +0200
From: "example"
To:
Subject: FW: example
Date: Thu, 23 Oct 2008 12:07:45 +0800
Organization: example
Message-ID: <001601c934c4$xxxx30$a9ff460a@xxx>
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_NextPart_000_0017_01C93507.F6F64E30"
X-Mailer: Microsoft Office Outlook 11
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3138
Thread-Index: Ack0wLaumqgZo1oXSBuIpUCEg/wfOAABAFEA

This is a multi-part message in MIME format.

------=_NextPart_000_0017_01C93507.F6F64E30
Content-Type: multipart/alternative;
boundary="----=_NextPart_001_0018_01C93507.F6F64E30"

------=_NextPart_001_0018_01C93507.F6F64E30
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: 7bit

From: example.example[mailto:[email protected]]
Sent: Thursday, October 23, 2008 11:37 AM
To: [email protected]
Subject: S/I for example(B/L
No.:4357-0120-810.044)

Please find attached the example.doc),

Thanks.

B.rgds,

xxx xxx

------=_NextPart_001_0018_01C93507.F6F64E30
Content-Type: text/html;
charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

xmlns:o=3D"urn:schemas-microsoft-com:office:office" =
xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns:st1=3D"urn:schemas-microsoft-com:office:smarttags" =
xmlns=3D"http://www.w3.org/TR/REC-html40">

HTML STUFF till

------=_NextPart_001_0018_01C93507.F6F64E30--

------=_NextPart_000_0017_01C93507.F6F64E30
Content-Type: application/msword;
name="xxxx.doc"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="xxxx.doc"

0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAABAAAAYAAAAAAAAAAA
EAAAYgAAAAEAAAD+////AAAAAF8AAAD/////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////s
pcEAI2AJBAAA+FK/AAAAAAAAEAAAAAAABgAAnEIAAA4AYmpiaqEVoRUAAAAAAAAAAAAAAAAAAAAA
AAAECBYAMlAAAMN/AADDfwAAQQ4AAAAAAAAPAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD//w8AAAAA
AAAAAAD//w8AAAAAAAAAAAD//w8AAAAAAAAAAAAAAAAAAAAAAKQAAAAAAEYEAAAAAAAARgQAAEYE
AAAAAAAARgQAAAAAAABGBAAAAAAAAEYEAAAAAAAARgQAABQAAAAAAAAAAAAAAFoEAAAAAAAA4hsA
AAAAAADiGwAAAAAAAOIbAAA4AAAAGhwAAHwAAACWHAAARAAAAFoEAAAAAAAABzcAAEgBAADmHAAA
FgAAAPwcAAAAAAAA/BwAAAAAAAD8HAAAAAAAAPwcAAAAAAAA/BwAAAAAAAD8HAAAAAAAAPwcAAAA
AAAAMjYAAAIAAAA0NgAAAAAAADQ2AAAAAAAANDYAAAAAAAA0NgAAAAAAADQ2AAAAAAAANDYAACQA
AABPOAAAaAIAALc6AACOAAAAWDYAAGkAAAAAAAAAAAAAAAAAAAAAAAAARgQAAAAAAABHLAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAD8HAAAAAAAAPwcAAAAAAAARywAAAAAAABHLAAAAAAAAFg2AAAAAAAA

------=_NextPart_000_0017_01C93507.F6F64E30--


I just want to get:

From: xxxx.xxxx [mailto:[email protected]]
Sent: Thursday, October 23, 2008 11:37 AM
To: [email protected]
Subject: S/I for xxxxx (B/L
No.:4357-0120-810.044)

Pls find attached the xxxx.doc),

Thanks.

B.rgds,

xxx xxx


I'm not sure whether the mail is malformed. It seems that when the message also contains an HTML part, you have to do this:

        parts = self._currentEmailParsedInstance.get_payload()
        print parts[0].get_content_type()      # multipart/alternative
        textParts = parts[0].get_payload()
        print textParts[0].get_content_type()  # text/plain
        body = textParts[0].get_payload()
        print body                             # the text, without a problem

Thank you so much, Vinko.

So it's kind of like dealing with XML: recursive in nature.
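
The recursion mentioned above can be written out explicitly. This is only a minimal sketch, assuming the parsed object is a standard email Message; the function name first_plain_text and the sample message are made up for illustration and mirror the nesting of the email in the question.

```python
import email


def first_plain_text(message):
    """Return the payload of the first text/plain part, or None if there is none."""
    if message.is_multipart():
        # For a multipart message, get_payload() returns a list of sub-messages.
        for part in message.get_payload():
            found = first_plain_text(part)
            if found is not None:
                return found
        return None
    if message.get_content_type() == "text/plain":
        return message.get_payload()
    return None


# Hypothetical sample: multipart/mixed wrapping multipart/alternative plus an attachment.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain; charset="us-ascii"

Thanks.
--inner
Content-Type: text/html; charset="us-ascii"

<html><body><p>Thanks.</p></body></html>
--inner--
--outer
Content-Type: application/msword; name="xxxx.doc"
Content-Transfer-Encoding: base64

0M8R4KGxGuE=
--outer--
"""

body = first_plain_text(email.message_from_string(SAMPLE))
```

The function does not care how deeply the text/plain part is nested, so the same call works whether or not the sender wrapped the body in multipart/alternative.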

Comments (2)

谁人与我共长歌 2024-07-15 12:41:38

This will get you the contents of the message:

    self.currentEmailParsedInstance.get_payload()

As for the text-only part, if the message contains only HTML you will have to strip the HTML on your own, for example using BeautifulSoup.
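
If installing BeautifulSoup is not an option, the standard library's HTMLParser can do a rough version of the same job. The following is a sketch of that swapped-in technique, not of BeautifulSoup itself; the names TextExtractor and strip_html are hypothetical. It keeps all character data, including the contents of script and style elements, so treat it as a starting point only.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 / Jython


class TextExtractor(HTMLParser):
    """Collect the character data found between tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the text content of an HTML string with the tags removed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered text
    return "".join(extractor.chunks)


text = strip_html("<html><body><p>Please find attached the <b>example.doc</b>.</p></body></html>")
```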

See the email package documentation for more information about the Message class that the Parser returns. If you mean getting the text part of a message that contains both HTML and plain-text versions of itself, you can pass an index to get_payload() to get the part you want.

I tried with a different MIME email because what you pasted seems malformed; hopefully it only got mangled when you edited it.

>>> import email.parser
>>> parser = email.parser.Parser()
>>> message = parser.parse(open('/home/vinko/jlm.txt','r'))
>>> message.is_multipart()
True
>>> parts = message.get_payload()
>>> len(parts)
2
>>> parts[0].get_content_type()
'text/plain'
>>> parts[1].get_content_type()
'message/rfc822'
>>> parts[0].get_payload()
'Message Text'

parts will contain all parts of the multipart message. You can check their content types as shown and keep only the text/plain ones, for instance.
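
Filtering by content type can also be done with Message.walk(), which visits every part at every nesting level. This is a sketch under assumptions: the helper name plain_text_parts and the sample message are invented for illustration, and skipping parts marked as attachments is a design choice the answer above does not mention.

```python
import email


def plain_text_parts(message):
    """Collect the payloads of all text/plain parts that are not attachments."""
    bodies = []
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        # A text/plain part can also be an attached .txt file; leave those out.
        disposition = part.get("Content-Disposition", "")
        if disposition.lower().startswith("attachment"):
            continue
        bodies.append(part.get_payload())
    return bodies


# Hypothetical sample: one text body plus one text attachment.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="b"

--b
Content-Type: text/plain

Message Text
--b
Content-Type: text/plain
Content-Disposition: attachment; filename="notes.txt"

attached notes
--b--
"""

bodies = plain_text_parts(email.message_from_string(RAW))
```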

Good luck.

月寒剑心 2024-07-15 12:41:38

I ended up with this:

        parser = email.parser.Parser()
        self._email = parser.parse(open('/home/vinko/jlm.txt','r'))
        parts=self._email.get_payload()
        check=parts[0].get_content_type()
        if check == "text/plain":
            return parts[0].get_payload()
        elif check == "multipart/alternative":
            part=parts[0].get_payload()
            if part[0].get_content_type() == "text/plain":
                return part[0].get_payload()
            else:
                return "cannot obtain the body of the email"
        else:
            return "cannot obtain the body of the email"
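
The code above assumes a multipart message and returns the payload still in its transfer encoding. A possible variation, sketched here under assumptions, handles single-part messages too and decodes quoted-printable or base64 bodies. The name body_text and the sample message are hypothetical, and returning None instead of an error string is a design choice of this sketch, not of the original.

```python
import email


def body_text(message):
    """Return the decoded text of the first text/plain part, or None."""
    # walk() yields the message itself first, so single-part mails work too.
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the Content-Transfer-Encoding.
            raw = part.get_payload(decode=True)
            # Assumes the declared charset is one Python knows about.
            charset = part.get_content_charset() or "us-ascii"
            return raw.decode(charset, "replace")
    return None


# Hypothetical single-part, quoted-printable sample.
RAW = """\
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

B.rgds =3D best regards
"""

text = body_text(email.message_from_string(RAW))
```

Returning None lets the caller tell "no plain-text body" apart from a body that happens to contain an error-like sentence.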