Parsing and analyzing MIME message structure

Posted on 2024-12-29 15:17:03

I am looking for an existing library or code samples to extract the relevant parts from a MIME message structure, in order to perform analysis on the textual content of those parts.

I will explain:

I am writing a library (in Python) that is part of a project that needs to iterate over a very large number of email messages through IMAP. For each message, it needs to determine which MIME parts it will need in order to analyze the textual content of the message with the least amount of parsing (e.g. prefer text/plain over text/html or rich text) and without duplicates (i.e. if text/plain exists, ignore the matching text/html). It also needs to handle nested parts (text attachments, forwarded messages, etc.), all without downloading the entire message body (which takes too much time and bandwidth). The end goal is to later retrieve only those parts in order to perform some statistical and pattern analysis on the text content of those messages (excluding any markup, metadata, binary data, etc.).

The libraries and examples I've seen require the full message body in order to assemble the message structure and understand the content of the message. I am trying to achieve this using the response from the IMAP FETCH command with the BODYSTRUCTURE data item.
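As a rough sketch of that approach, the following asks the server for BODYSTRUCTURE only and cuts the parenthesized value out of the response. It assumes an authenticated `imaplib.IMAP4` connection with a mailbox already selected, and it assumes the server returns the value on a single line without IMAP literals. `fetch_bodystructure` and `extract_bodystructure` are hypothetical helper names, not part of `imaplib`.

```python
def extract_bodystructure(response_line):
    """Return the parenthesized BODYSTRUCTURE value from one FETCH response line."""
    text = response_line.decode("utf-8", "replace")
    keyword = "BODYSTRUCTURE"
    # The value starts at the first "(" after the keyword.
    start = text.index("(", text.index(keyword) + len(keyword))
    depth, in_quotes, i = 0, False, start
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
        i += 1
    raise ValueError("unbalanced BODYSTRUCTURE value")


def fetch_bodystructure(conn, uid):
    """Fetch only the structure of one message; no body data is downloaded.

    Assumption: `conn` is an authenticated imaplib.IMAP4 (or IMAP4_SSL)
    object with a mailbox selected.
    """
    status, data = conn.uid("FETCH", str(uid), "(BODYSTRUCTURE)")
    if status != "OK" or not data or not isinstance(data[0], bytes):
        raise LookupError(f"no usable BODYSTRUCTURE for UID {uid}")
    return extract_bodystructure(data[0])
```

The quote tracking matters because file names and descriptions inside the structure may themselves contain parentheses.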

BODYSTRUCTURE should contain enough information to achieve my goal, but although the structure and returned data are officially documented in the relevant RFCs (3501, 2822, 2045), the amount of nesting, combinations, and various quirks all add up to make the task very tedious and error-prone.
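To make the nesting concrete, here is a minimal sketch of a recursive parser that turns a BODYSTRUCTURE value into nested Python lists. `parse_bodystructure` is a hypothetical name. The sketch assumes the value is already a plain string and does not handle IMAP literals (`{n}` followed by raw bytes) or other server quirks.

```python
def parse_bodystructure(text):
    """Parse a parenthesized BODYSTRUCTURE string into nested lists.

    Quoted strings become str, NIL becomes None, numbers become int,
    and parenthesized groups become lists.
    """
    pos = 0
    n = len(text)

    def skip_spaces():
        nonlocal pos
        while pos < n and text[pos].isspace():
            pos += 1

    def parse_item():
        nonlocal pos
        skip_spaces()
        if pos >= n:
            raise ValueError("unexpected end of input")
        ch = text[pos]
        if ch == "(":
            pos += 1
            items = []
            while True:
                skip_spaces()
                if pos >= n:
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ")":
                    pos += 1
                    return items
                items.append(parse_item())
        if ch == '"':
            pos += 1
            chars = []
            while pos < n and text[pos] != '"':
                if text[pos] == "\\" and pos + 1 < n:
                    pos += 1  # keep the escaped character literally
                chars.append(text[pos])
                pos += 1
            if pos >= n:
                raise ValueError("unterminated quoted string")
            pos += 1  # closing quote
            return "".join(chars)
        start = pos
        while pos < n and not text[pos].isspace() and text[pos] not in '()"':
            pos += 1
        if pos == start:
            raise ValueError(f"unexpected character at offset {pos}")
        atom = text[start:pos]
        if atom.upper() == "NIL":
            return None
        return int(atom) if atom.isdigit() else atom

    return parse_item()


sample = ('(("TEXT" "PLAIN" ("CHARSET" "US-ASCII") NIL NIL "7BIT" 1152 23)'
          '("TEXT" "HTML" ("CHARSET" "US-ASCII") NIL NIL "7BIT" 4554 73) "ALTERNATIVE")')
tree = parse_bodystructure(sample)
```

In the resulting tree, a multipart node is a list whose first element is itself a list, while a single part starts with its type string.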

Does anyone know of any libraries or code samples that can help achieve this (preferably in Python, but any language will do)?

Comments (2)

谜泪 2025-01-05 15:17:03

Is there something that you cannot do with the email module and its email.mime submodule?

http://docs.python.org/library/email.html#module-email
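For reference, a minimal sketch of what the standard-library `email` package offers: it parses a complete message that is already in memory, so it fits only after the whole message has been downloaded. `text_bodies` is a hypothetical helper name, and the sample message is built locally purely for illustration.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default


def text_bodies(raw_bytes):
    """Return (content_type, decoded_text) for every text/* part of a full message."""
    msg = message_from_bytes(raw_bytes, policy=default)
    return [
        (part.get_content_type(), part.get_content())
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    ]


# Build a small multipart/alternative message to parse.
sample = EmailMessage()
sample["Subject"] = "demo"
sample.set_content("plain body")
sample.add_alternative("<p>html body</p>", subtype="html")

parts = text_bodies(sample.as_bytes())
```

Note that `walk()` visits both alternatives, so picking one over the other is still left to the caller.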

睡美人的小仙女 2025-01-05 15:17:03

Answering my own question for the sake of completeness and to close this question.

I couldn't find any existing library that meets the requirements. I ended up writing my own code to fetch the BODYSTRUCTURE tree, parse it, and store it in an internal structure. This gives me the control I need to decide exactly which parts of the message to download, and to take into account various cases like attachments, forwards, redundant parts (plain text vs. HTML), etc.
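The original code was not posted, so the following is only an illustrative sketch of how such a selection could work, not the author's implementation. It assumes the BODYSTRUCTURE has already been parsed into nested Python lists (a multipart node is a list that starts with its child lists, followed by the subtype string), and that part numbers follow the RFC 3501 section numbering. `select_text_parts` and `_is_plain` are hypothetical names.

```python
def _is_plain(node):
    """True for a single text/plain part."""
    return (not isinstance(node[0], list)
            and node[0].lower() == "text" and node[1].lower() == "plain")


def select_text_parts(node, number=""):
    """Return [(part_number, content_type), ...] for the text parts worth fetching."""
    if isinstance(node[0], list):  # multipart: leading lists are the children
        children = []
        for item in node:
            if not isinstance(item, list):
                break
            children.append(item)
        subtype = str(node[len(children)]).lower()
        prefix = number + "." if number else ""
        numbered = [(child, f"{prefix}{i}") for i, child in enumerate(children, 1)]
        if subtype == "alternative":
            # Try text/plain first, then the remaining alternatives in order,
            # and keep only the first one that yields any text.
            numbered.sort(key=lambda pair: not _is_plain(pair[0]))
            for child, child_number in numbered:
                found = select_text_parts(child, child_number)
                if found:
                    return found
            return []
        selected = []
        for child, child_number in numbered:
            selected.extend(select_text_parts(child, child_number))
        return selected

    number = number or "1"  # a non-multipart message only has part 1
    main_type, subtype = node[0].lower(), node[1].lower()
    if main_type == "text":
        return [(number, f"text/{subtype}")]
    if main_type == "message" and subtype == "rfc822":
        inner = node[8]  # fields: ..., size, envelope, body, lines
        # A multipart inner body numbers its children N.1, N.2, ...;
        # a single-part inner body is addressed as N.1.
        inner_number = number if isinstance(inner[0], list) else number + ".1"
        return select_text_parts(inner, inner_number)
    return []  # images, PDFs and other binary parts are skipped


plain = ["TEXT", "PLAIN", ["CHARSET", "utf-8"], None, None, "7BIT", 120, 4]
html = ["TEXT", "HTML", ["CHARSET", "utf-8"], None, None, "QUOTED-PRINTABLE", 900, 20]
pdf = ["APPLICATION", "PDF", ["NAME", "a.pdf"], None, None, "BASE64", 5000]
forwarded = ["MESSAGE", "RFC822", None, None, None, "7BIT", 2000, None,
             [plain, html, "ALTERNATIVE"], 40]
message = [[plain, html, "ALTERNATIVE"], pdf, forwarded, "MIXED"]

wanted = select_text_parts(message)
```

Each returned part number can then be requested on its own, for example with a FETCH of `BODY.PEEK[1.1]`, so that only the selected text is transferred.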
