How can I reformat messages in an mbox file using bash or Perl?
I have a huge mbox file, with maybe 500 emails in it.
It looks like the following:
From [email protected] Fri Aug 12 09:34:09 2005
Message-ID: <[email protected]>
Date: Fri, 12 Aug 2005 09:34:09 +0900
From: me <[email protected]>
User-Agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: someone <[email protected]>
Subject: Re: (no subject)
References: <[email protected]>
In-Reply-To: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Status: RO
X-Status:
X-Keywords:
X-UID: 371
X-Evolution-Source: imap://[email protected]/
X-Evolution: 00000002-0010

Hey
the actual content of the email
someone wrote:
> lines of quotedtext
I would like to know how I can remove all of the quoted text, strip all of the headers except the To, From and Date lines, and still have the result read as one more or less continuous text.
My goal is to print these emails in a book-like format. At the moment every program wants to print either one email per page, or all of the headers and quoted text. Any suggestions for where to start on whipping up a small program using shell tools?
Comments (3)
Mail::Box::Mbox will let you easily parse the file into separate messages. Mark Overmeer's slides from YAPC::Europe 2002 go into quite a bit of detail as to why parsing is much more difficult than it seems. The same library also handles mh, IMAP and many other formats besides mbox.
You may want to reconsider your request to strip the quoted text. What if you have email that is formatted with interleaved replies? Stripping the quoted text would make that sort of email very hard to understand.
Additionally, what do you plan to do with attachments, non-text/plain MIME types, encoded text entities and other oddities?
As a start, I would probably use "formail" to extract the mails with just the headers you want. Failing that, use a small state machine in awk that tracks whether you're in the header or the body: while in the header, strip everything but the wanted headers; while in the body, strip the quoted lines.
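The awk approach could look roughly like the sketch below. It is not a full mbox parser. It assumes each message starts with a "From " line at the top of the file or after a blank line, that headers end at the first blank line, and that quoted lines start with ">". The function name strip_mbox is made up for illustration. MIME parts, encoded bodies and unescaped "From " lines inside a body are not handled.

```shell
# strip_mbox: read an mbox from stdin (or from files given as arguments) and
# print only the To/From/Date headers and the unquoted body lines.
strip_mbox() {
  awk '
    # A "From " line at the start of the file or after a blank line
    # begins a new message, so switch to the header state.
    /^From / && (NR == 1 || prev_blank) {
      in_header = 1; keep = 0; prev_blank = 0
      next
    }
    in_header {
      if ($0 == "") {            # blank line: headers end, body begins
        in_header = 0; prev_blank = 1
        print ""
        next
      }
      if ($0 ~ /^[ \t]/) {       # folded continuation of the previous header
        if (keep) print
        next
      }
      keep = (tolower($0) ~ /^(to|from|date):/)
      if (keep) print
      next
    }
    {
      prev_blank = ($0 == "")
      if ($0 ~ /^>/) next        # drop quoted lines
      print
    }
  ' "$@"
}
```

Something like `strip_mbox < archive.mbox > book.txt` (both file names are placeholders) would then give one continuous text file. Attribution lines such as "someone wrote:" are left in place by this version.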
Using shell tools may not be the best answer here, as there are many libraries in many languages that deal with mbox, be it in Ruby, Perl or whatever. You will also have to consider that the quoting prefix is not always "> ", which can screw up your de-quoting process. As for extracting the headers you want, this should not be difficult in any language. I know this is a general answer, maybe not specific enough...
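If you do stay with shell tools, one way to cope with varying quote prefixes is to make the prefix a parameter instead of hard-coding it. The sketch below is only an illustration. The function name strip_quotes and the default pattern (optional leading whitespace, then ">" or "|") are assumptions, not a standard, and the "|" prefix will also drop any genuine line that starts with a pipe. It is meant to run on body text, and it also drops a "... wrote:" attribution line when the next line is quoted.

```shell
# strip_quotes: filter body text on stdin, dropping quoted lines.
# $1 is an optional awk regex for the quote prefix.
strip_quotes() {
  awk -v re="${1:-^[ \t]*[>|]}" '
    # Each line is held back by one, so that an attribution line
    # ("... wrote:") can be dropped once the line after it is seen
    # to be quoted.
    function flush() { if (have) print held; have = 0 }
    $0 ~ re {
      if (have && held ~ /wrote:[ \t]*$/) have = 0   # discard attribution
      flush()
      next                                           # discard quoted line
    }
    { flush(); held = $0; have = 1 }
    END { flush() }
  '
}
```

Calling `strip_quotes '^#'`, for example, would treat only lines starting with "#" as quoted. Choosing the pattern per mailbox, after looking at a few messages, is probably safer than trying to guess one pattern that fits every mail client.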