How can I reformat messages in an mbox file using bash or Perl?

Posted 2024-07-11 05:51:21

I have a huge mbox file, with maybe 500 emails in it.

It looks like the following:

From [email protected] Fri Aug 12 09:34:09 2005
Message-ID: <[email protected]>
Date: Fri, 12 Aug 2005 09:34:09 +0900
From: me <[email protected]>
User-Agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: someone <[email protected]>
Subject: Re: (no subject)
References: <[email protected]>
In-Reply-To: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Status: RO
X-Status: 
X-Keywords:                 
X-UID: 371
X-Evolution-Source: imap://[email protected]/
X-Evolution: 00000002-0010

Hey

the actual content of the email

someone wrote:

> lines of quotedtext

I would like to know how I can remove all of the quoted text, strip every header except the To, From and Date lines, and still have the messages run together more or less continuously.

My goal is to print these emails in a book-like format. At the moment every program wants to print either one email per page or all of the headers and quoted text. Any suggestions for where to start on whipping up a small program using shell tools?

Comments (3)

陪你搞怪i 2024-07-18 05:51:21

Mail::Box::Mbox lets you parse the file into separate messages easily. Mark Overmeer's slides from YAPC::Europe 2002 explain in some detail why parsing mail is much harder than it seems. The same library also handles MH, IMAP and many other formats besides mbox.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Mail::Box::Manager;

    # Folder to read: first argument, else the user's default mailbox
    my $file = shift || $ENV{MAIL};
    my $mgr = Mail::Box::Manager->new(
        access      => 'r',
    );

    my $folder = $mgr->open( folder => $file )
        or die "$file: Unable to open: $!\n";

    for my $msg ($folder->messages)
    {
        my $to   = join( ', ', map { $_->format } $msg->to );
        my $from = join( ', ', map { $_->format } $msg->from );
        my $date = localtime( $msg->timestamp );

        # Decode the transfer encoding and get the body as a plain string
        my $body = $msg->body->decoded->string;

        # Strip all quoted lines. No /s flag here: with /s, ".*" would run
        # past the end of the line and delete everything after the first quote.
        $body =~ s/^>.*\n?//mg;

        print "From: $from\n", "To: $to\n", "Date: $date\n", "\n", $body, "\n";
    }

You may want to reconsider your request to strip the quoted text. What if you have email that is formatted with interleaved replies? Stripping the quoted text would make this sort of email very hard to understand:

  Foo wrote:
  > I like bar.

  Bar?  Who likes bar?

  > It is better than baz.

  Everyone knows that.

  -- 
  Quux

Additionally, what do you plan to do with attachments, non-text/plain MIME types, encoded text entities and other oddities?

三生一梦 2024-07-18 05:51:21

As a start, I would probably use "formail" to extract the mails with just the headers you want. Failing that, use a small state machine in awk that tracks whether you are in the header or not: while in the header, strip everything but the wanted headers; while in the body, strip the quoted lines.
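
The awk state-machine idea could look roughly like the sketch below. It is only a sketch under some assumptions: the function name `reformat_mbox` is made up for illustration, each message is assumed to start with a `From ` separator line at the top of the file or after a blank line, and only lines starting with `>` are treated as quotes. That last assumption also means mbox-escaped `>From ` body lines get dropped, and attribution lines such as "someone wrote:" are left in place.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative): reads an mbox from stdin or
# from the files given as arguments and prints a condensed version.
reformat_mbox() {
    awk '
        # A "From " line at the file start or after a blank line opens a new
        # message: switch to header state and drop the separator itself.
        /^From / && (NR == 1 || prev_blank) {
            in_header = 1; keep = 0; prev_blank = 0
            next
        }
        in_header {
            if ($0 == "") {
                # The first blank line ends the header block.
                in_header = 0; prev_blank = 1
                print ""
            } else if ($0 ~ /^[ \t]/) {
                # Folded continuation line: keep it only if its header was kept.
                if (keep) print
            } else {
                keep = ($0 ~ /^(From|To|Date):/)
                if (keep) print
            }
            next
        }
        # Body state: drop quoted lines, print everything else.
        /^>/ { prev_blank = 0; next }
        { print; prev_blank = ($0 == "") }
    ' "$@"
}
```

Something like `reformat_mbox mail.mbox > book.txt` would then give one continuous text file. The kept headers appear in whatever order the original message had them, and nothing here decodes MIME or transfer encodings.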

手长情犹 2024-07-18 05:51:21

Shell tools may not be the best answer here, as there are libraries in many languages for dealing with mbox, be it Ruby, Perl or whatever. You will also have to consider that the quoting prefix is not always "> ", which can break your de-quoting process. As for extracting the headers you want, that should not be difficult in any language. I know this is a general answer, maybe not specific enough...
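
To illustrate the point about quote prefixes, here is a small sketch that treats a few other common styles as quotes too. The pattern in `QUOTE_RE` and the function name `strip_quotes` are assumptions made up for this example, not any standard: it matches optional leading whitespace followed by `>`, `|`, or a short run of letters and `>` (initials-style prefixes like `JS> `). It is a heuristic, so it can also delete an ordinary body line that happens to start that way.

```shell
#!/bin/sh
# Assumed heuristic for "this line is quoted text"; adjust to your mail.
QUOTE_RE='^[[:space:]]*([>|]|[A-Za-z]{1,8}>)'

# Hypothetical helper: print the input (stdin or files) without quoted lines.
strip_quotes() {
    # grep exits with status 1 when every line was filtered out; that is
    # not an error for this purpose, so only real failures are passed on.
    grep -v -E "$QUOTE_RE" "$@" || [ "$?" -eq 1 ]
}
```

This is meant to run on body text only, for example on the output of a step that has already dealt with the headers.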
