Parse email content from quoted reply

Posted 2024-07-08 08:09:04

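I'm trying to figure out how to separate the text of an email from any quoted reply text it might include. I've noticed that email clients usually add a line such as "On such-and-such date, so-and-so wrote" or prefix the quoted lines with an angle bracket. Unfortunately, not everyone does this. Does anyone have an idea of how to detect reply text programmatically? I am writing this parser in C#.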


十六岁半 2024-07-15 08:09:04

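I did a lot more searching on this, and here is what I found. There are basically two situations in which you do this: when you have the entire thread and when you don't. I'll break it up into those two categories:

When you have the thread: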

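If you have the entire series of emails, you can be very confident that what you are removing really is quoted text. There are two ways to do this. First, you can use each message's Message-ID, In-Reply-To ID, and Thread-Index to identify the individual message, its parent, and the thread it belongs to. For details, see RFC 822, RFC 2822, this interesting article on threading, or this article on threading. Once you have reassembled the thread, you can remove the extraneous text (such as the To:, From:, and CC: lines) and you're done.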
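As a rough illustration of the header-based approach, the Ruby sketch below walks the In-Reply-To links from a message back to the root of its thread. The `Message` struct, the `ancestry` helper, and the sample IDs are hypothetical; a real mailbox would also need the References and Thread-Index headers and handling for missing parents.

```ruby
# Hypothetical message record: only the two header values matter here.
Message = Struct.new(:message_id, :in_reply_to, :body)

# Returns the chain of messages from the thread root down to `leaf`,
# following In-Reply-To links. Stops when a parent is missing or a
# cycle is detected.
def ancestry(messages, leaf)
  by_id = messages.to_h { |m| [m.message_id, m] }
  chain = []
  seen = {}
  current = leaf
  while current && !seen[current.message_id]
    seen[current.message_id] = true
    chain.unshift(current)
    current = by_id[current.in_reply_to]
  end
  chain
end

# Made-up sample thread.
msgs = [
  Message.new("<1@example.com>", nil, "Original question"),
  Message.new("<2@example.com>", "<1@example.com>", "First reply"),
  Message.new("<3@example.com>", "<2@example.com>", "Second reply")
]
```

With the parent of each message known, the parent's body can be compared against the reply to decide which part is quoted.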

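Second, if the messages you are working with don't have headers, you can use similarity matching to work out which parts of an email are reply text, that is, which text is repeated from an earlier message. In that case you might want to look into a Levenshtein distance algorithm, such as this one on Code Project or this one.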
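A minimal sketch of that idea in Ruby, assuming the parent message's text is available: compute the Levenshtein distance line by line and drop reply lines that are nearly identical to a parent line. The helper names and the `max_ratio` threshold are illustrative assumptions, not taken from the linked articles.

```ruby
# Classic dynamic-programming Levenshtein distance between two strings.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Drops every line of `reply` that is nearly identical to a line of `parent`.
# `max_ratio` (an assumed tuning knob) is the allowed distance relative to
# the longer of the two lines.
def strip_repeated_lines(reply, parent, max_ratio = 0.2)
  parent_lines = parent.lines.map(&:strip).reject(&:empty?)
  kept = reply.lines.reject do |line|
    # Ignore a leading quote prefix such as "> " before comparing.
    candidate = line.sub(/\A[>\s]+/, "").strip
    next false if candidate.empty?
    parent_lines.any? do |p|
      levenshtein(candidate, p) <= max_ratio * [candidate.length, p.length].max
    end
  end
  kept.join.strip
end
```

Comparing every reply line with every parent line is quadratic, so this is only practical for messages of ordinary length.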

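Either way, if you're interested in the threading process, check out this great PDF on reassembling email threads.

When you don't have the thread:

If you only have one message from the thread, you have to try to guess what the quoted part is. In that case, here are the different quoting methods I have seen: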

A horizontal line (as seen in Outlook)
Angle brackets
"---Original Message---"
"On such-and-such day, so-and-so wrote:"

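Remove the text from that point down and you're done. The downside to all of these is that they assume the sender put the reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!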


陪我终i 2024-07-15 08:09:04

First of all, this is a tricky task.

You should collect typical responses from different e-mail clients and prepare suitable regular expressions (or whatever) to parse them. I've collected responses from Outlook, Thunderbird, Gmail, Apple Mail, and mail.ru.

I am using regular expressions to parse responses in the following manner: if an expression did not match, I try to use the next one.

new Regex("From:\\s*" + Regex.Escape(_mail), RegexOptions.IgnoreCase);
new Regex("<" + Regex.Escape(_mail) + ">", RegexOptions.IgnoreCase);
new Regex(Regex.Escape(_mail) + "\\s+wrote:", RegexOptions.IgnoreCase);
new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline);
new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase);
new Regex("from:\\s*$", RegexOptions.IgnoreCase);

Finally, to remove the quoted lines:

new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline);

Here is my small collection of test responses (samples are separated by ----):

From: [email protected] [mailto:[email protected]] 
Sent: Tuesday, January 13, 2009 1:27 PM
----
2008/12/26 <[email protected]>

>  text
----
[email protected] wrote:
> text
----
      [email protected] wrote:         text
text
----
2009/1/13 <[email protected]>

>  text
----
 [email protected] wrote:         text
 text
----
2009/1/13 <[email protected]>

> text
> text
----
2009/1/13 <[email protected]>

> text
> text
----
[email protected] wrote:
> text
> text
<response here>
----
--- On Fri, 23/1/09, [email protected] <[email protected]> wrote:

> text
> text
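The try-each-expression-in-turn procedure described in this answer can be sketched in Ruby roughly as follows. This is an illustration under assumptions, not the author's code: `cut_at_first_match` and `quote_patterns` are hypothetical names, only some of the expressions are carried over, and the address is a placeholder.

```ruby
# Cuts `text` at the first pattern (in list order) that matches, then removes
# any remaining lines that start with ">".
def cut_at_first_match(text, patterns)
  pattern = patterns.find { |re| text.match?(re) }
  head = pattern ? text[0, text.index(pattern)] : text
  head.lines.reject { |line| line.start_with?(">") }.join.strip
end

# Builds the pattern list for one correspondent's address.
def quote_patterns(address)
  escaped = Regexp.escape(address)
  [
    /From:\s*#{escaped}/i,
    /<#{escaped}>/i,
    /#{escaped}\s+wrote:/i,
    /-+original\s+message-+\s*$/i,
    /from:\s*$/i
  ]
end
```

Note the design choice: the first pattern in the list wins here, whereas cutting at the earliest match across all patterns is a different rule that can give a different result when several patterns match.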


陌若浮生 2024-07-15 08:09:04

Thank you, Goleg, for the regexes! Really helped. This isn't C#, but for the googlers out there, here's my Ruby parsing script:

def extract_reply(text, address)
    regex_arr = [
      # In a double-quoted Ruby string "\s" is a literal space, so the
      # backslash is doubled to pass the regex class \s through.
      Regexp.new("From:\\s*" + Regexp.escape(address), Regexp::IGNORECASE),
      Regexp.new("<" + Regexp.escape(address) + ">", Regexp::IGNORECASE),
      Regexp.new(Regexp.escape(address) + "\\s+wrote:", Regexp::IGNORECASE),
      Regexp.new("^.*On.*(\n)?wrote:$", Regexp::IGNORECASE),
      Regexp.new("-+original\\s+message-+\\s*$", Regexp::IGNORECASE),
      Regexp.new("from:\\s*$", Regexp::IGNORECASE)
    ]

    text_length = text.length
    #calculates the matching regex closest to top of page
    index = regex_arr.inject(text_length) do |min, regex|
        [(text.index(regex) || text_length), min].min
    end

    text[0, index].strip
end

It's worked pretty well so far.


世界如花海般美丽 2024-07-15 08:09:04

By far the easiest way is to place a marker in your content, such as:

--- Reply above this line ---
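A minimal Ruby sketch of splitting on such a marker, assuming your application wrote the marker into the outgoing message. The constant and the `reply_above_marker` helper are hypothetical names.

```ruby
# Assumed marker text; use whatever line your application actually sends.
REPLY_MARKER = "--- Reply above this line ---"

# Returns the text above the marker line, or nil when the marker is missing
# so the caller can fall back to other heuristics.
def reply_above_marker(body, marker = REPLY_MARKER)
  idx = body.index(marker)
  return nil unless idx
  # Back up to the start of the marker's line so a "> " prefix is dropped too.
  line_start = (body.rindex("\n", idx) || -1) + 1
  body[0, line_start].strip
end
```

Any attribution line the client inserts above the marker (for example "On ..., ... wrote:") is still part of the returned text and has to be removed separately.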

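As you have no doubt noticed, parsing out quoted text is not a trivial task, because different email clients quote text in different ways. To solve this problem properly you need to account for, and test in, every email client.

Facebook can do this, but unless your project has a big budget, you probably can't.

Oleg has solved the problem by using regexes to find the "On 13 Jul 2012, at 13:09, xxx wrote:" text. However, if the user deletes this text, or replies at the bottom of the email as many people do, this solution will not work.

Likewise, if the email client uses a different date string, or doesn't include a date string at all, the regex will fail.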


迷乱花海 2024-07-15 08:09:04

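There is no universal indicator of a reply in an email. The best you can do is try to catch the most common patterns and add new ones as you come across them.

Keep in mind that some people insert their replies inside the quoted text (my boss, for example, answers questions on the same line where I asked them), so whatever you do, you might lose some information you would have liked to keep.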
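One way to avoid silently discarding such inline answers is to detect them first and skip the truncation for those messages. The Ruby sketch below is a rough heuristic under the assumption that quoted lines start with ">"; `interleaved_reply?` is a hypothetical helper name.

```ruby
# Returns true when unquoted text appears after a quoted (">") line, which
# suggests the sender answered inline or below the quote rather than on top.
def interleaved_reply?(body)
  seen_quote = false
  body.each_line do |line|
    stripped = line.strip
    next if stripped.empty?
    if stripped.start_with?(">")
      seen_quote = true
    elsif seen_quote
      return true
    end
  end
  false
end
```

A signature placed below the quoted text also triggers this check, so it errs on the side of keeping too much rather than too little.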


甜妞爱困 2024-07-15 08:09:04

Here is my C# version of @hurshagrawal's Ruby code. I don't know Ruby really well so it could be off, but I think I got it right.

// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public string ExtractReply(string text, string address)
{
    // Patterns that mark the start of quoted text in common mail clients.
    var regexes = new List<Regex>
    {
        new Regex("From:\\s*" + Regex.Escape(address), RegexOptions.IgnoreCase),
        new Regex("<" + Regex.Escape(address) + ">", RegexOptions.IgnoreCase),
        new Regex(Regex.Escape(address) + "\\s+wrote:", RegexOptions.IgnoreCase),
        new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline),
        new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase),
        new Regex("from:\\s*$", RegexOptions.IgnoreCase),
        new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline)
    };

    // Cut at the earliest match of any pattern; keep the whole text if none match.
    var index = text.Length;

    foreach (var regex in regexes)
    {
        var match = regex.Match(text);

        if (match.Success && match.Index < index)
            index = match.Index;
    }

    return text.Substring(0, index).Trim();
}


妄想挽回 2024-07-15 08:09:04

It should be fairly easy these days, given text/html content type works for you (with Outlook being an exception; see details below). Here is a table with the real testing results of parsing options in various desktop email clients:

Mail client	Reply message format	HTML can be parsed easily and reliably	HTML tags to be deleted	Plain text quote marker
web.de	always html	yes	`<div name="quote">`	- (always html)
Thunderbird	same as in the original message	yes	`<div class="moz-cite-prefix">`, `<blockquote type="cite">`	"On 26.10.2022 12:37, John Doe wrote:"
Gmail	both	yes	`<div class="gmail_quote">`	"On Thu, Oct 27, 2022 at 1:39 PM John Doe [email protected] wrote:"
Outlook 2016, 2019	same as in the original message	Probably impossible due to use of some weird Word processor	unknown	Plain text-only message: "-----Original Message-----", multipart: 3 blank lines with some space followed by "From: John Doe [email protected]"
Apple	unknown	yes	`<blockquote type="cite">`	"> On 22. Dec 2021, at 12:50, John Doe [email protected] wrote:"
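For the HTML case, a deliberately naive Ruby sketch is to cut the markup at the first quote container listed in the table. The opening tags come from the table above; the constant, the `html_above_quote` helper, and the exact attribute matching are assumptions, and real messages may carry extra attributes or class names that these patterns miss.

```ruby
# Opening tags of quote containers, taken from the table above.
QUOTE_CONTAINERS = [
  /<div[^>]*\bname="quote"[^>]*>/i,             # web.de
  /<div[^>]*\bclass="moz-cite-prefix"[^>]*>/i,  # Thunderbird
  /<div[^>]*\bclass="gmail_quote"[^>]*>/i,      # Gmail
  /<blockquote[^>]*\btype="cite"[^>]*>/i        # Thunderbird, Apple Mail
].freeze

# Returns the HTML that precedes the earliest quote container, or the whole
# input when no container is found.
def html_above_quote(html)
  cut = QUOTE_CONTAINERS.map { |re| html.index(re) }.compact.min
  cut ? html[0, cut] : html
end
```

Truncating the string can leave unclosed tags behind, so for anything beyond a quick experiment an HTML parser that removes the matching elements is the safer choice.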


思念满溢 2024-07-15 08:09:04

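If you control the original message (for example, notifications from a web application), you can put a distinct, identifiable header in it and use that as the delimiter for the original post.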


离不开的别离 2024-07-15 08:09:04

This is a good solution. Found it after searching for so long.

One addition: as mentioned above, this has to be handled case by case. The expressions above did not correctly parse my Gmail and Outlook (2010) responses, so I added the following two regexes. Let me know if you find any issues.

//Works for Gmail
new Regex("\\n.*On.*<(\\r\\n)?" + Regex.Escape(address) + "(\\r\\n)?>", RegexOptions.IgnoreCase),
//Works for Outlook 2010
new Regex("From:.*" + Regex.Escape(address), RegexOptions.IgnoreCase),

Cheers
