Parse email content from quoted reply

Posted on 2024-07-08 08:09:04 · 144 words · 11 views · 0 comments

I'm trying to figure out how to parse out the text of an email from any quoted reply text that it might include. I've noticed that usually email clients will put an "On such and such date so and so wrote" or prefix the lines with an angle bracket. Unfortunately, not everyone does this. Does anyone have any idea on how to programmatically detect reply text? I am using C# to write this parser.

Comments (11)

十六岁半 2024-07-15 08:09:04

I did a lot more searching on this and here's what I've found. There are basically two situations under which you are doing this: when you have the entire thread and when you don't. I'll break it up into those two categories:

When you have the thread:

If you have the entire series of emails, you can achieve a very high level of assurance that what you are removing is actually quoted text. There are two ways to do this. One, you could use the message's Message-ID, In-Reply-To, and Thread-Index headers to identify the individual message, its parent, and the thread it belongs to. For more information on this, see RFC 822, RFC 2822, this interesting article on threading, or this article on threading. Once you have reassembled the thread, you can remove the external text (such as the To, From, and CC lines) and you're done.
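As a rough illustration of the header-based approach, here is a minimal sketch in Ruby (the language of one of the scripts in this thread) that links messages to their parents through Message-ID and In-Reply-To. The hash layout (`:id`, `:in_reply_to`, `:body`) and the name `build_thread_index` are assumptions made for this example, not part of any mail library; a real parser would fill these fields from the message headers.

```ruby
# Minimal sketch: build a parent => children map from Message-ID and
# In-Reply-To values. The message hashes are hypothetical placeholders.
def build_thread_index(messages)
  by_id = {}
  messages.each { |m| by_id[m[:id]] = m }

  children = Hash.new { |h, k| h[k] = [] }
  roots = []
  messages.each do |m|
    parent_id = m[:in_reply_to]
    if parent_id && by_id.key?(parent_id)
      children[parent_id] << m[:id]
    else
      roots << m[:id] # no known parent: treat the message as a thread root
    end
  end
  { roots: roots, children: children }
end

messages = [
  { id: "<1@example.com>", in_reply_to: nil, body: "Original question" },
  { id: "<2@example.com>", in_reply_to: "<1@example.com>", body: "First reply" },
  { id: "<3@example.com>", in_reply_to: "<2@example.com>", body: "Second reply" }
]
index = build_thread_index(messages)
```

Once each message knows its parent, the parent's body can be compared against the reply to decide which lines are quoted.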

If the messages you are working with do not have the headers, you can instead use similarity matching to determine which parts of an email are repeated, and therefore quoted, text. In this case you might want to look into a Levenshtein distance algorithm, such as this one on Code Project or this one.
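To make the similarity idea concrete, here is a small Ruby sketch: a standard dynamic-programming Levenshtein distance, plus a hypothetical helper `quoted_line?` that treats a reply line as quoted when it is nearly identical to some line of an earlier message. The helper name and the 0.2 threshold are assumptions chosen for illustration and would need tuning on real mail.

```ruby
# Classic dynamic-programming Levenshtein (edit) distance between two strings.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: a line counts as quoted when its normalized distance
# to some line of the earlier message is at most `threshold` (an arbitrary guess).
def quoted_line?(line, earlier_lines, threshold = 0.2)
  text = line.sub(/\A\s*(>\s*)+/, "").strip # ignore any leading ">" prefix
  return false if text.empty?
  earlier_lines.any? do |orig|
    orig = orig.strip
    next false if orig.empty?
    levenshtein(text, orig).to_f / [text.length, orig.length].max <= threshold
  end
end
```

Comparing line by line keeps the cost manageable; running the distance over whole message bodies would be much slower, since the algorithm is quadratic in the string lengths.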

No matter what, if you're interested in the threading process, check out this great PDF on reassembling email threads.

When you don't have the thread:

If you are stuck with only one message from the thread, you're going to have to try to guess what the quote is. In that case, here are the different quotation methods I have seen:

  1. A line (as seen in Outlook).
  2. Angle brackets
  3. "---Original Message---"
  4. "On such-and-such day, so-and-so wrote:"

Remove the text from there down and you're done. The downside to any of these is that they all assume that the sender put their reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!
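A minimal Ruby sketch of the "cut at the first marker" idea, assuming the reply sits on top of the quote. The patterns are illustrative guesses based on the four quotation methods listed above (in particular, representing the Outlook line as a run of underscores is an assumption), and `strip_quoted_tail` is a made-up name.

```ruby
# Returns the text above the earliest quote marker, or the whole text
# (stripped) if no marker is found.
def strip_quoted_tail(text)
  markers = [
    /^_{5,}\s*$/,                       # a horizontal rule made of underscores (assumed Outlook style)
    /^-+\s*Original Message\s*-+\s*$/i, # "---Original Message---"
    /^On\b.*\bwrote:\s*$/i,             # "On such-and-such day, so-and-so wrote:"
    /^>/                                # first line prefixed with an angle bracket
  ]
  cut = markers.map { |re| text.index(re) }.compact.min
  (cut ? text[0, cut] : text).strip
end

reply = "Sounds good.\n\nOn Mon, Jan 5, 2009, Bob wrote:\n> Lunch?\n"
own_text = strip_quoted_tail(reply)
```

Because everything below the first marker is discarded, interleaved replies lose their content with this approach, which is exactly the limitation described above.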

陪我终i 2024-07-15 08:09:04

First of all, this is a tricky task.

You should collect typical responses from different e-mail clients and prepare correct regular expressions (or whatever) to parse them. I've collected responses from Outlook, Thunderbird, Gmail, Apple Mail, and mail.ru.

I am using regular expressions to parse responses in the following manner: if an expression did not match, I try to use the next one.

new Regex("From:\\s*" + Regex.Escape(_mail), RegexOptions.IgnoreCase);
new Regex("<" + Regex.Escape(_mail) + ">", RegexOptions.IgnoreCase);
new Regex(Regex.Escape(_mail) + "\\s+wrote:", RegexOptions.IgnoreCase);
new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline);
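// Multiline is added so that $ matches at the end of a line, not only at the end of the message
new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase | RegexOptions.Multiline);
new Regex("from:\\s*$", RegexOptions.IgnoreCase | RegexOptions.Multiline);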

Finally, to remove the quoted lines:

new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline);

Here is my small collection of test responses (samples divided by --- ):

From: [email protected] [mailto:[email protected]] 
Sent: Tuesday, January 13, 2009 1:27 PM
----
2008/12/26 <[email protected]>

>  text
----
[email protected] wrote:
> text
----
      [email protected] wrote:         text
text
----
2009/1/13 <[email protected]>

>  text
----
 [email protected] wrote:         text
 text
----
2009/1/13 <[email protected]>

> text
> text
----
2009/1/13 <[email protected]>

> text
> text
----
[email protected] wrote:
> text
> text
<response here>
----
--- On Fri, 23/1/09, [email protected] <[email protected]> wrote:

> text
> text
陌若浮生 2024-07-15 08:09:04

Thank you, Goleg, for the regexes! Really helped. This isn't C#, but for the googlers out there, here's my Ruby parsing script:

def extract_reply(text, address)
    # Regexp literals are used because "\s" inside a double-quoted Ruby
    # string is a literal space, not the whitespace character class.
    regex_arr = [
      /From:\s*#{Regexp.escape(address)}/i,
      /<#{Regexp.escape(address)}>/i,
      /#{Regexp.escape(address)}\s+wrote:/i,
      /^.*On.*(\n)?wrote:$/i,
      /-+original\s+message-+\s*$/i,
      /from:\s*$/i
    ]

    text_length = text.length
    #calculates the matching regex closest to top of page
    index = regex_arr.inject(text_length) do |min, regex|
        [(text.index(regex) || text_length), min].min
    end

    text[0, index].strip
end

It's worked pretty well so far.

世界如花海般美丽 2024-07-15 08:09:04

By far the easiest way to do this is by placing a marker in your content, such as:

--- Please reply above this line ---

As you have no doubt noticed, parsing out quoted text is not a trivial task as different email clients quote text in different ways. To solve this problem properly you need to account for and test in every email client.

Facebook can do this, but unless your project has a big budget, you probably can't.

Oleg has solved the problem using regexes to find the "On 13 Jul 2012, at 13:09, xxx wrote:" text. However, if the user deletes this text, or replies at the bottom of the email, as many people do, this solution will not work.

Likewise, if the email client uses a different date string, or doesn't include one at all, the regex will fail.
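For what it's worth, the marker approach can be sketched in a few lines of Ruby. This assumes your outgoing emails contain the marker text verbatim and that the reply was typed above it; `text_above_marker` is a hypothetical name, and the cleanup of the trailing "On ... wrote:" line is a guess at what clients typically insert just before the quoted marker.

```ruby
# Returns the text above the marker, or nil when the marker is missing
# (the caller can then fall back to regex heuristics).
def text_above_marker(body, marker = "--- Please reply above this line ---")
  idx = body.index(marker)
  return nil if idx.nil?
  lines = body[0, idx].lines.map(&:rstrip)
  # Drop trailing blank lines and bare quote prefixes (">") left before the marker.
  lines.pop while !lines.empty? && lines.last =~ /\A[>\s]*\z/
  # Drop a trailing attribution line such as "On ..., X wrote:", if present.
  lines.pop if !lines.empty? && lines.last =~ /wrote:\z/i
  lines.join("\n").strip
end

body = "Thanks, that fixed it.\n\nOn Fri, Jul 13, 2012, App wrote:\n" \
       "> --- Please reply above this line ---\n> Your ticket was updated.\n"
reply = text_above_marker(body)
```

If the user deletes the marker or replies below it, the function returns nil or an empty string, so it is worth keeping a fallback.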

迷乱花海 2024-07-15 08:09:04

There is no universal indicator of a reply in an e-mail. The best you can do is try to catch the most common patterns and add parsing for new ones as you come across them.

Keep in mind that some people insert replies inside the quoted text (My boss for example answers questions on the same line as I asked them) so whatever you do, you might lose some information you would have liked to keep.
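When replies are interleaved with the quote, one option is to drop only the lines that look quoted instead of cutting everything below the first marker. A minimal Ruby sketch, assuming quoted lines start with ">" and the attribution looks like "On ... wrote:" (`drop_quoted_lines` is a made-up name):

```ruby
# Removes ">"-prefixed lines and "On ... wrote:" attribution lines,
# keeping whatever the sender typed between them.
def drop_quoted_lines(text)
  kept = text.lines.reject do |line|
    line =~ /\A\s*>/ || line =~ /\AOn\b.*\bwrote:\s*\z/i
  end
  kept.join.gsub(/\n{3,}/, "\n\n").strip
end

mail = "On Mon, Bob wrote:\n> Can you do Tuesday?\nYes.\n> And bring the report?\nWill do.\n"
answers = drop_quoted_lines(mail)
```

This does not help with answers typed on the same line as the quoted question, since those sit inside a ">" line and are removed along with it.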

甜妞爱困 2024-07-15 08:09:04

Here is my C# version of @hurshagrawal's Ruby code. I don't know Ruby really well so it could be off, but I think I got it right.

using System.Collections.Generic;
using System.Text.RegularExpressions;

public string ExtractReply(string text, string address)
{
    var regexes = new List<Regex>() { new Regex("From:\\s*" + Regex.Escape(address), RegexOptions.IgnoreCase),
                        new Regex("<" + Regex.Escape(address) + ">", RegexOptions.IgnoreCase),
                        new Regex(Regex.Escape(address) + "\\s+wrote:", RegexOptions.IgnoreCase),
                        new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline),
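                        // Multiline is added so that $ matches at the end of a line, not only at the end of the message
                        new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase | RegexOptions.Multiline),
                        new Regex("from:\\s*$", RegexOptions.IgnoreCase | RegexOptions.Multiline),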
                        new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline)
                    };

    var index = text.Length;

    foreach(var regex in regexes){
        var match = regex.Match(text);

        if(match.Success && match.Index < index)
            index = match.Index;
    }

    return text.Substring(0, index).Trim();
}
妄想挽回 2024-07-15 08:09:04

It should be fairly easy these days, provided the text/html content type works for you (with Outlook being an exception; see details below). Here is a table with real test results for the parsing options in various desktop email clients:

| Mail client | Reply message format | HTML can be parsed easily and reliably | HTML tags to be deleted | Plain text quote marker |
|---|---|---|---|---|
| web.de | always html | yes | <div name="quote"> | - (always html) |
| Thunderbird | same as in the original message | yes | <div class="moz-cite-prefix">, <blockquote type="cite"> | "On 26.10.2022 12:37, John Doe wrote:" |
| Gmail | both | yes | <div class="gmail_quote"> | "On Thu, Oct 27, 2022 at 1:39 PM John Doe [email protected] wrote:" |
| Outlook 2016, 2019 | same as in the original message | Probably impossible due to use of some weird Word processor | unknown | Plain text-only message: "-----Original Message-----", multipart: 3 blank lines with some space followed by "From: John Doe [email protected]" |
| Apple | unknown | yes | <blockquote type="cite"> | "> On 22. Dec 2021, at 12:50, John Doe [email protected] wrote:" |
思念满溢 2024-07-15 08:09:04

If you control the original message (e.g. notifications from a web application), you can put a distinct, identifiable header in place, and use that as the delimiter for the original post.

离不开的别离 2024-07-15 08:09:04

This is a good solution. Found it after searching for so long.

One addition: as mentioned above, this has to be handled case by case. The expressions above did not correctly parse my Gmail and Outlook (2010) responses, so I added the following two regexes. Let me know if you run into any issues.

//Works for Gmail
new Regex("\\n.*On.*<(\\r\\n)?" + Regex.Escape(address) + "(\\r\\n)?>", RegexOptions.IgnoreCase),
//Works for Outlook 2010
new Regex("From:.*" + Regex.Escape(address), RegexOptions.IgnoreCase),

Cheers

荒人说梦 2024-07-15 08:09:04

This is an old post, but in case you are not aware, GitHub has a Ruby library for extracting the reply. If you use .NET, I have a .NET version at https://github.com/EricJWHuang/EmailReplyParser

生寂 2024-07-15 08:09:04

If you use SigParser.com's API, it will give you an array of all the broken out emails in a reply chain from a single email text string. So if there are 10 emails, you'll get the text for all 10 of the emails.

You can view the detailed API spec here.

https://api.sigparser.com/
