Removing the previous part from reply e-mails
I'm trying to write an application that periodically receives e-mails. It writes every mail into a database. But sometimes I get a 'Re:' e-mail that looks something like this:
New message
On September 21, 2010 24:26 Someone wrote (a):
| Old message
|
The format depends on the e-mail provider.
Is there any library that helps remove the 'Re' part from an e-mail message? Maybe the IMAP server can do that? I have all the previous e-mails from the thread in the database, so I could fetch them and search for them in the new message.
Comments (4)
If you can associate a reply (RE:) message with the original/previous message it replies to, you could fetch the body text of that message from your database and remove it from the body of the reply. However, this method will not be 100% accurate, because clients may convert an HTML/rich-text e-mail into plain text, or vice versa, and in any such case it probably won't work. Even so, the technique is generic and would probably work most of the time.
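A minimal sketch of that idea, in Python since the question names no language. The function names and the normalization rules (strip leading '>' and '|' markers, collapse whitespace, ignore case) are assumptions made for illustration, not something the thread specifies:

```python
import re

# Assumed quote markers: '>' and '|' (the question's example uses '|').
_QUOTE_PREFIX = re.compile(r"^[ \t>|]+")


def _normalize(line: str) -> str:
    """Drop leading quote markers and collapse whitespace so lines compare equal."""
    stripped = _QUOTE_PREFIX.sub("", line)
    return " ".join(stripped.split()).lower()


def remove_known_original(reply_body: str, original_body: str) -> str:
    """Keep only the reply lines that do not also appear in the stored original."""
    known = {_normalize(line) for line in original_body.splitlines()} - {""}
    kept = []
    for line in reply_body.splitlines():
        text = _normalize(line)
        if text in known:
            continue  # this line also appears in the stored original
        if not text and line.strip():
            continue  # this line holds nothing but quote markers
        kept.append(line)
    return "\n".join(kept).strip()
```

Note that the attribution line ("On ... wrote:") survives, because it is not part of the stored original, and that a genuinely new line which happens to repeat a line of the original would be removed as well. Re-wrapped quoted lines will not match either, which is one more reason the approach cannot be fully accurate.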
In addition, the e-mail provider may add certain header fields, or preambles, to the beginning of a quoted message in a reply. I don't think there is any "catch-all" solution for that case.
My recommendation would be to target a few of the really large web-mail providers (Gmail, Yahoo, Microsoft, etc.), learn the formats they use for replies, and parse the messages accordingly. You could likely handle a few generic formats as well. For instance, the '>' character is commonly used at the beginning of each line of quoted text in a reply.
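As a rough illustration of handling one generic format, the sketch below drops lines that start with a quote marker and lines that look like an "On ... wrote:" attribution. Both patterns are assumptions based on the example in the question; real providers vary, and localized attribution lines will not match:

```python
import re

# Assumed generic patterns; neither is a complete description of any provider.
QUOTED_LINE = re.compile(r"^\s*[>|]")
ATTRIBUTION = re.compile(r"^\s*On\b.*\bwrote\b.*:\s*$", re.IGNORECASE)


def strip_generic_quotes(body: str) -> str:
    """Return the body without quoted lines and attribution lines."""
    kept = []
    for line in body.splitlines():
        if ATTRIBUTION.match(line) or QUOTED_LINE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Skipping line by line, rather than cutting everything after the first quoted line, keeps replies that are interleaved with the quoted text.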
If you're going to be developing in a language like C#, create an interface such as IReplyFormat, with a corresponding implementation for each provider and possibly some for generic formats.
I don't think you will find any catch-all/perfect solution to this problem, as there are simply too many mail providers with too many different formats. However, I think you can at the very least find some techniques, like the ones mentioned above, that will work more often than not, which is the best you can hope for at this point.
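The IReplyFormat idea could be sketched as follows. This is a Python rendering rather than C#, and everything except the overall shape is hypothetical: the ReplyFormat protocol, the two example formats, and the assumption that some clients separate the quote with an "-----Original Message-----" line are illustrations only.

```python
import re
from typing import List, Optional, Protocol


class ReplyFormat(Protocol):
    """Hypothetical counterpart of the IReplyFormat interface mentioned above."""

    def split(self, body: str) -> Optional[str]:
        """Return the new text if this format applies, otherwise None."""


class OriginalMessageFormat:
    """Assumed format: the quote starts at an '-----Original Message-----' line."""

    MARKER = re.compile(
        r"^-+\s*Original Message\s*-+\s*$", re.IGNORECASE | re.MULTILINE
    )

    def split(self, body: str) -> Optional[str]:
        match = self.MARKER.search(body)
        return body[: match.start()].strip() if match else None


class AngleBracketFormat:
    """Generic format: quoted lines start with '>'."""

    def split(self, body: str) -> Optional[str]:
        lines = body.splitlines()
        if not any(line.lstrip().startswith(">") for line in lines):
            return None
        kept = [line for line in lines if not line.lstrip().startswith(">")]
        return "\n".join(kept).strip()


# More specific formats go first; generic ones act as fallbacks.
FORMATS: List[ReplyFormat] = [OriginalMessageFormat(), AngleBracketFormat()]


def extract_new_text(body: str) -> str:
    """Try each known format in order; return the body unchanged if none applies."""
    for fmt in FORMATS:
        result = fmt.split(body)
        if result is not None:
            return result
    return body.strip()
```

Supporting another provider then means adding one class to the list, without touching the code that calls extract_new_text.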
Personally, I think you are out of luck here, as the copy of the earlier message is part of the body. To remove it, you would have to process the message body and write an extraction method for each known format (the obvious problem being that you cannot know all possible formats).
So, instead of parsing the body, why not persist the whole message in the database? Normally the size of a message should not be a problem for a modern DBMS. If it really is a problem, you can always compress the body and store it in a BLOB.
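A small sketch of the compress-and-store suggestion, using Python's standard zlib and sqlite3 modules. SQLite stands in for whatever DBMS the application really uses, and the table and column names are made up for the example:

```python
import sqlite3
import zlib
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) mail table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail ("
        "message_id TEXT PRIMARY KEY, raw BLOB NOT NULL)"
    )
    return conn


def save_message(conn: sqlite3.Connection, message_id: str, raw: bytes) -> None:
    """Store the complete, unmodified message as a zlib-compressed BLOB."""
    conn.execute(
        "INSERT OR REPLACE INTO mail (message_id, raw) VALUES (?, ?)",
        (message_id, zlib.compress(raw)),
    )
    conn.commit()


def load_message(conn: sqlite3.Connection, message_id: str) -> Optional[bytes]:
    """Return the original bytes, or None if the id is unknown."""
    row = conn.execute(
        "SELECT raw FROM mail WHERE message_id = ?", (message_id,)
    ).fetchone()
    return zlib.decompress(row[0]) if row else None
```

Because the stored bytes are the full message, any quote-stripping logic can be changed later and re-run without losing data.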
and you have to omit the parts from this line below. However, checking only this will not be sufficient, as From is usually followed by Subject, Cc, To, etc., so the whole pattern needs to be checked. I think some open source project or text library may exist, but it's too difficult to find on Google.
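One way to check for the pattern rather than a single line is sketched below. The set of header names, the four-line window, and the threshold of two further headers are arbitrary assumptions chosen for illustration:

```python
import re
from typing import List, Optional

# Assumed header names that may follow 'From:' in a quoted header block.
HEADER_LINE = re.compile(r"^\s*(From|Sent|Date|To|Cc|Subject)\s*:", re.IGNORECASE)


def find_quoted_header_block(
    lines: List[str], min_headers: int = 2, window: int = 4
) -> Optional[int]:
    """Return the index of a 'From:' line followed by other header-like lines."""
    for i, line in enumerate(lines):
        match = HEADER_LINE.match(line)
        if not match or match.group(1).lower() != "from":
            continue
        following = lines[i + 1 : i + 1 + window]
        others = sum(1 for item in following if HEADER_LINE.match(item))
        if others >= min_headers:
            return i
    return None


def cut_at_header_block(body: str) -> str:
    """Drop everything from the quoted header block onward, if one is found."""
    lines = body.splitlines()
    index = find_quoted_header_block(lines)
    if index is None:
        return body.strip()
    return "\n".join(lines[:index]).strip()
```

Requiring several header-like lines in a row avoids cutting a message merely because a sentence happens to start with "From:".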
I agree with Obalix. Filtering out replies is too hard, so you have to keep the whole message. However, when you present an e-mail to the user, you can hide some parts of it. Those parts can be revealed with an optional "Click here to see the full message" link or similar. For instance, a regular expression to filter '>' characters would look something like
@"^[ \f\t\v>]*"
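To show how that expression might drive the display, here is a sketch that reuses the same pattern in Python syntax. The function names and the placeholder text are assumptions; a real UI would render a link or toggle instead of a text line:

```python
import re

# Same character class as the expression above, written as a Python raw string.
LEADING_QUOTE = re.compile(r"^[ \f\t\v>]*")


def quote_depth(line: str) -> int:
    """Count the '>' markers in the leading run of whitespace and '>' characters."""
    return LEADING_QUOTE.match(line).group(0).count(">")


def fold_quotes(
    body: str, placeholder: str = "[Click here to see the full message]"
) -> str:
    """Replace each run of quoted lines with one placeholder line for display."""
    shown = []
    folded = False
    for line in body.splitlines():
        if quote_depth(line) > 0:
            if not folded:
                shown.append(placeholder)
                folded = True
            continue
        shown.append(line)
        folded = False
    return "\n".join(shown)
```

Only the displayed text changes; the stored message stays complete, so the full version can always be shown on request.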