How to parse email text into components such as <salutation><body><signature><reply text> etc.?

Posted on 2024-11-07 07:37:29


I'm writing an application that analyzes emails, and it would save me a bunch of time if I could use a Python library that parses email text into named components like <salutation><body><signature><reply text> etc.

For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as

Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: ..."

I know there's no perfect solution for this kind of problem, but even a library that does a good approximation would help. Where can I find one?


清引 2024-11-14 07:37:29

https://github.com/Trindaz/EFZP

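This provides the functionality asked for in the original question, plus fair recognition of email zones as they commonly appear in emails written by native English speakers using common email clients such as Outlook and Gmail.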


暮年 2024-11-14 07:37:29


If you score each line based on the types of words it contains, you may get a fairly good indication of which component it belongs to.

E.g. a line with greeting words near the start is the salutation (salutations may also contain phrases that refer to the past, e.g. "it was good to see you last time").

A body will typically contain words such as "movie", "concert", etc. It will also contain verbs (go to, run, walk, etc.), question marks and offers (e.g. want to, can we, should we, prefer...).
Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation
http://ogden.basic-english.org/
http://osteele.com/projects/pywordnet/

The signature will contain closing words.

If you find a data source that has messages with the structure you want, you could do some frequency analysis to see how often each word occurs in each section.

Each word would get a score vector [salutation score, body score, signature score, ...].
E.g. "hello" could occur 900 times in the salutation, 10 times in the body, and 3 times in the signature.
This means "hello" would get assigned [900, 10, 3, ...].
"cheers" might get assigned [10, 3, 100, ...].
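The frequency analysis described above could be sketched roughly like this. It assumes you already have lines labeled with their component; the `COMPONENTS` order, the function name and the tiny training set are made up for illustration and are not from any library.

```python
from collections import Counter, defaultdict

# Component order is an assumption made for this sketch.
COMPONENTS = ["salutation", "body", "signature"]

def count_word_scores(labeled_lines):
    """Return {word: [count per component]} from (component, text) pairs."""
    counts = defaultdict(Counter)
    for component, text in labeled_lines:
        for word in text.lower().split():
            # Strip surrounding punctuation so "dave," and "dave" match.
            counts[word.strip(".,!?:;")][component] += 1
    # Drop the empty string left over from punctuation-only tokens.
    return {w: [c[name] for name in COMPONENTS] for w, c in counts.items() if w}

# Tiny made-up training set, only to show the shape of the result.
labeled = [
    ("salutation", "Hello Dave,"),
    ("salutation", "Hello Tom,"),
    ("body", "Lets meet up this Tuesday"),
    ("signature", "Cheers, Tom"),
]
scores = count_word_scores(labeled)
print(scores["hello"])  # [2, 0, 0]
print(scores["tom"])    # [1, 0, 1]
```

With a real corpus the counts would look more like the [900, 10, 3] figures mentioned above.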

Now you will have a large list of about 500,000 words.
Words that don't have a large range aren't useful.
E.g. "catch" might have [100, 101, 80, ...] = range of 21
("it was good to catch up", "wanna go catch a fish", "catch you later"). "catch" can occur anywhere.

Now you can reduce the number of words down to about 10,000.
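As a rough illustration of that pruning step, here is a minimal sketch that drops words whose counts are spread too evenly. The function name and the threshold of 50 are arbitrary choices for this example; the counts are the ones used in this answer.

```python
def filter_by_range(word_scores, min_range):
    """Keep only words whose max count minus min count is at least min_range."""
    return {
        word: vec
        for word, vec in word_scores.items()
        if max(vec) - min(vec) >= min_range
    }

# Counts taken from the examples in this answer; the threshold is an
# arbitrary value chosen for this sketch.
word_scores = {
    "hello": [900, 10, 3],
    "cheers": [10, 3, 100],
    "catch": [100, 101, 80],
}
useful = filter_by_range(word_scores, min_range=50)
print(sorted(useful))  # ['cheers', 'hello']
```

"catch" is dropped because its range is only 21, so it says little about where a line belongs.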

Now, for each line, give the line a score of the same form: [salutation score, body score, signature score, ...].

This score is calculated by adding the score vectors of each word in the line.

E.g. a sentence "hello cheers for giving me your number" could be:
[900, 10, 3, ...] + [10, 3, 100, ...] + ... = [900+10+..., 10+3+..., 3+100+..., ...]
= [1023, 900, 500, ...], say.

Then, because the biggest number is at the start, in the salutation score position, this sentence is a salutation.

So to find out which component any one of your lines belongs in, add up the score of each word in the line and pick the component with the largest total.
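The line scoring could look something like the sketch below. It assumes a `word_scores` table of the kind described earlier; the function names and the three-component order are illustrative assumptions, and words missing from the table simply contribute nothing.

```python
# Component order is an assumption made for this sketch.
COMPONENTS = ["salutation", "body", "signature"]

def score_line(line, word_scores):
    """Add up the score vectors of every known word in the line."""
    total = [0] * len(COMPONENTS)
    for word in line.lower().split():
        vec = word_scores.get(word.strip(".,!?:;"))
        if vec:
            total = [t + v for t, v in zip(total, vec)]
    return total

def classify_line(line, word_scores):
    """Pick the component with the largest total (first one wins on ties)."""
    total = score_line(line, word_scores)
    return COMPONENTS[total.index(max(total))]

word_scores = {"hello": [900, 10, 3], "cheers": [10, 3, 100]}
line = "hello cheers for giving me your number"
print(score_line(line, word_scores))     # [910, 13, 103]
print(classify_line(line, word_scores))  # salutation
```

Note that a line with no known words scores all zeros and falls back to the first component, so a real version would probably want an explicit "unknown" case.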

Good luck. There is always a trade-off between computational complexity and accuracy. If you can find a good set of words and build a good model to base your calculations on, it will help.


撑一把青伞 2024-11-14 07:37:29

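The first approach that comes to mind (not necessarily the best...) is to start by using split. Here's some code and stuff

Linearray=emailtext.split('\n')
Now you have an array of strings, each one like a paragraph or whatever,

so Linearray[0] will contain the salutation.

Deciding where the reply text starts is a bit trickier. I noticed there is a double newline before it, so maybe search for that from the back and hope the last one marks the start of the reply text.

Or store some landmark words you might expect, such as cheers, regards, etc., and search for those.

Once you figure out where the signature is, the rest is easy.

Hope this helps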
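Putting those ideas together, a minimal sketch might look like this. The list of closing words and the function name are assumptions made for illustration, and the "last double newline" rule is only the hopeful heuristic described above; it will misfire on emails that have blank lines inside the reply text.

```python
# Hypothetical landmark words that may open a signature; extend as needed.
CLOSINGS = ("cheers", "regards", "thanks", "best")

def split_email(emailtext):
    # Assume the last blank line separates the author's text from the reply.
    cut = emailtext.rfind("\n\n")
    if cut == -1:
        own, reply = emailtext, ""
    else:
        own, reply = emailtext[:cut], emailtext[cut + 2:]
    lines = own.split("\n")
    # The signature starts at the first line (after the salutation) that
    # begins with one of the landmark words.
    sig_at = next(
        (i for i, line in enumerate(lines)
         if i > 0 and line.lower().startswith(CLOSINGS)),
        len(lines),
    )
    return {
        "salutation": lines[0],
        "body": "\n".join(lines[1:sig_at]),
        "signature": "\n".join(lines[sig_at:]),
        "reply_text": reply,
    }

text = ("Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\n"
        "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\n"
        "How about we get together ...")
parts = split_email(text)
print(parts["salutation"])  # Hi Dave,
print(parts["signature"])   # Cheers, Tom
```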


走过海棠暮 2024-11-14 07:37:29


I actually built a pretty cheap API for this, to parse the contact data from the signatures of emails and email chains. It's called SigParser. You can see the Swagger docs here for it.

Basically, you send it an 'x-api-key' header with a JSON body like the one below, and it parses all the contacts in the reply chain of an email.

{
  "subject": "Thanks for meeting...",
  "from_address": "[email protected]",
  "from_name": "Bill Gates",
  "htmlbody": "<div>Hi, good seeing you the other day.</div><div>--</div><div>Bill Gates</div><div>Cell 777-444-8888</div><a href=\"https://www.linkedin.com/in/williamhgates/\">LinkedIn</a><a href=\"https://twitter.com/BillGates\">Twitter</a>",
  "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
  "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)"
}
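For illustration only, here is how such a request could be assembled with Python's standard library. The URL is a placeholder, not the real SigParser endpoint (check the Swagger docs for that), the Content-Type header is an assumption, and the sketch only builds the request without sending it.

```python
import json
import urllib.request

def build_request(url, api_key, email):
    """Build (but do not send) a POST request carrying the email as JSON."""
    body = json.dumps(email).encode("utf-8")
    headers = {
        "x-api-key": api_key,                # header name from this answer
        "Content-Type": "application/json",  # assumption for this sketch
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

email = {
    "subject": "Thanks for meeting...",
    "from_name": "Bill Gates",
    "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
    "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)",
}

# Placeholder URL and key: replace both with the real values from the docs.
req = build_request("https://example.invalid/parse", "YOUR_API_KEY", email)
print(req.get_method(), req.full_url)
```

Sending it would then be a matter of passing `req` to `urllib.request.urlopen` and decoding the JSON response.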
