Detecting whether an email is essentially text
I'm writing an Outlook add-in that saves emails for historical purposes. Unfortunately, Outlook's MSG format is overly verbose, even when compressed, so saved MSG files end up many times the size of their text equivalent. However, saving every message as text has the obvious drawback of losing attachments, images, and any relevant formatting.
For the majority of emails this isn't an issue, but emails with a certain degree of complex formatting, pictures, attachments, and so on ought to be saved in MSG format.
The majority of users' emails are sent as HTML, which makes my algorithm roughly as follows:
1. If email has attachment(s), save as MSG and be done
2. If email is stored as text, save as text and be done
3. If email is not stored as HTML, save as MSG and be done
4. Decide if the HTML should be converted to text and
   - store it as text if so
   - store it as MSG if not
This is straightforward with the exception of Step #4: how can I decide whether an HTML-formatted email should be converted to text or kept as MSG when saving?
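For reference, the four steps can be sketched in Python roughly as below. This is only an illustration of the control flow: the `Email` and `BodyFormat` types and the `html_is_simple` callback are hypothetical stand-ins I made up for the sketch, not part of the Outlook object model, and the callback is exactly the part I don't know how to write.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class BodyFormat(Enum):
    # Hypothetical body formats; a real add-in would map Outlook's own values.
    TEXT = "text"
    HTML = "html"
    RTF = "rtf"


@dataclass
class Email:
    # Hypothetical, minimal view of a message for this sketch.
    body_format: BodyFormat
    attachment_count: int
    body: str


def choose_save_format(email: Email, html_is_simple: Callable[[str], bool]) -> str:
    """Return "msg" or "text" following the four steps above."""
    # Step 1: anything with attachments is kept as MSG.
    if email.attachment_count > 0:
        return "msg"
    # Step 2: plain-text messages lose nothing when saved as text.
    if email.body_format is BodyFormat.TEXT:
        return "text"
    # Step 3: anything that is neither text nor HTML stays MSG.
    if email.body_format is not BodyFormat.HTML:
        return "msg"
    # Step 4: the open question, delegated to a caller-supplied predicate.
    return "text" if html_is_simple(email.body) else "msg"
```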
Comments (1)
An idea: count the weighted density of HTML tags in the message. Choose a threshold based on existing data. Messages with HTML density higher than the threshold get stored as MSG; messages with density lower than the threshold get stored as plain text.
How do you calculate the weighted density? Use an HTML parsing library. Have it parse the document and count how many tags of each type are in the document. That's all you need from the library. Multiply each tag count by its weight and sum the results. Then convert the message to plain text and count the number of characters in the result. Divide the weighted tag-count sum by that character count and you have your density.
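Here is a minimal sketch of that calculation, assuming Python's standard-library `html.parser` is an acceptable stand-in for whatever HTML parsing library you actually use. The names `TagCounter` and `weighted_tag_density` are made up for this example, and the plain-text conversion is deliberately crude (concatenate the text nodes, drop `script`/`style` content, collapse whitespace); your real text conversion will likely differ.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags by name and collects the visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tag_counts = Counter()
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this method.
        self.tag_counts[tag] += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def weighted_tag_density(html, weights, default_weight=1.0):
    """Weighted tag-count sum divided by the plain-text character count."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    # Crude plain-text conversion: join text nodes and collapse whitespace.
    text = " ".join("".join(parser.text_parts).split())
    weighted_sum = sum(
        weights.get(tag, default_weight) * count
        for tag, count in parser.tag_counts.items()
    )
    # Guard against empty bodies so we never divide by zero.
    return weighted_sum / max(len(text), 1)
```

For example, `weighted_tag_density("<p>Hello <b>world</b></p>", {"p": 0.0, "b": 0.5})` divides a weighted sum of 0.5 by the 11 characters of "Hello world".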
What should the density be weighted by? By a table you create that records the importance of each type of HTML tag. I would guess that losing bold and italics is not too bad. Losing ordered and unordered lists is a bit worse, unless bullets and numbers are preserved when the messages are converted to plain text. Tables should be weighted highly, as they are important to the formatting. Choose a weight for unrecognized tags too.
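As an illustration only, such a table could be a plain dictionary. Every number below is a guess I picked to show the relative ordering described above (emphasis low, lists moderate, tables high); none of them comes from data, and the `img` entry is my own assumption based on the question wanting pictures kept. Tune them against your own emails.

```python
# Illustrative weights only; the values are guesses, not measured.
TAG_WEIGHTS = {
    # Emphasis: cheap to lose.
    "b": 0.1, "strong": 0.1, "i": 0.1, "em": 0.1,
    # Lists: somewhat worse to lose unless the text conversion keeps bullets.
    "ul": 1.0, "ol": 1.0, "li": 0.5,
    # Tables: important to the layout, so weighted highly.
    "table": 5.0, "tr": 1.0, "td": 0.5, "th": 0.5,
    # Assumption: inline images matter about as much as tables.
    "img": 5.0,
}

# Weight applied to any tag that is not listed above.
UNKNOWN_TAG_WEIGHT = 1.0


def weighted_tag_sum(tag_counts):
    """Multiply each tag count by its weight and add the results."""
    return sum(
        TAG_WEIGHTS.get(tag, UNKNOWN_TAG_WEIGHT) * count
        for tag, count in tag_counts.items()
    )
```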
How should you choose your threshold? Run your density-calculating function on a sample of emails. Also manually inspect those emails to see if they would be better off as MSG or plain text, and write that choice down for each email. Use some algorithm with that data to find the boundary value. I think that algorithm could be Naive Bayes classification, but there might be a simpler algorithm in this case. Or a human-calculated guess might be good enough. I think you could make a guess after looking at a scatter plot of human-chosen format vs weighted HTML tag density, and eyeballing the density value that approximately separates the two format decisions.
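One candidate for the "simpler algorithm" is a brute-force search for the single cut point that disagrees with the fewest human labels. To be clear, this is not Naive Bayes; it is a one-feature threshold search that I am swapping in as a sketch. The function name `best_threshold` and the `(density, wants_msg)` sample layout are assumptions made for this example.

```python
def best_threshold(samples):
    """Pick the density cut point with the fewest labeling errors.

    samples: list of (density, wants_msg) pairs, where wants_msg is True
    when a human decided the email should be kept as MSG.
    Emails with density above the returned value would be saved as MSG.
    """
    if not samples:
        raise ValueError("need at least one labeled sample")

    densities = sorted({density for density, _ in samples})

    # Candidate cut points: one below all samples, the midpoints between
    # neighbouring distinct densities, and one above all samples.
    candidates = [densities[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(densities, densities[1:])]
    candidates.append(densities[-1] + 1.0)

    def errors(threshold):
        # Count samples whose predicted format differs from the human label.
        return sum(
            (density > threshold) != wants_msg
            for density, wants_msg in samples
        )

    # On ties, min() keeps the first (lowest) candidate.
    return min(candidates, key=errors)
```

If the labeled emails separate cleanly, the result is the midpoint of the gap between the two groups; if they overlap, it is the cut that misclassifies the fewest samples, which should roughly match what you would eyeball from the scatter plot.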