Detecting whether an email is essentially text

Posted 2025-01-08 20:52:25

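I'm writing an Outlook add-in that saves emails for historical purposes. Unfortunately, Outlook's MSG format is overly verbose, even when compressed, so saved MSG files are many times the size of their text equivalent. Saving every message as text, however, has the obvious drawback of losing attachments, images, and any relevant formatting.

For most emails this isn't a problem, but emails with a certain degree of complex formatting, pictures, attachments, and so on ought to be saved in MSG format.

Most users' emails are sent as HTML, which makes my algorithm roughly as follows: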

1. If email has attachment(s), save as MSG and be done
2. If email is stored as text, save as text and be done
3. If email is not stored as HTML, store as MSG and be done
4. Decide whether the HTML should be converted to text:
     store it as text if so
     store it as MSG if not

This is straightforward except for step 4: how can I decide which format an HTML-formatted email should be saved in?
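The four steps can be sketched as a small dispatch function, with step 4 left as a pluggable predicate. This is only an illustration in Python, not add-in code: the `Email` class, the `BodyFormat` enum, and the `is_essentially_text` callback are hypothetical stand-ins for whatever the Outlook object model provides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class BodyFormat(Enum):
    """Hypothetical stand-in for the body format reported by the mail client."""
    TEXT = "text"
    HTML = "html"
    OTHER = "other"  # anything that is neither plain text nor HTML


@dataclass
class Email:
    """Hypothetical, minimal view of a message; not an Outlook type."""
    body_format: BodyFormat
    body: str
    attachment_count: int = 0


def choose_save_format(email: Email,
                       is_essentially_text: Callable[[str], bool]) -> str:
    """Return "msg" or "text" following steps 1-4 of the question."""
    # Step 1: attachments always force MSG.
    if email.attachment_count > 0:
        return "msg"
    # Step 2: plain-text messages lose nothing when saved as text.
    if email.body_format is BodyFormat.TEXT:
        return "text"
    # Step 3: anything that is not HTML is kept as MSG.
    if email.body_format is not BodyFormat.HTML:
        return "msg"
    # Step 4: the open question, delegated to a caller-supplied predicate.
    return "text" if is_essentially_text(email.body) else "msg"
```

Keeping step 4 behind a callback means any heuristic can be swapped in and tested against saved samples without touching the first three rules.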



铜锣湾横着走 2025-01-15 20:52:26

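An idea: calculate the weighted density of HTML tags in the message. Choose a threshold based on existing data. Messages whose HTML density is above the threshold get stored as MSG; messages whose density is below it get stored as plain text.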

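How do you calculate the weighted density? Use an HTML parsing library. Have it parse the document and count how many tags of each type the document contains. That is all you need from the library. Multiply each tag count by its weight and add the results together. Then convert the message to plain text and count the number of characters in it. Divide the weighted tag count sum by that number, and you have your density.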
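A minimal sketch of that calculation, assuming Python's standard-library `html.parser` as the parsing library. The names `TagCounter` and `weighted_tag_density`, the default weight of 1.0, and the choice to ignore text inside `script` and `style` are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags by name and collects the visible text."""

    SKIPPED = {"script", "style"}  # their contents are not message text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tag_counts = Counter()
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def weighted_tag_density(html, weights, default_weight=1.0):
    """Weighted tag count divided by the plain-text character count."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    # Collapse whitespace runs so indentation in the markup is not counted.
    text = " ".join("".join(parser.text_parts).split())
    weighted_sum = sum(weights.get(tag, default_weight) * count
                       for tag, count in parser.tag_counts.items())
    # Guard against division by zero for messages with no visible text.
    return weighted_sum / max(len(text), 1)
```

For example, `weighted_tag_density("<p>Hello <b>world</b></p>", {"b": 0.5})` counts one `p` at the default weight and one `b` at 0.5, then divides 1.5 by the 11 characters of "Hello world".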

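What should the density be weighted by? By a table you create that holds the importance of each kind of HTML tag. I would guess that losing bold and italics isn't too bad. Losing ordered and unordered lists is worse, unless the bullets and numbers are preserved when the message is converted to plain text. Tables should carry a high weight, since they matter a lot for formatting. Choose a weight for unrecognized tags too.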
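Such a table might look like the sketch below. Every number here is an illustrative guess rather than a measured value, and the zero weights for wrapper tags and the high weight for `img` are assumptions that go beyond what the answer states; `TAG_WEIGHTS`, `UNKNOWN_TAG_WEIGHT`, and `weighted_tag_sum` are hypothetical names.

```python
# Importance of each tag when deciding whether formatting is worth keeping.
# All values are placeholders to be tuned against real mail.
TAG_WEIGHTS = {
    # Wrappers that carry no formatting a reader would miss (assumption).
    "html": 0.0, "head": 0.0, "body": 0.0, "p": 0.0, "br": 0.0,
    # Bold and italics: losing them is not too bad.
    "b": 0.5, "strong": 0.5, "i": 0.5, "em": 0.5,
    # Lists: worse to lose, unless bullets and numbers survive conversion.
    "ul": 2.0, "ol": 2.0, "li": 1.0,
    # Tables: important for layout, so weighted heavily.
    "table": 10.0, "tr": 2.0, "td": 1.0, "th": 1.0,
    # Inline pictures cannot be represented in text at all (assumption).
    "img": 10.0,
}

# Weight applied to any tag that is missing from the table.
UNKNOWN_TAG_WEIGHT = 1.0


def weighted_tag_sum(tag_counts):
    """Multiply each tag count by its weight and add the results."""
    return sum(TAG_WEIGHTS.get(tag, UNKNOWN_TAG_WEIGHT) * count
               for tag, count in tag_counts.items())
```

With these placeholder values, two `b` tags, one `table`, and three tags missing from the table add up to 2 × 0.5 + 10 + 3 × 1 = 14.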

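How should you choose your threshold? Run your density-calculating function on a sample of emails. Also inspect those emails by hand to judge whether each would be better saved as MSG or as plain text, and write down that choice for each email. Then run some algorithm on that data to find the boundary value. I think that algorithm could be naive Bayes classification, but there may be a simpler one for this case. Or a human guess may be good enough: plot the human-chosen format against the weighted HTML tag density on a scatter plot, and eyeball the density value that approximately separates the two decisions.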
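One of the simpler options is an exhaustive search for the cut point that disagrees with the fewest hand labels. The sketch below does that instead of naive Bayes; the function name `choose_threshold` and the `(density, keep_as_msg)` sample layout are assumptions for illustration.

```python
def choose_threshold(samples):
    """Pick the density cut point that misclassifies the fewest samples.

    samples: list of (density, keep_as_msg) pairs, where keep_as_msg is
    True if a human judged that the email should be saved as MSG.
    Emails with density above the returned threshold are treated as MSG.
    """
    if not samples:
        raise ValueError("need at least one labelled sample")

    densities = sorted({density for density, _ in samples})
    # Candidate cut points: below everything, between each pair of
    # neighbouring distinct densities, and above everything.
    candidates = [densities[0] - 1.0]
    candidates += [(low + high) / 2
                   for low, high in zip(densities, densities[1:])]
    candidates.append(densities[-1] + 1.0)

    def errors(threshold):
        # Count samples whose predicted format differs from the label.
        return sum((density > threshold) != keep_as_msg
                   for density, keep_as_msg in samples)

    return min(candidates, key=errors)
```

When the labels separate cleanly, the result is the midpoint of the gap between the densest text email and the sparsest MSG email. With noisy labels it returns the first candidate that minimizes disagreements, which is still worth checking against the scatter plot.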
