How to identify emails sent by humans?

Posted 2025-01-01 11:28:27

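I am working on a project that needs to identify emails sent by real people, as opposed to bulk mail, notifications and newsletters. Is there any definite way to do this? Is there any information in the email headers that could help? I am working with Gmail over IMAP, so the mail I have is already non-spam.

Any help with this is appreciated. Thanks!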

心是晴朗的。 2025-01-08 11:28:27

There isn't a clear way to distinguish bulk mail from personalised mailings. Unlike with spam, most bulk mail is requested/expected, so the sender doesn't do odd things to get round spam filters, which means these emails often blend in fairly well.

However, there are some trends that you can look for. If you want to do it reliably, you will probably need to apply a scoring system, as spam filters do.
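
As a rough illustration of what such a scoring system could look like, here is a minimal sketch. The rule names, the weights and the threshold are all made up for the example (nothing here is tuned on real mail), and the message is simplified to a plain dict with hypothetical "from" and "body" keys plus one key per lower-cased header.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Simplified message: lower-cased header names plus "from" and "body" (an assumption
# made for this sketch, not a real mail library structure).
SimpleMessage = Dict[str, str]


@dataclass
class Rule:
    name: str
    weight: float  # illustrative weight, not a tuned value
    check: Callable[[SimpleMessage], bool]


# Hypothetical rules; each one fires on a single bulk-mail trait.
RULES: List[Rule] = [
    Rule("list_unsubscribe_header", 3.0,
         lambda m: "list-unsubscribe" in m),
    Rule("noreply_sender", 2.0,
         lambda m: m.get("from", "").lower().startswith(
             ("noreply@", "no-reply@", "do-not-reply@"))),
    Rule("generic_greeting", 1.5,
         lambda m: m.get("body", "").lower().startswith(
             ("dear user", "dear customer"))),
]

THRESHOLD = 3.0  # assumed cut-off; would need tuning against real mail


def bulk_score(message: SimpleMessage, rules: List[Rule] = RULES) -> float:
    """Sum the weights of every rule that fires for this message."""
    return sum(rule.weight for rule in rules if rule.check(message))


def looks_bulk(message: SimpleMessage, threshold: float = THRESHOLD) -> bool:
    """Treat the message as bulk mail once its score reaches the threshold."""
    return bulk_score(message) >= threshold
```

No single rule decides the outcome on its own unless you give it a weight at or above the threshold, which is how weak hints such as a generic greeting can be kept from causing false positives by themselves.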

You will also need to accept that you are bound to get a substantial proportion of false positives and false negatives.

Some things are common in bulk mail but appear less often in personalised correspondence:

- The "To" and "Cc" addresses do not contain the local recipient. Sometimes the sender addresses the message to a single generic address instead of the individual recipients ("recipientA@recipientAdomain.com" and so on). In these cases, it is also likely that only one address appears in "To" and nothing appears in "Cc".
- The "From" address is "noreply@", "newsletter@", "do-not-reply@" or "mailinglist@", or, less commonly, something like "support@" or "sales@" (but remember, these could cause false positives).
- A "List-Unsubscribe:" header is present.
- The message contains an unsubscribe link. Run pattern matching to find common phrases in the final few lines of the email. Look for links, or words such as "unsubscribe", "opt out", etc.
- Mailing lists tend to have rich content. Check for heavy use of CSS and lots of images, or for the entire message being contained within a <table></table> or <ul><li></li></ul> structure, i.e. the kind of markup that something like Dreamweaver would produce, rather than a mail client.
- Headers or bold content at the top of the message. If the first bit of a message resembles a newsletter, it's probably a newsletter.
- Lots of links, or frequent linking to the same (or the same few) websites. Newsletters will try to guide the user to the company's site(s) as much as they can. You may score this even more highly if the linked domain matches (or resembles) the sender domain.
- Heavy references to social media. If it's a newsletter containing several articles, each story may have its own "Tweet this" or "Like this" link. Personal emails are likely to contain (at most) one reference to Twitter, Facebook, etc. (in the signature).
- Notifications and other auto-generated messages will often follow the same basic format. If you have the capabilities, run some kind of diffing or other comparison against previous messages. A strong match would imply automation.
- There is no greeting, or a generic greeting. However, personal emails will often skip the "Dear Fred" bit too, so this isn't a good enough detection by itself; but things like "Dear User" or "Dear Customer" are almost certainly generic.
- The message is unlikely to end in "Regards, Ian" or "Yours Sincerely, John Doe".
- The sender has scored highly before. Keep a record. If a sender triggers a high score several times, they are almost certainly bulk mailing.
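
A few of the header and text checks above can be sketched with Python's standard `email` package. This is only an illustration under some assumptions: the function names (`plain_body`, `fired_signals`), the signal labels, the list of sender local parts and the "last five lines" window are all choices made for the example, and only the first text/plain part of the message is inspected.

```python
import re
from email import message_from_string
from email.message import Message
from email.utils import getaddresses, parseaddr

# Sender local parts that hint at bulk mail (illustrative list, not exhaustive).
BULK_LOCAL_PARTS = {"noreply", "no-reply", "do-not-reply", "newsletter", "mailinglist"}
UNSUBSCRIBE_RE = re.compile(r"unsubscribe|opt[\s-]?out", re.IGNORECASE)
GENERIC_GREETING_RE = re.compile(r"^\s*dear\s+(user|customer)\b", re.IGNORECASE)


def plain_body(msg: Message) -> str:
    """Return the first text/plain part as a string, or '' if there is none."""
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:  # charset name unknown to Python
            return payload.decode("utf-8", errors="replace")
    return ""


def fired_signals(msg: Message, my_address: str) -> list:
    """Return the names of the bulk-mail hints that this message triggers."""
    signals = []

    # "To"/"Cc" do not contain the local recipient.
    listed = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    if my_address.lower() not in {addr.lower() for _, addr in listed}:
        signals.append("recipient_not_listed")

    # "From" local part such as noreply@ or newsletter@.
    _, sender = parseaddr(msg.get("From", ""))
    if sender.partition("@")[0].lower() in BULK_LOCAL_PARTS:
        signals.append("bulk_sender_address")

    # Presence of a List-Unsubscribe header.
    if msg.get("List-Unsubscribe") is not None:
        signals.append("list_unsubscribe_header")

    # Unsubscribe wording in the final few lines of the body.
    body = plain_body(msg)
    tail = "\n".join(body.strip().splitlines()[-5:])
    if UNSUBSCRIBE_RE.search(tail):
        signals.append("unsubscribe_text")

    # Generic greeting at the very start of the body.
    if GENERIC_GREETING_RE.search(body):
        signals.append("generic_greeting")

    return signals
```

With Gmail over IMAP you would parse the raw bytes fetched from the server (for example with `email.message_from_bytes`) and feed the returned signal names into whatever weighting you settle on. The HTML-structure, link-counting and diffing checks are left out here because they need an HTML parser or a store of earlier messages.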