Neural Network for Email Spam Detection

Published 2024-07-16 21:32:58

Let's say you have access to an email account with the history of received emails from the last few years (~10k emails), classified into 2 groups:

  • genuine email
  • spam

How would you approach the task of creating a neural network solution that could be used for spam detection - basically classifying any email either as spam or not spam?

Let's assume that the email fetching is already in place and we need to focus on the classification part only.

The main points which I would hope to get answered would be:

  1. Which parameters to choose as the input for the NN, and why?
  2. What structure of the NN would most likely work best for such task?

Also, any resource recommendations or existing implementations (preferably in C#) are more than welcome.

Thank you

EDIT

  • I am set on using neural networks, as the main aspect of the project is to test how the NN approach would work for spam detection
  • Also, it is a "toy problem" simply to explore the subject of neural networks and spam

深者入戏 2024-07-23 21:32:58

If you insist on NNs... I would calculate some features for every email.

Character-based, word-based, and vocabulary features (about 97 as I count these):

  1. Total no. of characters (C)
  2. Total no. of alpha chars / C (ratio of alpha chars)
  3. Total no. of digit chars / C
  4. Total no. of whitespace chars / C
  5. Frequency of each letter / C (36 letters of the keyboard – A-Z, 0-9)
  6. Frequency of special chars (10 chars: *, _, +, =, %, $, @, ـ, \, /)
  7. Total no. of words (M)
  8. Total no. of short words / M (two letters or fewer)
  9. Total no. of chars in words / C
  10. Average word length
  11. Avg. sentence length in chars
  12. Avg. sentence length in words
  13. Word length freq. distribution / M (ratio of words of length n, for n between 1 and 15)
  14. Type-token ratio (no. of unique words / M)
  15. Hapax legomena (freq. of once-occurring words)
  16. Hapax dislegomena (freq. of twice-occurring words)
  17. Yule's K measure
  18. Simpson's D measure
  19. Sichel's S measure
  20. Brunet's W measure
  21. Honoré's R measure
  22. Frequency of punctuation (18 punctuation chars: . ، ; ? ! : ( ) – “ « » < > [ ] { })

You could also add some more features based on the formatting: colors, fonts, sizes, ... used.

Most of these measures can be found online, in papers, or even Wikipedia (they're all simple calculations, probably based on the other features).

So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.

The inputs would need to be normalized according to your current pre-classified corpus.

I'd split it into two groups, use one as a training group, and the other as a testing group, never mixing them. Maybe at a 50/50 ratio of train/test groups with similar spam/nonspam ratios.
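As a rough illustration, a few of the features above (character counts, word counts, type-token ratio) could be computed like this. The feature selection, the tokenizing regex, and the min-max normalization scheme are my own choices for the sketch, and it's Python rather than the requested C#:

```python
import re

def extract_features(text):
    # A small subset of the ~100 features listed above (illustrative only).
    chars = len(text)                                # 1. total characters (C)
    words = re.findall(r"[A-Za-z']+", text)
    m = len(words)                                   # 7. total words (M)
    unique = len(set(w.lower() for w in words))
    return [
        chars,
        sum(c.isalpha() for c in text) / max(chars, 1),  # 2. alpha ratio
        sum(c.isdigit() for c in text) / max(chars, 1),  # 3. digit ratio
        sum(c.isspace() for c in text) / max(chars, 1),  # 4. whitespace ratio
        m,
        sum(len(w) <= 2 for w in words) / max(m, 1),     # 8. short-word ratio
        sum(len(w) for w in words) / max(m, 1),          # 10. avg word length
        unique / max(m, 1),                              # 14. type-token ratio
    ]

def normalize(rows):
    # Min-max normalize each feature column over the pre-classified corpus,
    # so every NN input lands in [0, 1].
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]
```

Each email then becomes one fixed-length vector, and the normalized rows are what you'd feed to the 100-input network described above.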

怀中猫帐中妖 2024-07-23 21:32:58

Are you set on doing it with a neural network? It sounds like you're set up pretty well to use Bayesian classification, which is outlined well in a couple of essays by Paul Graham.

The classified history you have access to would make very strong corpora to feed to a Bayesian algorithm; you'd probably end up with quite an effective result.

  1. You'll basically have an entire problem, of similar scope to designing and training the neural net, of feature extraction. Where I would start, if I were you, is in slicing and dicing the input text in a large number of ways, each one being a potential feature input along the lines of "this neuron signals 1.0 if 'price' and 'viagra' occur within 3 words of each other", and culling those according to best absolute correlation with spam identification.
  2. I'd start by taking my best 50 to 200 input feature neurons and hooking them up to a single output neuron (values trained for 1.0 = spam, -1.0 = not spam), i.e. a single-layer perceptron. I might try a multi-layer backpropagation net if that worked poorly, but wouldn't be holding my breath for great results.
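Item 2 can be sketched minimally as follows (Python for brevity; the toy feature vectors and learning rate are illustrative, not a real spam feature set):

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    # samples: list of (features, label) with label in {+1.0 (spam), -1.0 (ham)}
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1.0 if b + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1.0
            if pred != y:                     # classic update: only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(w, b, x):
    # Threshold the weighted sum to +/-1 to match the training targets.
    return 1.0 if b + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1.0
```

If the feature inputs are linearly separable this converges; with real spam features it often won't, which is exactly where you'd escalate to the multi-layer backpropagation net mentioned above.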

Generally, my experience has led me to believe that neural networks will show mediocre performance at best in this task, and I'd definitely recommend something Bayesian as Chad Birch suggests, if this is something other than a toy problem for exploring neural nets.

む无字情书 2024-07-23 21:32:58

Chad, the answers you've gotten so far are reasonable, but I'll respond to your update that:

I am set on using neural networks as the main aspect on the project is to test how the NN approach would work for spam detection.

Well, then you have a problem: an empirical test like this can't prove unsuitability.

You're probably best off learning a bit about what NNs actually do and don't do, to see why they are not a particularly good idea for this sort of classification problem. A helpful way to think about them is as universal function approximators. But for some idea of how this all fits together in the area of classification (which is what the spam filtering problem is), browsing an intro text like Pattern Classification might be helpful.

Failing that, if you are dead set on seeing it run, just use any general NN library for the network itself. Most of your issue is going to be how to represent the input data anyway. The 'best' structure is non-obvious, and it probably doesn't matter that much. The inputs are going to have to be a number of (normalized) measurements (features) on the corpus itself. Some are obvious (counts of 'spam' words, etc.), some much less so. This is the part you can really play around with, but you should expect to do poorly compared to Bayesian filters (which have their own problems here) due to the nature of the problem.
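To make the input-representation point concrete, here is one hypothetical way to turn an email into a fixed-length, normalized vector that any generic NN library could consume (the word list and the length cap are invented for illustration):

```python
import re

# Illustrative "spam word" list -- in practice you'd derive this from the corpus.
SPAM_WORDS = ["free", "winner", "viagra", "offer", "click"]

def to_vector(text, max_len=10000):
    words = re.findall(r"[a-z]+", text.lower())
    total = max(len(words), 1)
    # Per-word occurrence rates, one input per listed word.
    vec = [words.count(w) / total for w in SPAM_WORDS]
    # Plus one normalized length feature, capped so inputs stay in [0, 1].
    vec.append(min(len(text), max_len) / max_len)
    return vec
```

Every email maps to the same vector length regardless of its size, which is the property the network needs; the interesting work is choosing which measurements go into it.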
