Strange FANN behavior in a spam classification task

Posted 2024-12-28 17:56:04

I tried to write a simple spam classifier with the help of the FANN library. To do this I collected a number of spam and ham e-mails, and built a dictionary of the most frequently used English words.
I created a neural network with one hidden layer using the following code:

from pyfann import libfann  # FANN Python bindings (the fann2 package exposes the same module)

num_input = get_input_size(dictionary_size)  # one input neuron per dictionary word
num_neurons_hidden = 80                      # I varied this between 20 and 640
num_output = 1

ann = libfann.neural_net()
ann.create_standard_array((num_input, num_neurons_hidden, num_output))
ann.set_activation_function_hidden(libfann.SIGMOID_SYMMETRIC)
ann.set_activation_function_output(libfann.SIGMOID_SYMMETRIC)
ann.set_training_algorithm(libfann.TRAIN_INCREMENTAL)

The output is 1 when the letter is ham and -1 when it is spam. Each input neuron represents whether a specific word appears in the e-mail (1 - the word is in the mail, 0 - it is not).
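
(get_input is my own helper; roughly, it builds that binary vector like this - the body below is an illustrative sketch, not the exact code:)

def get_input(label, email_path, dictionary):
    # label (e.g. SPAM/HAM) is part of my calling convention but unused here.
    # Returns a 0/1 vector with one entry per dictionary word.
    with open(email_path, errors="ignore") as f:
        words = set(f.read().lower().split())
    return [1 if word in words else 0 for word in dictionary]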

To train the neural network I use the following code (for each e-mail in the training set):

# Create the input vector from a training e-mail
input = get_input(train_res, train_file, dictionary)
ann.train(input, (train_res,))  # train_res is 1 for ham, -1 for spam

To check whether an e-mail from the test set is spam or not, I use the following code (for each e-mail in the test set):

input = get_input(SPAM, test_spam, dictionary)
res = ann.run(input)[0]  # in [-1, 1]: positive means ham, negative means spam
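
The spam/ham decision is then a simple threshold on res (0 is the natural cut-off for a symmetric sigmoid). A simplified sketch of the counting loop, with spam_files as an illustrative name for the list of test spam files:

correct_spam = incorrect_spam = 0
for test_spam in spam_files:  # spam_files: assumed list of test spam e-mail paths
    input = get_input(SPAM, test_spam, dictionary)
    res = ann.run(input)[0]
    if res < 0:               # negative output means the network says spam
        correct_spam += 1
    else:
        incorrect_spam += 1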

But no matter what dictionary size I use (I tried from 1,000 to 40,000 words) or how many neurons the hidden layer has (20 to 640), after training the network classifies almost all e-mails as spam, or almost all as ham. For example, I receive either results like this:

Dictionary size: 10000
Hidden layer size: 80
Correctly classified hams: 596
Incorrectly classified hams: 3845
Correctly classified spams: 436
Incorrectly classified spams: 62

where almost all spams are classified correctly but almost all hams are misclassified, or results like this:

Dictionary size: 20000
Hidden layer size: 20
Correctly classified hams: 4124
Incorrectly classified hams: 397
Correctly classified spams: 116
Incorrectly classified spams: 385

which are the opposite.
I tried using more training data: I started with approximately 1,000 e-mails in the training set (the proportion of spam to ham is almost 50:50), and I am now testing with approximately 4,000 e-mails (spam:ham approximately 50:50), but the result is the same.

What could the problem be?
Thank you in advance.

Comments (2)

在风中等你 2025-01-04 17:56:04

Have you verified that there is a significant difference between spam and ham mails in terms of which words from your wordlist they contain? My guess would be that there is not a very clear difference between spam and ham when it comes to regular words.

If you are using 'real' spam mails, many spammers use something known as Bayesian poisoning, where they include lots of 'legitimate' text in order to confuse spam filters. Since you simply filter on the presence of common words, and not on words that are statistically characteristic of spam or ham, your approach will be very sensitive to Bayesian poisoning.
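
As an illustration of the difference, here is a rough sketch of scoring words by how unevenly they occur in spam versus ham, rather than by raw frequency (spam_texts/ham_texts are assumed corpora, and the smoothing is just one simple choice):

import math
from collections import Counter

def discriminative_words(spam_texts, ham_texts, top_n=1000):
    # Document frequency of each word in spam and in ham mails.
    spam_df = Counter(w for t in spam_texts for w in set(t.lower().split()))
    ham_df = Counter(w for t in ham_texts for w in set(t.lower().split()))

    def score(word):
        # Smoothed log-odds of the word appearing in spam vs. ham.
        s = (spam_df[word] + 1) / (len(spam_texts) + 2)
        h = (ham_df[word] + 1) / (len(ham_texts) + 2)
        return abs(math.log(s / h))

    vocab = set(spam_df) | set(ham_df)
    return sorted(vocab, key=score, reverse=True)[:top_n]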

被翻牌 2025-01-04 17:56:04

I don't know much about FANN, but for spam classification the training regimen is important. First of all: don't train on all the ham mails and then all the spam mails. Mix them together; preferably pick a mail at random, then train on it, whether it is ham or spam.
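
With FANN's incremental training, that could look roughly like this (train_set is an assumed name for a list of (input_vector, label) pairs, with label 1 for ham and -1 for spam):

import random

random.shuffle(train_set)  # interleave ham and spam in random order
for input_vector, label in train_set:
    ann.train(input_vector, (label,))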

Other than that, there are quite a few different ways of deciding whether the classifier should be trained on a particular message at all. For example, if the classifier already thinks a message is spam, and then you train it as spam, it may develop an unwarranted prejudice against the words in that message.

Possible training regimens include (but are not limited to):

  • TEFT (Train Everything)

    Train everything once. Usually not a good choice.

  • TOE (Train on Error)

    Train only on mails the classifier gets wrong.

  • TTR (Thick Threshold Training)

    Train on all mails the classifier gets wrong, or whose score lies within a "thick threshold". For example, if everything below 0.0 is spam, train on all e-mails classified between -0.05 and 0.05 (a sketch follows below).

  • TUNE (Train Until No Error)

    Do TOE or TTR repeatedly, until the classifier correctly classifies all training mails. This can help, but it can also hurt, depending on your training data.

There are variations on each of these, but thick threshold training will usually give good results: because it does not train on every mail, it is less prejudiced by words that appear in spam but don't really help with the ham/spam decision (for example, the Bayesian poisoning Niclas mentioned). And because it trains on borderline cases even when it classified them correctly in the training run, it gains more experience with such borderline cases and is less likely to be fooled by them in actual use.
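
A minimal sketch of one thick-threshold training pass (the 0.05 margin and the train_set structure are assumptions for illustration, matching the question's 1 = ham, -1 = spam convention):

MARGIN = 0.05  # assumed half-width of the "thick threshold" band

def ttr_pass(ann, train_set):
    # train_set: list of (input_vector, label) pairs, label 1 = ham, -1 = spam.
    for input_vector, label in train_set:
        score = ann.run(input_vector)[0]
        misclassified = (score >= 0) != (label > 0)  # predicted class != true class
        borderline = abs(score) < MARGIN             # inside the thick threshold
        if misclassified or borderline:
            ann.train(input_vector, (label,))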

As a last remark: I am assuming you're using neural networks to learn more about them, but if you actually need to filter spam, Naive Bayes or the Winnow algorithm is usually more appropriate.
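
For reference, a toy sketch of the Naive Bayes idea with presence-only features and Laplace smoothing (all names are illustrative; a real filter would need proper tokenization and more care):

import math
from collections import Counter

def train_nb(spam_texts, ham_texts):
    # Per-class document frequency of each word.
    spam_df = Counter(w for t in spam_texts for w in set(t.lower().split()))
    ham_df = Counter(w for t in ham_texts for w in set(t.lower().split()))
    return spam_df, ham_df, len(spam_texts), len(ham_texts)

def is_spam(text, spam_df, ham_df, n_spam, n_ham):
    # Log prior ratio plus per-word log-likelihood ratios (Laplace-smoothed).
    score = math.log(n_spam / n_ham)
    for w in set(text.lower().split()):
        score += math.log((spam_df[w] + 1) / (n_spam + 2))
        score -= math.log((ham_df[w] + 1) / (n_ham + 2))
    return score > 0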
