Here's a simple machine learning approach to the problem; it's what I'd do to get started and develop a baseline classifier:
Build up a corpus of scripts and attach a label to each, either 'good' (label = 0) or 'bad' (label = 1); the more scripts the better. Try to ensure that the 'bad' scripts are a reasonable fraction of the total corpus; a 50-50 good/bad split is ideal.
Develop binary features that indicate suspicious or bad scripts, for example the presence of 'eval' or the presence of 'base64_decode'. Be as comprehensive as you can and don't be afraid of including a feature that might capture some 'good' scripts too. One way to help with this is to calculate the frequency counts of words in the two classes of script and select as features the words that appear prominently in 'bad' but less prominently in 'good'.
Run the feature generator over the corpus and build up a binary matrix of features with labels.
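As a rough sketch of the feature steps above: the snippet below counts, for each class, how many scripts contain each word, keeps the words that are much more common in 'bad' than in 'good', and then builds the binary matrix. The toy corpus, the tokenising regex, the `min_gap` threshold and the function names are all my own assumptions for illustration, not anything prescribed here.

```python
import re
from collections import Counter

# Hypothetical toy corpus of (script text, label); 0 = good, 1 = bad.
corpus = [
    ("<?php echo 'hello world'; ?>", 0),
    ("<?php $total = count($items); echo $total; ?>", 0),
    ("<?php eval(base64_decode($_POST['x'])); ?>", 1),
    ("<?php $p = base64_decode($c); eval($p); ?>", 1),
]

# Assumed tokeniser: identifier-like words only.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokens(script):
    """Return the set of identifier-like words that occur in a script."""
    return set(TOKEN.findall(script))


def select_features(corpus, min_gap=0.75):
    """Keep words whose share of 'bad' scripts exceeds their share of
    'good' scripts by at least min_gap (an arbitrary threshold)."""
    good, bad = Counter(), Counter()
    n_good = n_bad = 0
    for text, label in corpus:
        if label == 1:
            bad.update(tokens(text))
            n_bad += 1
        else:
            good.update(tokens(text))
            n_good += 1
    if n_good == 0 or n_bad == 0:
        raise ValueError("corpus needs at least one script of each class")
    return [
        word
        for word in sorted(bad)
        if bad[word] / n_bad - good[word] / n_good >= min_gap
    ]


def featurize(script, features):
    """Binary feature vector: 1 if the word is present in the script, else 0."""
    present = tokens(script)
    return [1 if word in present else 0 for word in features]


features = select_features(corpus)
X = [featurize(text, features) for text, _ in corpus]  # binary feature matrix
y = [label for _, label in corpus]                     # matching labels
```

On a real corpus you'd want to combine the automatically selected words with hand-picked ones, and tune the threshold so that you don't end up with features that only match one or two scripts.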
Split the corpus into a training set (80% of examples) and a test set (20%). Using the scikit-learn library, train a few different classification algorithms (random forests, support vector machines, naive Bayes, etc.) on the training set and test their performance on the unseen test set.
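A minimal sketch of the split-and-train step with scikit-learn might look like the following. The feature matrix here is synthetic random data standing in for the real one (purely an assumption so the snippet runs on its own), and the particular estimators and hyperparameters are just reasonable defaults, not tuned choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

# Synthetic stand-in for the real binary feature matrix (not real data).
rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
y = rng.integers(0, 2, size=n_samples)  # 0 = good, 1 = bad
# Each feature fires with probability 0.7 in 'bad' scripts, 0.2 in 'good' ones.
p = np.where(y[:, None] == 1, 0.7, 0.2)
X = (rng.random((n_samples, n_features)) < p).astype(int)

# 80/20 split, stratified so both sets keep the same good/bad ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="linear"),
    "naive Bayes": BernoulliNB(),  # Bernoulli variant suits binary features
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Accuracy on the held-out test set only; the model never saw these rows.
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2f}")
```

If the real corpus ends up far from 50-50, accuracy alone can be misleading, so it's worth also looking at precision and recall for the 'bad' class (for example with `sklearn.metrics.classification_report`).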
Hopefully that gives me a reasonable classification accuracy to benchmark against. I'd then look at improving the features, at some unsupervised methods (which work without labels) and at more specialised algorithms to get better performance.
For resources, Andrew Ng's Coursera course on Machine Learning (which includes a spam classification example, I believe) is a good start.