How to choose training data for a naive Bayes classifier
I want to double-check some concepts I'm uncertain about regarding the training set for classifier learning. When we select records for our training data, do we select an equal number of records per class, summing to N, or should we randomly pick N records regardless of class?
Intuitively I was leaning toward the former, but then the prior class probabilities would all be equal, which doesn't seem very helpful. Is that right?
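To make the tension in the question concrete, here is a minimal sketch (function and label names are my own, purely illustrative) of how a naive Bayes prior P(class) is estimated from label counts, and how the two sampling strategies change it:

```python
from collections import Counter

def class_priors(labels):
    """Estimate P(class) from the labels of a training set,
    as a maximum-likelihood naive Bayes prior would."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Randomly picking N records keeps the natural skew in the priors:
random_sample = ["ham"] * 90 + ["spam"] * 10
print(class_priors(random_sample))    # {'ham': 0.9, 'spam': 0.1}

# An equal number of records per class forces uniform priors,
# which indeed carry no information about class frequency:
balanced_sample = ["ham"] * 50 + ["spam"] * 50
print(class_priors(balanced_sample))  # {'ham': 0.5, 'spam': 0.5}
```

So the questioner's intuition is accurate: equal-per-class sampling makes the prior uninformative, and the answers below discuss when that trade-off is worth it.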
3 Answers
It depends on the distribution of your classes, and the determination can only be made with domain knowledge of the problem at hand.
You can ask questions such as: is one class much rarer than the others, and do you still need the classifier to recognize it?
If so, you should probably redistribute the classes.
In my experience, there is no harm in redistributing the classes, but it's not always necessary.
It really depends on the distribution of your classes. In the case of fraud or intrusion detection, the class you want to predict can make up less than 1% of the data.
In this case you must distribute the classes evenly in the training set if you want the classifier to learn differences between each class. Otherwise, it will produce a classifier that correctly classifies over 99% of the cases without ever correctly identifying a fraud case, which is the whole point of creating a classifier to begin with.
Once you have a set of evenly distributed classes you can use any technique, such as k-fold, to perform the actual training.
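One common way to get such an evenly distributed training set is to randomly undersample every class down to the size of the rarest one. A minimal sketch in plain Python (the function name and data are my own, illustrative only):

```python
import random

def undersample(records, labels, seed=0):
    """Randomly undersample every class down to the size
    of the rarest class, returning (record, label) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    n = min(len(recs) for recs in by_class.values())
    out = []
    for lab, recs in by_class.items():
        for rec in rng.sample(recs, n):
            out.append((rec, lab))
    rng.shuffle(out)
    return out

# 990 legitimate transactions vs. 10 frauds -> 10 of each after balancing
data = [(i, "legit") for i in range(990)] + [(i, "fraud") for i in range(10)]
balanced = undersample([r for r, _ in data], [l for _, l in data])
print(len(balanced))  # 20
```

Undersampling throws data away, so in practice you might instead oversample the rare class or weight it; the point is only that the classifier sees enough of each class to learn their differences.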
Another example where class distributions need to be adjusted, but not necessarily in an equal number of records for each, is the case of determining upper-case letters of the alphabet from their shapes.
If you take a distribution of letters commonly used in the English language to train the classifier, there will be almost no cases, if any, of the letter Q. On the other hand, the letter O is very common. If you don't redistribute the classes to allow for the same number of Q's and O's, the classifier doesn't have enough information to ever distinguish a Q. You need to feed it enough information (i.e. more Q's) so it can determine that Q and O are indeed different letters.
The preferred approach is to use K-fold cross-validation for picking the learning and testing data.
Quote from Wikipedia:
You should always take the common approach in order to have comparable results with other scientific data.
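For illustration, here is a minimal pure-Python sketch of the k-fold split this answer recommends (the function name is my own; libraries such as scikit-learn provide ready-made, stratified versions):

```python
def kfold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross-validation over n records: each record lands in
    exactly one test fold, and trains on the other k-1 folds."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 10 records, 5 folds -> 5 train/test splits of 8 and 2 records
folds = list(kfold_indices(10, 5))
print(len(folds))  # 5
print(folds[0])    # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

Note that with heavily skewed classes you would want a *stratified* variant that preserves (or fixes) the class ratio inside every fold, tying this answer back to the redistribution discussion above.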
I built an implementation of a Bayesian classifier to determine if a sample is NSFW (Not Safe For Work) by examining the occurrence of words in examples. When training the classifier for NSFW detection I tried giving each class in the training set the same number of examples. This didn't work out as well as I had planned, since one of the classes had many more words per example than the other.
Since I was computing the likelihood of NSFW based on these words, I found that balancing the classes by their actual size (in MB) worked better. I ran 10-fold cross-validation for both approaches (balancing by number of examples and by size of the classes) and found that balancing by the size of the data worked well.
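The idea in this answer can be sketched as balancing by total text volume rather than by document count. The following is a toy illustration (function names and data are my own, not the answerer's code), using word counts as a stand-in for size in MB:

```python
def word_count(docs):
    """Total number of whitespace-separated words across documents."""
    return sum(len(d.split()) for d in docs)

def balance_by_text_size(large, small):
    """Keep whole documents from the larger class until its total
    word count would exceed the smaller class's, so both classes
    contribute a similar volume of text to the word statistics."""
    budget = word_count(small)
    kept, used = [], 0
    for doc in large:
        n = len(doc.split())
        if used + n > budget:
            break
        kept.append(doc)
        used += n
    return kept

nsfw = ["one two three", "four five"]        # 5 words total
sfw = ["a b c d", "e f g", "h i j k l"]      # 12 words total
print(word_count(balance_by_text_size(sfw, nsfw)))  # 4
```

This matters for word-based naive Bayes because the per-word likelihoods are driven by how much text each class contributes, not by how many documents it has.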