How to choose training data for a naive Bayes classifier

Published 2024-11-18 14:54:36

I want to double-check some concepts I am uncertain about regarding the training set for classifier learning. When we select records for our training data, should we select an equal number of records per class, summing to N, or should we randomly pick N records regardless of class?

Intuitively I was leaning toward the former, but then the prior class probabilities would all be equal, which doesn't seem really helpful?
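To make the concern concrete: a naive Bayes learner typically estimates each class prior as the relative frequency of that class in the training set, so sampling an equal number of records per class forces the priors to be flat. A minimal sketch (the function name `class_priors` and the spam/ham labels are illustrative, not from the question):

```python
from collections import Counter

def class_priors(labels):
    """Estimate P(class) by relative frequency in the training labels,
    as a naive Bayes learner typically does."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Randomly sampled training set: priors reflect the real class skew.
print(class_priors(["spam"] * 30 + ["ham"] * 70))  # {'spam': 0.3, 'ham': 0.7}

# Equal records per class: priors are flat and carry no information.
print(class_priors(["spam"] * 50 + ["ham"] * 50))  # {'spam': 0.5, 'ham': 0.5}
```

Whether flat priors are a problem depends on whether the deployment-time class distribution matches the training one, which is exactly what the answers below discuss.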

Comments (3)

一紙繁鸢 2024-11-25 14:54:36

It depends on the distribution of your classes, and the determination can only be made with domain knowledge of the problem at hand.
You can ask the following questions:

  • Are there any two classes that are very similar and does the learner have enough information to distinguish between them?
  • Is there a large difference in the prior probabilities of each class?

If so, you should probably redistribute the classes.

In my experience, there is no harm in redistributing the classes, but it's not always necessary.

It really depends on the distribution of your classes. In the case of fraud or intrusion detection, the class you want to predict can account for less than 1% of the records.
In this case you must distribute the classes evenly in the training set if you want the classifier to learn differences between each class. Otherwise, it will produce a classifier that correctly classifies over 99% of the cases without ever correctly identifying a fraud case, which is the whole point of creating a classifier to begin with.
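One common way to get that even distribution is to randomly undersample the majority class down to the size of the minority class. A minimal sketch under that assumption (the function name `undersample` and the legit/fraud labels are illustrative):

```python
import random

def undersample(records, labels, seed=0):
    """Balance a training set by randomly undersampling every class
    down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    n_min = min(len(recs) for recs in by_class.values())
    balanced = []
    for lab, recs in by_class.items():
        # Draw n_min records without replacement from each class.
        for rec in rng.sample(recs, n_min):
            balanced.append((rec, lab))
    rng.shuffle(balanced)
    return balanced

# 990 legitimate transactions vs. 10 fraudulent ones -> 10 of each.
data = [(i, "legit") for i in range(990)] + [(i, "fraud") for i in range(10)]
recs, labs = zip(*data)
balanced = undersample(recs, labs)
print(len(balanced))  # 20
```

Undersampling throws away majority-class data; oversampling the minority class is the mirror-image option when data is scarce.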

Once you have a set of evenly distributed classes you can use any technique, such as k-fold, to perform the actual training.

Another example where the class distribution needs to be adjusted, though not necessarily to an equal number of records per class, is determining the upper-case letters of the alphabet from their shapes.

If you take a distribution of letters commonly used in the English language to train the classifier, there will be almost no cases, if any, of the letter Q. On the other hand, the letter O is very common. If you don't redistribute the classes to allow for the same number of Q's and O's, the classifier doesn't have enough information to ever distinguish a Q. You need to feed it enough information (i.e. more Qs) so it can determine that Q and O are indeed different letters.

何必那么矫情 2024-11-25 14:54:36

The preferred approach is to use K-fold cross-validation for selecting training and test data.

Quote from Wikipedia:

K-fold cross-validation

In K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be averaged (or otherwise combined) to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used.

In stratified K-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.

You should always take the common approach so that your results are comparable with other scientific work.
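The partitioning scheme from the quote can be sketched in a few lines; this is a plain illustration of the idea (in practice a library routine such as scikit-learn's `KFold` would be used), and the function name `k_fold_indices` is made up for the example:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists for K-fold cross-validation:
    the n indices are randomly partitioned into k folds, and each fold
    serves as the validation set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # random partition, per the quote
    folds = [idx[i::k] for i in range(k)]  # k roughly equal subsamples
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

# 10 observations, 5 folds: every index is validated exactly once.
validated = []
for train, val in k_fold_indices(10, 5):
    validated.extend(val)
print(sorted(validated))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The stratified variant additionally groups the shuffle by class label so each fold keeps roughly the original class proportions.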

尴尬癌患者 2024-11-25 14:54:36

I built an implementation of a Bayesian classifier that determines whether a sample is NSFW (not safe for work) by examining the occurrence of words in examples. When training the classifier for NSFW detection, I tried giving each class in the training set the same number of examples. This didn't work out as well as I had planned, because one of the classes had many more words per example than the other.

Since I was computing the likelihood of NSFW based on these words, I found that balancing the classes by their actual size (in MB) worked. I tried 10-fold cross-validation for both approaches (balancing by number of examples and by class size) and found that balancing by data size worked well.
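The idea of balancing by data volume rather than example count can be sketched as follows; the answer doesn't give code, so this is one plausible reading (the function name `balance_by_size`, the byte budget, and the nsfw/safe labels are all assumptions for illustration):

```python
import random

def balance_by_size(docs_by_class, seed=0):
    """Balance classes by total text size in bytes rather than by the
    number of documents: each class contributes documents until it
    reaches the byte count of the smallest class."""
    rng = random.Random(seed)
    sizes = {c: sum(len(d.encode()) for d in docs)
             for c, docs in docs_by_class.items()}
    budget = min(sizes.values())  # smallest class sets the byte budget
    balanced = {}
    for c, docs in docs_by_class.items():
        pool = docs[:]
        rng.shuffle(pool)
        picked, used = [], 0
        for d in pool:
            if used >= budget:
                break
            picked.append(d)
            used += len(d.encode())
        balanced[c] = picked
    return balanced

corpus = {"nsfw": ["aa", "bb"],              # 4 bytes total
          "safe": ["cccc", "dddd", "eeee"]}  # 12 bytes total
balanced = balance_by_size(corpus)
print({c: sum(len(d) for d in docs) for c, docs in balanced.items()})
# {'nsfw': 4, 'safe': 4}
```

This matches the observation that word-level likelihoods depend on how much text each class contributes, not on how many documents it has.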
