
How can I deliberately overfit a Weka tree classifier?

Posted on 2024-09-09 07:13:36


I have a binary class dataset (0 / 1) with a large skew towards the "0" class (about 30000 vs 1500). There are 7 features for each instance, no missing values.

When I use J48 or any other tree classifier, almost all of the "1" instances are misclassified as "0".

I tried setting the classifier to "unpruned", setting the minimum number of instances per leaf to 1, setting the confidence factor to 1, and adding a dummy attribute holding the instance ID number. None of this helped.
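For a sense of scale: with roughly 30000 negatives and 1500 positives, a model that labels everything "0" is already about 95% accurate while finding none of the positives. A minimal sketch of that arithmetic, using the approximate counts quoted above (the function name is only for illustration):

```python
def majority_baseline(n_negative, n_positive):
    """Accuracy and class-1 recall of a model that always predicts class 0."""
    total = n_negative + n_positive
    accuracy = n_negative / total   # every negative is right, every positive is wrong
    recall_positive = 0.0           # no "1" instance is ever predicted
    return accuracy, recall_positive


accuracy, recall = majority_baseline(30000, 1500)
print(f"accuracy={accuracy:.3f}, recall on class 1={recall:.1f}")
```

So overall accuracy alone says little here; the per-class error counts are what show whether the "1" class is being learned at all.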

I just can't create a model that overfits my data!

I've also tried almost all of the other classifiers Weka provides, but got similar results.

IB1 reaches 100% accuracy when trained and tested on the same training set, so the problem is not caused by multiple instances that share the same feature values but belong to different classes.
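The reasoning behind the IB1 check can be made explicit: a 1-nearest-neighbour classifier scored on its own training set should only go wrong when two instances have identical feature values but different labels. A small sketch that looks for such conflicts directly; the rows are made-up toy data, not the dataset from the question, and the function name is hypothetical:

```python
from collections import defaultdict


def conflicting_instances(rows):
    """Return feature vectors that occur with more than one class label.

    Each row is (features, label), with features given as a tuple.
    """
    labels_seen = defaultdict(set)
    for features, label in rows:
        labels_seen[tuple(features)].add(label)
    return {f: labels for f, labels in labels_seen.items() if len(labels) > 1}


# Hypothetical toy rows with 2 features each (the real data has 7).
toy_rows = [
    ((0.1, 0.9), 0),
    ((0.1, 0.9), 1),   # same features as the row above, different class
    ((0.4, 0.2), 0),
    ((0.7, 0.5), 1),
]
conflicts = conflicting_instances(toy_rows)
print(conflicts)
```

An empty result is consistent with IB1 scoring 100% on the training set, and it means that a sufficiently deep tree could, in principle, separate the training data perfectly.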

How can I create a completely unpruned tree?
Or otherwise force Weka to overfit my data?

Thanks.

Update: Okay, this is absurd. I've used only about 3100 negative and 1200 positive examples, and this is the tree I got (unpruned!):

J48 unpruned tree
------------------

F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)
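In J48's output, the numbers in parentheses after each leaf are the instances reaching that leaf and, after the slash, how many of them the leaf misclassifies. A quick sketch that reads the two leaves above and works out the training accuracy they imply; the regular expression and function name are illustrative and not part of Weka:

```python
import re

TREE_TEXT = """\
F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)
"""

# Matches "(covered/misclassified)" or "(covered)" at the end of a leaf line.
LEAF_COUNTS = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def training_accuracy(tree_text):
    """Sum the covered and misclassified counts over all leaves."""
    covered = errors = 0.0
    for line in tree_text.splitlines():
        match = LEAF_COUNTS.search(line)
        if match:
            covered += float(match.group(1))
            errors += float(match.group(2) or 0.0)  # no slash means no errors
    return covered, errors, (covered - errors) / covered


covered, errors, accuracy = training_accuracy(TREE_TEXT)
print(f"{covered:.0f} instances, {errors:.0f} errors, accuracy {accuracy:.1%}")
```

That works out to roughly 74% on the training data, with 1062 instances sitting in the wrong class inside the "0" leaf, so the tree stopped splitting long before it could overfit.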

Needless to say, IB1 still gives 100% accuracy.

Update 2: Don't know how I missed it, but unpruned SimpleCart works and gives 100% accuracy when trained and tested on the training set. Pruned SimpleCart is not as biased as J48 and has decent false positive and false negative rates.
