Machine learning - training a model with imbalanced data
I have two classes in my data.
This is what the class distribution looks like:
0.0 169072
1.0 84944
In other words, I have a 2:1 class distribution.
I believe I have two choices: downsample class 0.0, or upsample class 1.0. If I go with option 1, I'm losing data. If I go with option 2, I'm using non-real data.
Is there a way I can train the model without upsampling or downsampling?
This is what my classification_report looks like:
              precision  recall  f1-score  support
         0.0       0.68    1.00      0.81    51683
         1.0       1.00    0.00      0.00    24522
    accuracy                         0.68    76205
   macro avg       0.84    0.50      0.40    76205
weighted avg       0.78    0.68      0.55    76205
Answers (2)
Your data is slightly imbalanced, yes, but that does not mean you only have one of the two options (under- or over-sampling your data). You can leave the data as is and apply cost-sensitive training in your model. For example, if in your case the classes have a ratio of 2:1, then you give a weight of 2 to your minority class. In the example of an XGBoost classifier, this argument is called scale_pos_weight. See more in this excellent tutorial. Regarding model evaluation, you should use a classification report to get a full picture of your model's true and false predictions (precision and recall are your two best friends in this process!).
I would not recommend either approach.
I'm thinking about models to detect fraud. By definition, fraud should be a small percentage of outcomes - on the order of 1-5%. Changing the percentage for training would be a gross distortion of the problem being solved.
Better to leave the proportions as they are.
Make sure that your train, validation, and test data sets all have ratios that reflect the real problem.
Adjust your evaluation metric instead. Don't go for accuracy: a naive model that always predicts the 0 outcome will be correct 2/3 of the time. You want your model to do better than that, or than a weighted coin flip.
I'd recommend using recall as your criterion for success.
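To make the accuracy-vs-recall point concrete, here is a small sketch (using the support counts from the question's classification report) showing why a model that always predicts 0 looks fine on accuracy but fails on minority-class recall:

```python
# Support counts from the question's classification report.
n_class0 = 51683
n_class1 = 24522
total = n_class0 + n_class1  # 76205

# A naive model that always predicts 0.0 matches the reported accuracy...
naive_accuracy = n_class0 / total
print(round(naive_accuracy, 2))  # 0.68

# ...but its recall on the minority class is zero: it never finds a 1.0.
true_positives = 0  # the naive model never predicts class 1.0
naive_recall = true_positives / n_class1
print(naive_recall)  # 0.0
```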