Machine learning - training a model with imbalanced data

Posted 2025-02-11 04:26:39


I have two classes in my data.

This is what the class distribution looks like:

0.0    169072
1.0     84944

In other words, I have a 2:1 class distribution.

I believe I have two choices: downsample class 0.0 or upsample class 1.0. If I go with option 1, I'm losing data. If I go with option 2, I'm using non-real data.

Is there a way I can train the model without upsampling or downsampling?

This is what my classification_report looks like:

               precision    recall  f1-score   support

         0.0       0.68      1.00      0.81     51683
         1.0       1.00      0.00      0.00     24522

    accuracy                           0.68     76205
   macro avg       0.84      0.50      0.40     76205
weighted avg       0.78      0.68      0.55     76205


Comments (2)

以可爱出名 2025-02-18 04:26:39


Your data is slightly imbalanced, yes, but that does not mean you are limited to one of those two options (under- or over-sampling your data). You can leave the data as is and apply cost-sensitive training in your model. For example, since in your case the classes have a ratio of 2:1, you would give a weight of 2 to your minority class. In an XGBoost classifier, this argument is called scale_pos_weight. See more in this excellent tutorial.

Regarding model evaluation, you should use a classification report to get a full picture of your model's true and false predictions (precision and recall are your two best friends in this process!).
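The cost-sensitive idea can be sketched with scikit-learn's `class_weight` parameter (the synthetic 2:1 data below is illustrative, not the asker's actual dataset; the XGBoost equivalent would be `scale_pos_weight=2`):

```python
# Cost-sensitive training instead of re-sampling: keep the data as is and
# weight the minority class. Synthetic 2:1 data mirrors the question's ratio.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Roughly 2/3 class 0, 1/3 class 1 -- a 2:1 imbalance.
X, y = make_classification(n_samples=6000, weights=[2 / 3, 1 / 3],
                           random_state=0)

# Option A: explicit weights -- the minority class (1) counts double.
clf = LogisticRegression(class_weight={0: 1, 1: 2}).fit(X, y)

# Option B: derive weights automatically from class frequencies.
clf_balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# Precision/recall per class, as in the question's classification_report.
print(classification_report(y, clf.predict(X)))
```

With a weighted loss, misclassifying a minority-class example costs twice as much, which pushes the model away from the "predict 0 for everything" degenerate solution visible in the question's report (recall 0.00 for class 1.0).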

阳光的暖冬 2025-02-18 04:26:39


I would not recommend either approach.

Consider models that detect fraud. By definition, fraud should be a small percentage of outcomes - on the order of 1-5%. Changing the percentage for training would be a gross distortion of the problem being solved.

Better to leave the proportions as they are.

Make sure that your train, validation, and test data sets all have ratios that reflect the real problem.

Adjust your success metric instead. Don't go for accuracy: a naive model that always predicts the 0 outcome will be correct 2/3 of the time. You want your model to be better than that, or than a weighted coin flip.

I'd recommend using recall as your criterion for success.
