Sampling strategy for multiclass classification
Scenario:
I am currently working on a multiclass classification problem. I have a historical dataset of 2 million records with 180 classes, and I need to build a model that predicts the classes accurately. I built a model with the HybridGradientBoosting algorithm, which gives decent accuracy of around 80-85%.
Note: I also tried other classification algorithms, but they did not perform well in prediction.
Problem:
I tried upsampling, downsampling, and a combination of both using the imblearn library. I still face problems at prediction time: the model reports good overall accuracy, but its predictions are wrong for many of the classes.
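The gap described above, good overall accuracy with poor results on many classes, is easier to see with per-class metrics than with a single accuracy figure. Here is a minimal sketch in plain Python. The labels `y_true` and `y_pred` are hypothetical toy data, not taken from the dataset in question, and `per_class_recall` is an illustrative helper, not a library function.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {class: recall} for every class that appears in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / n for cls, n in support.items()}

# Hypothetical toy data: 'A' dominates, 'S' is rare and always missed.
y_true = ["A"] * 90 + ["S"] * 10
y_pred = ["A"] * 100

recall = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_recall = sum(recall.values()) / len(recall)

print(accuracy)      # 0.9
print(recall)        # {'A': 1.0, 'S': 0.0}
print(macro_recall)  # 0.5
```

Accuracy stays at 0.9 even though the rare class is never predicted. The macro-averaged recall weights every class equally, so it exposes the failure.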
Question:
What kind of sampling strategy should I apply to the dataset below (a sample of the full dataset) to get a good predictive model?
Do I need to stack models, that is, divide the dataset into three ranges by class size, train three models, and stack their results?
Note: The dataset below contains no null values and no duplicates.
Sample dataset:
Class   Number of records
A       12385
B       6932
C       3183
D       999
E       900
F       891
G       802
H       760
I       630
J       264
K       257
L       257
M       161
N       132
O       77
P       59
Q       31
R       18
S       8
Can you please share your sampling strategy for such a dataset?
Adding code:
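One way to frame a sampling strategy for the counts above is to pick a floor and a cap, then derive per-class target sizes from them. The sketch below only builds the target dictionaries. `FLOOR` and `CAP` are hypothetical values chosen for illustration, not recommendations.

```python
# Class counts copied from the sample dataset above.
counts = {
    "A": 12385, "B": 6932, "C": 3183, "D": 999, "E": 900, "F": 891,
    "G": 802, "H": 760, "I": 630, "J": 264, "K": 257, "L": 257,
    "M": 161, "N": 132, "O": 77, "P": 59, "Q": 31, "R": 18, "S": 8,
}

FLOOR = 500   # hypothetical minimum class size after oversampling
CAP = 3000    # hypothetical maximum class size after undersampling

# Classes larger than CAP are cut down; classes smaller than FLOOR are raised.
under_targets = {c: CAP for c, n in counts.items() if n > CAP}
over_targets = {c: FLOOR for c, n in counts.items() if n < FLOOR}

print(under_targets)         # {'A': 3000, 'B': 3000, 'C': 3000}
print(sorted(over_targets))  # ['J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S']
```

Each dictionary could then be passed as `sampling_strategy` to an imblearn sampler such as `RandomUnderSampler` or `SMOTE`, since those accept a dict that maps a class label to the desired number of samples. With only 8 records in class S, SMOTE's `k_neighbors` may need to be lowered once part of the data is held out for testing.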
# Sampling: oversample with SMOTE, then clean with Edited Nearest Neighbours
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

smote_enn = SMOTEENN(random_state=0, enn=EditedNearestNeighbours(kind_sel='mode'))
Xsample1, y_resampled1 = smote_enn.fit_resample(X, y)
`Before SMOTE: Counter({'A': 12385, 'B': 6932, 'C': 3183, 'D': 3158, 'E': 955 ... many more classes
After SMOTE: [('B', 11873), ('C', 12320), ('D', 12327), ('A', 10404), ('E', 12326)] ... many more classes`
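The snippet above does not show whether resampling happens before or after the train/test partition. A common precaution is to split first and resample only the training portion, so that duplicated or synthetic rows cannot leak into the evaluation set. The sketch below shows that idea in plain Python as an assumption, not as a description of the pipeline used here. It uses simple random duplication instead of SMOTE. The helpers `stratified_split` and `oversample_to`, the toy labels, and the target of 80 are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

def oversample_to(train_idx, labels, target, rng):
    """Duplicate random training rows until every class has at least `target`."""
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[labels[i]].append(i)
    out = list(train_idx)
    for idx in by_class.values():
        out.extend(rng.choices(idx, k=max(0, target - len(idx))))
    return out

rng = random.Random(0)
labels = ["A"] * 100 + ["R"] * 18 + ["S"] * 8   # hypothetical toy labels
train_idx, test_idx = stratified_split(labels, 0.2, rng)
train_idx = oversample_to(train_idx, labels, 80, rng)

print(Counter(labels[i] for i in train_idx))  # every class has 80 training rows
print(Counter(labels[i] for i in test_idx))   # test set keeps the original skew
```

The test rows are never touched by the oversampling step, so metrics computed on them reflect the original class distribution.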
# SAP classification algorithm (hana_ml)
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

# n_estimators, learning_rate, max_depth: values selected after hyperparameter tuning
rdt_params = dict(random_state=2, n_estimators=16, learning_rate=0.25, max_depth=30)
uc_rdt = UnifiedClassification(func='HybridGradientBoostingTree', **rdt_params)
uc_rdt.fit(data=final,
           key=col_id,
           features=features,
           label='class',
           partition_method='stratified',
           stratified_column='class',
           partition_random_state=2,
           training_percent=0.8,
           ntiles=2)
Accuracy: 0.906 ; AUC: 0.9962 ; KAPPA: 3.9813
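As a reference point when reading the metrics above, Cohen's kappa is defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance, so it cannot exceed 1. A reported value of 3.9813 may therefore be on a different scale or come from a different statistic. Here is a minimal sketch in plain Python on hypothetical labels, where `cohen_kappa` is an illustrative helper:

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: fraction of matching labels.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of marginal frequencies, summed over classes.
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    p_e = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy labels with one mistake out of six.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
print(round(cohen_kappa(y_true, y_pred), 4))  # 0.7391
```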