Accessing the feature names of a random forest in a scikit-learn Pipeline after feature selection

Posted on 2025-01-09 11:50:39

I'm running a random forest classifier on a dataset as one step of a scikit-learn pipeline.

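from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, VarianceThreshold, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# x_train_up (a DataFrame with columns p1..p13) and y_train_up are defined elsewhere

# Numerical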
numeric_cols = ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7']
numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)

# Categorical
categ_cols = ['p8', 'p9', 'p10', 'p11', 'p12', 'p13']
categ_transformer = OneHotEncoder(handle_unknown="ignore")

# Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categ_transformer, categ_cols),
    ]
)

rf_pipe = Pipeline(
    steps=[("preprocessor", preprocessor),
           ("feature_selection_var", VarianceThreshold()),
           ("feature_selection_percentile", SelectPercentile(f_classif, percentile=90)),
           ("classifier", RandomForestClassifier(n_jobs=-1, class_weight='balanced',
                                                 criterion='entropy', max_features=10,
                                                 min_samples_leaf=50, n_estimators=1000))]
)
cross_score = cross_val_score(rf_pipe, x_train_up, y_train_up, cv=10, scoring='roc_auc', n_jobs=-1)
print(f'cross_mean: {cross_score.mean()}, cross_std: {cross_score.std()}')
rf_pipe.fit(x_train_up, y_train_up)

I want to plot the feature_importances_ attribute of the RandomForestClassifier (RFC), but because my pipeline performs feature selection, I can't tell which feature names were actually used in fit. What I do know is that X contains 31 features after the OneHotEncoder and 27 features after SelectPercentile, and those 27 are the ones the RFC is trained on.

How can I identify which features were selected and passed to the RFC? When I access the RFC's attributes, I only get the importance values; the feature names are not available.

rf_pipe.named_steps['classifier'].feature_importances_

out: array([8.41159321e-02, 1.23094971e-01, 1.62218154e-02, 3.34926745e-01,
       1.06620128e-01, 1.37351967e-01, 9.39408084e-03, 1.74327442e-02,
       1.62594558e-02, 1.66887184e-04, 1.66724711e-02, 7.06176017e-03,
       6.81514535e-03, 1.11633257e-02, 1.32052716e-02, 3.72520454e-03,
       3.64255314e-03, 1.25925324e-02, 1.12110261e-02, 9.37540757e-04,
       7.53327441e-03, 7.30348346e-03, 1.40424287e-02, 2.04903820e-03,
       1.73613154e-02, 9.33500153e-03, 9.76390164e-03])

rf_pipe.named_steps['classifier'].feature_names_in_

out: 
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\Users\PHELIP~1.SOA\AppData\Local\Temp/ipykernel_10268/205801647.py in <module>
----> 1 rf_pipe.named_steps['classifier'].feature_names_in_

AttributeError: 'RandomForestClassifier' object has no attribute 'feature_names_in_'

Answer by 伴梦长久, 2025-01-16 11:50:39

Here is the way I found to solve this problem:

import matplotlib.pyplot as plt
import numpy as np

# Access the fitted pipeline steps:

# names of the features that come out of the preprocessor and go into feature selection
# (rf_pipe.fit has already fitted this step, so it does not need to be refitted)
x_features = rf_pipe.named_steps['preprocessor'].get_feature_names_out()

# boolean masks (True = kept) from each feature selection step
mask_var = rf_pipe.named_steps['feature_selection_var'].get_support()
mask_pct = rf_pipe.named_steps['feature_selection_percentile'].get_support()

# apply the masks in pipeline order to get the names of the features that were chosen;
# mask_pct is relative to the output of VarianceThreshold, so mask_var is applied first
x_features_used = x_features[mask_var][mask_pct]

# take the array with feature importance values
importances = rf_pipe.named_steps['classifier'].feature_importances_

# indices that sort the importances in ascending order
indices = np.argsort(importances)

# plot results
plt.figure(figsize=(15, 10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), x_features_used[indices])
plt.xlabel('Relative Importance')
plt.show()
