Scikit-Learn: how to use a pipeline and how to do logistic + ridge regression

Posted on 2025-02-05 01:46:00

Two questions:

I'm trying to run a model that predicts churn. A lot of my features have multicollinearity issues. To address this problem I'm trying to penalize the coefficients with Ridge.

More specifically I'm trying to run a logistic regression but apply Ridge penalties (not sure if that makes sense) to the model...

Questions:

Would selecting a ridge regression classifier suffice for this? Or do I need to select a logistic regression classifier and add some parameter for the ridge penalty (i.e. LogisticRegression(apply_penality=Ridge))?

I'm trying to determine feature importance and, through some research, it seems like I need to use this:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

However, I'm confused about how to access this class if my model has been built with the sklearn.pipeline.make_pipeline function.

I'm just trying to figure out which independent variables have the most importance in predicting my label.

Code below for reference

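#imports needed by the snippet below
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#prep data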
X_prep = df_dummy.drop(columns='CHURN_FLAG')

#set predictor and target variables
X = X_prep #all features except churn_flag
y = df_dummy["CHURN_FLAG"]

#create train /test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)


'''
StandardScaler() -> Incoming data needs to be standardized before any other transformation is performed on it.
SelectKBest() -> This class comes from the feature_selection module of scikit-learn. It selects the best features based on a specified scoring function (in this case, f_regression).
 The number of features is specified by the value of parameter k. Even within the selected features, we want to vary the final set of features that are fed to the model, and find what performs best. We can do that with GridSearchCV.
 Ridge() -> This is the estimator that performs the actual regression; it is used to reduce the effect of multicollinearity.
 GridSearchCV -> Besides searching over all the combinations of the selected parameters, GridSearchCV performs cross-validation on the training data.
'''
#Setting up a pipeline
pipe= make_pipeline(StandardScaler(),SelectKBest(f_regression),Ridge())

#A quick way to get a list of parameters that a pipeline can accept
#pipe.get_params().keys()

#putting together a parameter grid to search over using grid search
params={
    'selectkbest__k':[1,2,3,4,5,6],
    'ridge__fit_intercept':[True,False],
    'ridge__alpha':[0.01,0.1,1,10],
    'ridge__solver':['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
}
#setting up the grid search
gs=GridSearchCV(pipe,params,n_jobs=-1,cv=5)
#fitting gs to training data
gs.fit(Xtrain, ytrain)

#building a dataframe from cross-validation data
df_cv_scores=pd.DataFrame(gs.cv_results_).sort_values(by='rank_test_score')

#selecting specific columns to create a view
df_cv_scores[['params','split0_test_score', 'split1_test_score', 'split2_test_score',
       'split3_test_score', 'split4_test_score', 'mean_test_score',
       'std_test_score', 'rank_test_score']].head()

#checking the selected permutation of parameters
gs.best_params_

'''
Finally, we can predict target values for the test set by passing its feature matrix to gs.
The predicted values can be compared with actual target values to visualize and communicate performance of the model.
'''
#checking how well the model does on the holdout-set
gs.score(Xtest,ytest)

#plotting predicted churn/active vs actual churn/active
y_preds=gs.predict(Xtest)
plt.scatter(ytest,y_preds)

°如果伤别离去 2025-02-12 01:46:00

Would selecting a ridge regression classifier suffice for this? Or do I need to select a logistic regression classifier and add some parameter for the ridge penalty (i.e. LogisticRegression(apply_penality=Ridge))?

So the choice between ridge regression and logistic regression here comes down to whether you are trying to do regression or classification. If you want to predict the quantity of churn on some continuous basis, use ridge; if you want to predict whether someone churned or is likely to churn, use logistic regression.
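To make that distinction concrete, here is a minimal sketch. It runs on synthetic data from make_classification, which is only a stand-in for the churn table and not the asker's df_dummy. It fits both estimators on the same binary target and shows what kind of output each one returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Ridge

# Synthetic stand-in for the churn data (assumption; not the asker's df_dummy).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Ridge is a regressor: its predictions are unbounded real numbers,
# even when the target it was fit on only contains 0 and 1.
ridge_preds = Ridge(alpha=1.0).fit(X, y).predict(X)

# LogisticRegression is a classifier: it predicts class labels and
# exposes class probabilities through predict_proba.
logit = LogisticRegression().fit(X, y)
label_preds = logit.predict(X)
churn_proba = logit.predict_proba(X)[:, 1]

print(ridge_preds[:3])  # continuous values, not restricted to 0/1
print(label_preds[:3])  # class labels drawn from {0, 1}
print(churn_proba[:3])  # probabilities between 0 and 1
```

scikit-learn also ships a RidgeClassifier, which turns the ridge fit into a classifier by thresholding, but it does not give you probabilities the way logistic regression does.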

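Sklearn's LogisticRegression uses L2 regularization by default, which is equivalent to the regularization used by ridge regression. Its strength is set through C, the inverse of the regularization strength, so a smaller C means a stronger penalty. So you should be fine using that if it's the regularization that you want :)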
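As a rough sketch of what that could look like in the asker's setup, the pipeline below swaps Ridge for LogisticRegression and grid-searches over C instead of alpha. The data is synthetic (an assumption, generated with make_classification), and the C values are arbitrary illustrative choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with some redundant (collinear) columns; not the real churn data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=3, random_state=0
)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20, random_state=0)

# The L2 penalty is LogisticRegression's default, so no extra argument is needed.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# C is the inverse of the regularization strength: smaller C = stronger penalty.
params = {"logisticregression__C": [0.01, 0.1, 1, 10]}

gs = GridSearchCV(pipe, params, cv=5)
gs.fit(Xtrain, ytrain)

print(gs.best_params_)
print(gs.score(Xtest, ytest))  # mean accuracy, since the final step is a classifier
```

Note that with a classifier as the last step, gs.score reports accuracy rather than the R^2 you get from Ridge, so the number means something for a churn/no-churn label.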

I'm trying to determine feature importance and through some research, it seems like I need to use this.

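In general you can access the steps of a pipeline through the named_steps attribute. make_pipeline names each step after its lowercased class name, and because you fit the pipeline through GridSearchCV, the fitted copy lives in gs.best_estimator_. So in your case, if you wanted to access SelectKBest you could do:

gs.best_estimator_.named_steps["selectkbest"].get_support()

That gives you a boolean mask over the input columns; indexing Xtrain.columns with it gets you the names of the selected features. Now you still need the values. Here you have to access your model's learned coefficients. The step is named "ridge" or "logisticregression" depending on the estimator, so for logistic regression it should be something like:

gs.best_estimator_.named_steps["logisticregression"].coef_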
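Putting those two pieces together, one possible way to line up the selected feature names with their coefficients is sketched below. Several things here are assumptions for illustration: the data is synthetic, the feat_0 ... feat_7 column names are made up, and f_classif is used as the scoring function instead of the f_regression in the question, because the target is a class label.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features; column names are hypothetical.
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=2, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

pipe = make_pipeline(
    StandardScaler(), SelectKBest(f_classif), LogisticRegression(max_iter=1000)
)
gs = GridSearchCV(pipe, {"selectkbest__k": [2, 4, 6]}, cv=5)
gs.fit(X, y)

# The fitted pipeline is the refit best estimator, not the original `pipe`.
best = gs.best_estimator_

# Boolean mask of the columns SelectKBest kept, mapped back to column names.
mask = best.named_steps["selectkbest"].get_support()
selected = X.columns[mask]

# One coefficient per selected feature (binary target -> a single row).
coefs = best.named_steps["logisticregression"].coef_.ravel()

# Rank by absolute size; comparable because the inputs were standardized.
importance = pd.Series(coefs, index=selected).sort_values(key=abs, ascending=False)
print(importance)
```

Because StandardScaler runs first, the coefficients are on a common scale, which is what makes comparing their magnitudes reasonable. With strongly collinear features the individual coefficients can still be unstable, so treat the ranking as a rough guide.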

I have a blog post about this if you want a more detailed tutorial here
