Scikit-Learn: how to use pipelines and how to do logistic + ridge regression
Two questions:
I'm trying to run a model that predicts churn. A lot of my features have multicollinearity issues. To address this problem I'm trying to penalize the coefficients with Ridge.
More specifically I'm trying to run a logistic regression but apply Ridge penalties (not sure if that makes sense) to the model...
Questions:
1. Would selecting a ridge regression classifier suffice for this? Or do I need to select a logistic regression classifier and pass it some parameter for the ridge penalty (i.e. something like LogisticRegression(apply_penality=Ridge))?
2. I'm trying to determine feature importance, and through some research it seems like I need to use this:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
However, I'm confused about how to access this function when my model has been built around the sklearn.pipeline.make_pipeline function.
I'm just trying to figure out which independent variables have the most importance in predicting my label.
Code below for reference
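import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#prep data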
X_prep = df_dummy.drop(columns='CHURN_FLAG')
#set predictor and target variables
X = X_prep #all features except churn_flag
y = df_dummy["CHURN_FLAG"]
#create train /test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)
'''
StandardScaler() -> Incoming data needs to be standardized before any other transformation is performed on it.
SelectKBest() -> This method comes from the feature_selection module of Scikit-learn. It selects the best features based on a specified scoring function (in this case, f_regression).
The number of features is specified by the value of parameter k. Even within the selected features, we want to vary the final set of features that are fed to the model, and find what performs best. We can do that with the GridSearchCV method.
Ridge() -> This is an estimator that performs the actual regression -- performed to reduce the effect of multicollinearity.
GridSearchCV -> Other than searching over all the permutations of the selected parameters, GridSearchCV performs cross-validation on training data.
'''
#Setting up a pipeline
pipe = make_pipeline(StandardScaler(), SelectKBest(f_regression), Ridge())
#A quick way to get a list of parameters that a pipeline can accept
#pipe.get_params().keys()
#putting together a parameter grid to search over using grid search
params = {
    'selectkbest__k': [1, 2, 3, 4, 5, 6],
    'ridge__fit_intercept': [True, False],
    'ridge__alpha': [0.01, 0.1, 1, 10],
    'ridge__solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
}
#setting up the grid search
gs = GridSearchCV(pipe, params, n_jobs=-1, cv=5)
#fitting gs to training data
gs.fit(Xtrain, ytrain)
#building a dataframe from cross-validation data
df_cv_scores = pd.DataFrame(gs.cv_results_).sort_values(by='rank_test_score')
#selecting specific columns to create a view
df_cv_scores[['params', 'split0_test_score', 'split1_test_score', 'split2_test_score',
              'split3_test_score', 'split4_test_score', 'mean_test_score',
              'std_test_score', 'rank_test_score']].head()
#checking the selected permutation of parameters
gs.best_params_
'''
Finally, we can predict target values for the test set by passing its feature matrix to gs.
The predicted values can be compared with actual target values to visualize and communicate performance of the model.
'''
#checking how well the model does on the holdout set
gs.score(Xtest, ytest)
#plotting predicted churn/active vs actual churn/active
y_preds = gs.predict(Xtest)
plt.scatter(ytest, y_preds)
Comments (1)
So the choice between ridge regression and logistic regression here comes down to whether you are trying to do classification or regression. If you want to predict the quantity of churn on some continuous basis, use ridge; if you want to predict whether someone churned, or how likely they are to churn, use logistic regression.
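To make the difference concrete, here is a minimal sketch on synthetic data. The dataset from make_classification is only a stand-in for the churn table and is an assumption, not your data. It shows that Ridge returns unbounded real numbers for a 0/1 label, while LogisticRegression returns class labels and probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Ridge

# Synthetic stand-in for the churn data (assumption: a binary 0/1 label).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
logit = LogisticRegression().fit(X, y)

ridge_preds = ridge.predict(X)        # real-valued scores, not class labels
logit_labels = logit.predict(X)       # class labels, 0 or 1
logit_probs = logit.predict_proba(X)  # per-class probabilities, rows sum to 1

print(ridge_preds[:3])
print(logit_labels[:3], logit_probs[:3, 1])
```

scikit-learn also ships sklearn.linear_model.RidgeClassifier, which fits a ridge regression on a recoded label and thresholds the result, but it does not give you probabilities via predict_proba.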
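Sklearn's LogisticRegression uses L2 regularization by default, which is the same kind of penalty that ridge regression uses. Its strength is controlled by C, the inverse of the regularization strength (smaller C means a stronger penalty). So you should be fine using that if it's the regularization that you want :)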
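As a rough sketch of what that could look like in a pipeline like yours: the version below swaps Ridge for LogisticRegression and f_regression for f_classif (both swaps are my assumption about what fits a binary label, not something from your code), and searches over C instead of alpha. The data is again synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20, random_state=0)

# LogisticRegression is left at its default penalty, which is L2.
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif),
    LogisticRegression(max_iter=1000),
)

# make_pipeline names each step after its lowercased class name.
params = {
    "selectkbest__k": [2, 4, 6],
    "logisticregression__C": [0.01, 0.1, 1, 10],  # smaller C = stronger penalty
}

gs = GridSearchCV(pipe, params, cv=5)
gs.fit(Xtrain, ytrain)

print(gs.best_params_)
print(gs.score(Xtest, ytest))  # mean accuracy on the holdout set
```

For a classifier, gs.score reports accuracy by default; if churn is rare you would probably want to pass a different scoring argument to GridSearchCV, such as "roc_auc".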
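In general you can access the elements of a pipeline through the named_steps attribute. So in your case, if you wanted to access SelectKBest, you could look it up there under the name make_pipeline gave it, selectkbest (the same prefix used in your parameter grid). Its get_support() mask tells you which features were kept, which gets you the feature names; now you still need the values. Here you have to access your model's learned coefficients. For ridge and logistic regression that should be the coef_ attribute of the final step.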
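Putting those two pieces together, a minimal sketch might look like this. The data and the column names feat_0 to feat_5 are made up for illustration, and k=3 is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical column names.
X_arr, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=3), LogisticRegression())
pipe.fit(X, y)

# Look up the fitted steps by the names make_pipeline assigned.
selector = pipe.named_steps["selectkbest"]
model = pipe.named_steps["logisticregression"]

# Boolean mask over the input columns -> names of the features that were kept.
selected = X.columns[selector.get_support()]

# coef_ has shape (1, n_selected) for a binary problem; flatten it and pair with names.
coefs = pd.Series(model.coef_.ravel(), index=selected)

# Order by absolute size, largest first.
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

If the pipeline sits inside a fitted GridSearchCV, the same lookups work on gs.best_estimator_.named_steps. Because the features are standardized first, the coefficient magnitudes are on a comparable scale, though with strongly correlated features the weight can still be split between them in ways that are hard to interpret.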
I have a blog post about this if you want a more detailed tutorial here