Sklearn pipeline - trying to count the number of times an estimator is called

Posted on 2025-02-02 12:24:31


I'm trying to count the number of times LogisticRegression is called in this pipeline, so I extended the class and overrode .fit(). It was supposed to be simple but it generates this weird error:

TypeError: float() argument must be a string or a number, not 'MyLogistic'

where MyLogistic is the new class. You should be able to reproduce the whole thing if you copy and paste the code.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold)
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np

class MyLogistic(LogisticRegression):
    __call_counter = 0
    def fit(X, y, sample_weight=None):
        print("MyLogistic fit is called.")
        MyLogistic._MyLogistic__call_counter += 1
        # fit() returns self.
        return super().fit(X, y, sample_weight)

# If I use this "extension", everything works fine.
#class MyLogistic(LogisticRegression):
#    pass
    
initial_logistic = MyLogistic(solver="liblinear", random_state = np.random.RandomState(18))
final_logistic = LogisticRegression(solver="liblinear", random_state = np.random.RandomState(20))
# prefit = False by default
select_best = SelectFromModel(estimator = initial_logistic, threshold = -np.inf)

select_k_best_pipeline = Pipeline(steps=[
    ('first_scaler', StandardScaler(with_mean = False)),
    # initial_logistic will be called from select_best, prefit = false by default.
    ('select_k_best', select_best),
    ('final_logit', final_logistic)
])

select_best_grid = {'select_k_best__estimator__C' : [0.02, 0.03],
                    'select_k_best__max_features': [1, 2],
                    'final_logit__C' : [0.01, 0.5, 1.0]}

skf = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 17)

logit_best_searcher = GridSearchCV(estimator = select_k_best_pipeline, param_grid = select_best_grid, cv = skf, 
                               scoring = "roc_auc", n_jobs = 6, verbose = 4)

X, y = load_iris(return_X_y=True)
logit_best_searcher.fit(X, y > 0)
print("Best hyperparams: ", logit_best_searcher.best_params_)


Comments (1)

嗫嚅 2025-02-09 12:24:31


You just forgot to put self as the first parameter of the fit signature. So the call binds X=self, and when scikit-learn validates the input X it at some point tries to convert it to float, hence the error message.
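To make the binding issue concrete, here is a minimal sketch in plain Python, without scikit-learn. The classes Base, Broken and Fixed are hypothetical stand-ins (not from the original post) for LogisticRegression and the two variants of MyLogistic; they only illustrate how the instance ends up in X when self is missing.

```python
class Base:
    """Hypothetical stand-in for LogisticRegression: fit() returns self."""

    def fit(self, X, y, sample_weight=None):
        return self


class Broken(Base):
    # No `self`: Python binds the instance to the first parameter, X,
    # and every real argument shifts one position to the right.
    def fit(X, y, sample_weight=None):
        return {"X": X, "y": y, "sample_weight": sample_weight}


class Fixed(Base):
    call_counter = 0  # class-level counter shared by all instances

    def fit(self, X, y, sample_weight=None):
        Fixed.call_counter += 1
        # fit() returns self, as in the original snippet.
        return super().fit(X, y, sample_weight=sample_weight)


features = [[1.0], [2.0]]
labels = [0, 1]

broken = Broken()
bound = broken.fit(features, labels)
# bound["X"] is the estimator itself, bound["y"] holds the features,
# and bound["sample_weight"] holds the labels.

fixed = Fixed()
result = fixed.fit(features, labels)
```

Applied to the question's code, the same change would be to declare fit(self, X, y, sample_weight=None) in MyLogistic; the rest of the class can stay as it is.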

There's still some weirdness around the parallelization: with n_jobs=6 I get the counter equal to 1. Setting n_jobs=1 instead, I get the correct 37 for the counter (2x2x3 = 12 hyperparameter candidates x 3 folds = 36 fits, +1 for the final refit).
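The expected count of 37 can be checked with a short sketch that enumerates the grid from the question using only the standard library. As for the counter of 1 with n_jobs=6, a likely explanation (an assumption, not verified in this thread) is that the cross-validation fits run in separate worker processes, so their increments of the class attribute never reach the parent process, which only sees the final refit.

```python
from itertools import product

# Parameter grid copied from the question.
select_best_grid = {
    "select_k_best__estimator__C": [0.02, 0.03],
    "select_k_best__max_features": [1, 2],
    "final_logit__C": [0.01, 0.5, 1.0],
}
n_splits = 3  # StratifiedKFold(n_splits=3) in the question

# One candidate per combination of parameter values.
candidates = list(product(*select_best_grid.values()))

# Each candidate is fitted once per fold; the best one is refitted on all data.
cv_fits = len(candidates) * n_splits
expected_fits = cv_fits + 1

print(len(candidates), cv_fits, expected_fits)
```

Each pipeline fit triggers one fit of the estimator inside SelectFromModel, which is why the pipeline fit count and the MyLogistic counter coincide here.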
