GridSearchCV for a multi-output RandomForestRegressor

Published 2025-01-23 03:03:10


I have created a multi-output RandomForestRegressor using sklearn.ensemble.RandomForestRegressor. I now want to perform a GridSearchCV to find good hyperparameters and output the r^2 score for each individual target feature. The code I use looks as follows:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
param_grid = {
    'model__bootstrap': [True],
    'model__max_depth': [8, 10, 12],
    'model__max_features': [3, 4, 5],
    'model__min_samples_leaf': [3, 4, 5],
    'model__min_samples_split': [3, 5, 7],
    'model__n_estimators': [100, 200, 300]
}
model = RandomForestRegressor()
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', model)])

scorer = make_scorer(r2_score, multioutput='raw_values')
search = GridSearchCV(pipe, param_grid, scoring=scorer)
search.fit(X_train, y_train)
# ship_type and target are defined elsewhere in my script
print(f'Best parameter score {ship_type} {target}: {search.best_score_}')

When running this code, I get the following error:

  File "run_xgb_rf_regressor.py", line 75, in <module>
    model, X = run_regression(ship_types[2], targets)
  File "run_xgb_rf_regressor.py", line 50, in run_regression
    search.fit(X_train, y_train)
  File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py", line 841, in fit
    self._run_search(evaluate_candidates)
  File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py", line 1296, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py", line 795, in evaluate_candidates
    out = parallel(delayed(_fit_and_score)(clone(base_estimator),
  File "/home/lucas/.local/lib/python3.8/site-packages/joblib/parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/lucas/.local/lib/python3.8/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/lucas/.local/lib/python3.8/site-packages/joblib/parallel.py", line 779, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/lucas/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/lucas/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/home/lucas/.local/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/lucas/.local/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 625, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 721, in _score
    raise ValueError(error_msg % (scores, type(scores), scorer))
ValueError: scoring must return a number, got [0.57359176 0.54407165 0.40313057 0.32515033 0.346224   0.39513717
 0.34375699] (<class 'numpy.ndarray'>) instead. (scorer=make_scorer(r2_score, multioutput=raw_values))

Clearly the error suggests that I can only use a single numeric value, which in my case would be the average r^2 score over all target features. Does anybody know how I can use GridSearchCV so that I can output the individual r^2 scores?
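The mismatch is easy to reproduce in isolation. Here is a minimal sketch (toy arrays of my own choosing) comparing the two multioutput modes of r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy multi-output data: 3 samples, 2 target columns
y_true = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
y_pred = np.array([[1.1, 2.1], [1.9, 3.8], [3.2, 6.3]])

# 'raw_values' yields one score per target column -> an array
raw = r2_score(y_true, y_pred, multioutput='raw_values')
# 'uniform_average' (the default) collapses them into a single number,
# which is the only shape GridSearchCV accepts from a plain scorer
avg = r2_score(y_true, y_pred, multioutput='uniform_average')

print(raw)
print(avg)
```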

Many thanks in advance.


1 Comment

挽袖吟 2025-01-30 03:03:10


I think I would use the following option for the scoring parameter (from the docs):

a callable returning a dictionary where the keys are the metric names and the values are the metric scores;

So something like

from sklearn.metrics import r2_score

def my_scorer(estimator, X, y):
    # Score each target column separately and report each as its own metric
    preds = estimator.predict(X)
    scores = r2_score(y, preds, multioutput='raw_values')
    return {f'r2_y{i}': score for i, score in enumerate(scores)}

Note, though, that the docs say refit needs to be set more carefully with multimetric searches. Maybe deciding the "best" parameters should be done by some average, in which case you can add another entry to the custom scorer.
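Putting that together, here is a minimal runnable sketch. The synthetic data from make_regression, the reduced parameter grid, and the r2_mean key are my own choices for illustration, not from the question; the point is that the scorer reports one metric per target while refit selects the best candidate by the added average:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def my_scorer(estimator, X, y):
    preds = estimator.predict(X)
    scores = r2_score(y, preds, multioutput='raw_values')
    out = {f'r2_y{i}': s for i, s in enumerate(scores)}
    out['r2_mean'] = float(np.mean(scores))  # aggregate entry used for refit
    return out

# Synthetic multi-output regression problem (3 targets)
X, y = make_regression(n_samples=200, n_features=6, n_targets=3,
                       noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(random_state=42))])
param_grid = {
    'model__max_depth': [8, 12],
    'model__n_estimators': [50, 100]
}

# refit must name one of the scorer's keys when scoring is multimetric
search = GridSearchCV(pipe, param_grid, scoring=my_scorer,
                      refit='r2_mean', cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
# cv_results_ now holds one mean_test_* column per metric key
for key in sorted(search.cv_results_):
    if key.startswith('mean_test_'):
        print(key, search.cv_results_[key][search.best_index_])
```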

Other useful parts of the User Guide:
https://scikit-learn.org/stable/modules/grid_search.html#multimetric-grid-search
https://scikit-learn.org/stable/modules/model_evaluation.html#implementing-your-own-scoring-object
