How do I get the best parameters from GridSearchCV for each DataFrame in a dictionary of DataFrames?

Posted on 2025-02-06 21:16:39

One, I have a dictionary of dataframes, dfs, with five different dataframes in it.

Two, I am using a scikit-learn regressor, RandomForestRegressor, with the following parameters:

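import numpy as np
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(random_state=None)

num_estimators = list(np.linspace(10, 100, num=5, endpoint=True).astype(int))
# "auto" was removed in scikit-learn 1.3; for regressors it meant "use all features", i.e. 1.0
max_features = [1.0, "sqrt", "log2"]
min_samples_split = [2, 4, 8]

# keys are prefixed with the pipeline step name ("reg"); no trailing comma after
# the dict, which would turn params into a one-element tuple
params = {'reg__n_estimators': num_estimators,
          'reg__max_features': max_features,
          'reg__min_samples_split': min_samples_split,
          'reg__bootstrap': [False]}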

Three, the elements of my pipeline are as follows:

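# IterativeImputer is experimental and must be enabled before it is imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# numeric columns to use
num_columns = list(subset_features[2:])

# pipeline for processing numerical features
num_transformer = Pipeline([('impute', IterativeImputer()),
                            ('scale', StandardScaler())])

column_transformer = ColumnTransformer([('num_pipeline', num_transformer, num_columns)])

# the pipeline
pipe = Pipeline(steps=[("ct", column_transformer), ("reg", regressor)])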
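Grid keys for a Pipeline follow the pattern `<step name>__<parameter>`, so they have to use the name given to the step in `steps` (here "reg"), not the variable name of the estimator. The sketch below uses a hypothetical two-step pipeline (not the one from the post) to show one way of checking a grid against the names the pipeline actually exposes.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pipeline for illustration; only the step name "reg" mirrors the post.
pipe = Pipeline(steps=[("scale", StandardScaler()),
                       ("reg", RandomForestRegressor())])

# Every tunable name, including nested ones such as "reg__n_estimators".
valid_names = set(pipe.get_params().keys())

param_grid = {"reg__n_estimators": [10, 50],
              "reg__min_samples_split": [2, 4]}

# Any key listed here would make GridSearchCV raise an invalid-parameter error.
unknown = [name for name in param_grid if name not in valid_names]
print("unknown keys:", unknown)
print("regressor__n_estimators" in valid_names)
```

Running a check like this before the search tends to surface naming mistakes immediately instead of after the first fit attempt.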

Finally, the grid search and fit are as follows:

from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(estimator=pipe,
                  param_grid=params,
                  cv=5,
                  n_jobs=-1,
                  verbose=1,
                  scoring=scorer,  # a user-defined scoring function
                  refit=True)

# run the gs for each dataframe
gs_results = {}
for idx, df in enumerate(dfs.values()):
    print('starting id:', idx)
    gs_results[idx] = gs.fit(df)

After running the above, my attempts to get the best parameters for each dataframe with gs.best_params_
retrieve only one set of best parameters, shown below.

Best params: {'bootstrap': False, 'max_features': 'log2', 'min_samples_split': 4, 'n_estimators': 10}

What I want is to get five best parameter estimates, one for each dataframe.
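One likely reason for seeing a single result is that `fit` returns the estimator itself, so every dictionary entry in the loop points at the same `GridSearchCV` object, and its attributes reflect only the most recent fit. The following minimal sketch uses synthetic data and a bare RandomForestRegressor (both assumptions, not the data or pipeline from the post) to show this, along with copying `best_params_` inside the loop.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Two small synthetic datasets standing in for the dataframes.
datasets = {name: (rng.normal(size=(40, 3)), rng.normal(size=40))
            for name in ["a", "b"]}

gs = GridSearchCV(RandomForestRegressor(n_estimators=5, random_state=0),
                  param_grid={"min_samples_split": [2, 4]},
                  cv=3)

fitted = {}
best_params = {}
for name, (X, y) in datasets.items():
    fitted[name] = gs.fit(X, y)                # fit returns gs itself
    best_params[name] = dict(gs.best_params_)  # snapshot taken before the next fit

same_object = fitted["a"] is fitted["b"]
print(same_object)  # True
print(best_params)
```

Storing the snapshot per iteration keeps one result per dataset even though a single search object is reused.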

Comments (1)

将军与妓 2025-02-13 21:16:39

After posting this question, I think I came up with a solution that does what I wanted. I simply had to wrap the code in a function that returns a tuple of the values I wanted to see, and then call that function for each dataframe. Below is my updated code.

def GridSearchCVRF(x, y):

    # the estimator and parameters
    regressor = RandomForestRegressor(random_state=None)

    num_estimators = list(np.linspace(10, 100, num=5, endpoint=True).astype(int))
    # "auto" was removed in scikit-learn 1.3; for regressors it was equivalent to 1.0
    max_features = [1.0, "sqrt", "log2"]
    min_samples_split = [2, 4, 8]

    # keys are prefixed with the pipeline step name ("reg")
    params = {'reg__n_estimators': num_estimators,
              'reg__max_features': max_features,
              'reg__min_samples_split': min_samples_split,
              'reg__bootstrap': [False]}

    # numeric columns to use
    num_columns = list(subset_features[2:])

    # pipeline for processing numerical features
    num_transformer = Pipeline([('impute', IterativeImputer()),
                                ('scale', StandardScaler())])
    column_transformer = ColumnTransformer([('num_pipeline', num_transformer, num_columns)])

    # the pipeline
    pipe = Pipeline(steps=[("ct", column_transformer), ("reg", regressor)])

    # the grid search; a new GridSearchCV object is created on every call
    gs = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      n_jobs=-1,
                      verbose=1,
                      scoring=scorer,  # a user-defined scoring function
                      refit=True)

    gs.fit(x, y)
    # fitted attributes carry a trailing underscore
    return gs.best_score_, gs.best_params_

I then ran the following loop over the function and got the best score and best parameters for each dataframe used in the random forest model.

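# fit the model function
best_scores = {}
best_params = {}
for i, df in enumerate(dfs.values()):
    print('starting i...:', i)
    # target_col is a placeholder for the name of the response column,
    # which the original post does not show
    x, y = df.drop(columns=[target_col]), df[target_col]
    best_scores[i], best_params[i] = GridSearchCVRF(x, y)
    print('best scores', best_scores[i])
    print('best params', best_params[i])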
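An alternative to rebuilding everything inside a function is to define the search once and fit an independent copy per dataframe with `sklearn.base.clone`. The sketch below is only an illustration under stated assumptions: the dataframes are synthetic, the column names `f1`, `f2` and `target` are hypothetical, the imputation step is left out, and the default scorer is used in place of the custom one from the post.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_cols = ["f1", "f2"]  # hypothetical feature names
target_col = "target"        # hypothetical response column

def make_frame(n_rows=40):
    """Build a small synthetic dataframe with a noisy linear target."""
    frame = pd.DataFrame(rng.normal(size=(n_rows, 2)), columns=feature_cols)
    frame[target_col] = 2.0 * frame["f1"] + rng.normal(scale=0.1, size=n_rows)
    return frame

dfs = {"first": make_frame(), "second": make_frame()}

pipe = Pipeline(steps=[
    ("ct", ColumnTransformer([("scale", StandardScaler(), feature_cols)])),
    ("reg", RandomForestRegressor(n_estimators=5, random_state=0)),
])
template = GridSearchCV(pipe, param_grid={"reg__min_samples_split": [2, 4]}, cv=3)

# clone() returns an unfitted copy with the same settings, so each dataframe
# gets its own fitted search object.
searches = {name: clone(template).fit(df[feature_cols], df[target_col])
            for name, df in dfs.items()}

best_params = {name: s.best_params_ for name, s in searches.items()}
best_scores = {name: s.best_score_ for name, s in searches.items()}
print(best_params)
print(best_scores)
```

Keeping the fitted search objects also leaves `best_estimator_` and `cv_results_` available for each dataframe, which the tuple-returning function discards.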