GridSearchCV takes up to 19 times longer to fit than the base classifier, even with only one possible parameter set

Posted 2025-02-04 17:32:04

I have written a simple benchmark that shows that using the GridSearchCV fit function in scikit-learn with the base classifier as LogisticRegression and only one set of possible hyperparameters takes at least 8 times and up to 19 times longer than just using the fit function of the base classifier. Any idea why this big difference is happening? Here's the code:

import time

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, ShuffleSplit

iris = load_iris()
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0, n_jobs=36)
distributions = dict(C=[1], penalty=['l1'])

ls = [next(ShuffleSplit(n_splits=1, test_size=.25, random_state=0).split(iris.data))]
train_X = iris.data[ls[0][0]]
train_Y = iris.target[ls[0][0]]

n = 20
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = GridSearchCV(logistic, distributions, n_jobs=36, cv=ls)
    search = clf.fit(iris.data, iris.target)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = logistic
    search = clf.fit(train_X, train_Y)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

The average fit times for clf = GridSearchCV and clf = logistic are 0.324 s and 0.017 s, respectively (GridSearchCV is 19 times slower). Note that for GridSearchCV there is only one possible set of hyperparameters (C=1, penalty='l1'), so it has to fit a single classifier rather than several, and there is no real cross-validation (only one train/test split is passed to it), yet it takes much more time!

If I make iris.data and iris.target 100 times larger:

iris.data = np.repeat(iris.data, 100, axis=0)
iris.target = np.repeat(iris.target, 100, axis=0)

I get these results: 0.528 and 0.064 (8 times slower).
With 1000 times larger iris.data and iris.target: 2.70 and 0.34 (8 times slower).

I also tested RandomizedSearchCV with the original iris.data and iris.target:

clf = RandomizedSearchCV(logistic, distributions, random_state=2, n_jobs=36, cv=ls)

and got these results: 0.337 and 0.013 (26 times slower).


朱染 2025-02-11 17:32:04

I see that you are requesting 36 jobs for the logistic regression itself and then, on top of that, parallelizing GridSearchCV with n_jobs=36. This parallelizes twice and might slow down your processes.

For logistic regression, n_jobs only has an effect when the problem is multiclass, according to the documentation:

n_jobs int, default=None
Number of CPU cores used when parallelizing over classes if multi_class='ovr'.

If you have only 2 or 3 classes, as in iris, it makes little difference, so you can do:

logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,random_state=0, n_jobs=None)

GridSearchCV does more than fit the model: it also scores the model on the held-out split and summarizes the scores across splits. By default it then refits the best model on the full dataset, so you need to set refit=False to skip that extra fit. Also, in this example, setting n_jobs=None ensures it doesn't run parallel processes. For a fair comparison, we can also build the test split so that the plain classifier performs the scoring as well:

test_X = iris.data[ls[0][1]]
test_Y = iris.target[ls[0][1]]

For example if I do :

n = 20
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = GridSearchCV(logistic, distributions, n_jobs=None, cv=ls, refit = False)
    search = clf.fit(iris.data, iris.target)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = logistic
    search = clf.fit(train_X, train_Y)
    score_test = clf.score(test_X,test_Y)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

I get:

avg = 0.0036052584648132322
avg = 0.0019430875778198241

So the difference is not so huge, once you account for the other calculations made by GridSearchCV.
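To make those "other calculations" concrete, here is a rough hand-written sketch of the work a one-candidate, one-split search performs: clone the estimator, fit on the training split, score on the held-out split, and (with refit=True) fit once more on all the data. It is only an approximation of what GridSearchCV does internally, not its actual implementation. The names X, y, base, candidate and final_model are illustrative, and the grid is reduced to C=[1] (dropping penalty) to keep the sketch short; both are assumptions, not part of the original benchmark.

```python
import time

from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit

X, y = load_iris(return_X_y=True)
train_idx, test_idx = next(
    ShuffleSplit(n_splits=1, test_size=0.25, random_state=0).split(X)
)
# Illustrative estimator; no n_jobs so nothing runs in parallel.
base = LogisticRegression(solver="saga", tol=1e-2, max_iter=200, random_state=0)

# Hand-written version of the steps, timed as a whole.
t0 = time.perf_counter()
candidate = clone(base).set_params(C=1)                    # fresh copy with the candidate's parameters
candidate.fit(X[train_idx], y[train_idx])                  # fit on the training split
manual_score = candidate.score(X[test_idx], y[test_idx])   # score on the held-out split
final_model = clone(base).set_params(C=1).fit(X, y)        # the refit=True step: fit on all data
manual_seconds = time.perf_counter() - t0

# The same work delegated to GridSearchCV (sequential, with refit).
t0 = time.perf_counter()
search = GridSearchCV(base, {"C": [1]}, cv=[(train_idx, test_idx)], refit=True)
search.fit(X, y)
search_seconds = time.perf_counter() - t0

print(f"manual: {manual_seconds:.4f}s, score={manual_score:.3f}")
print(f"search: {search_seconds:.4f}s, score={search.best_score_:.3f}")
```

The two timings are a closer like-for-like comparison than a bare fit() against GridSearchCV, because both sides now include one fit on the split, one scoring pass, and one fit on the full data.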

In your case, I believe the GridSearchCV slowdown is caused by setting n_jobs=36 in both LogisticRegression and GridSearchCV. Most likely you only want to set it once, in GridSearchCV, for the case where you are testing many hyperparameters.
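If you want to check where the time goes in your own run, a fitted search object records its own timings. The sketch below reads the per-candidate fit and score times from cv_results_ and the refit time from refit_time_ (available only when refit is not False). As before, the variable names and the reduced C=[1] grid are illustrative assumptions; whatever remains of the wall-clock time after these three parts is overhead such as cloning, bookkeeping, and, with n_jobs set, dispatching to workers.

```python
import time

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit

X, y = load_iris(return_X_y=True)
split = next(ShuffleSplit(n_splits=1, test_size=0.25, random_state=0).split(X))
base = LogisticRegression(solver="saga", tol=1e-2, max_iter=200, random_state=0)

t0 = time.perf_counter()
search = GridSearchCV(base, {"C": [1]}, cv=[split], n_jobs=None, refit=True)
search.fit(X, y)
wall_seconds = time.perf_counter() - t0

# Timings recorded by the search itself (index 0 = the single candidate).
fit_seconds = search.cv_results_["mean_fit_time"][0]
score_seconds = search.cv_results_["mean_score_time"][0]
refit_seconds = search.refit_time_

print(f"wall clock : {wall_seconds:.4f}s")
print(f"fit (split): {fit_seconds:.4f}s")
print(f"score      : {score_seconds:.4f}s")
print(f"refit      : {refit_seconds:.4f}s")
```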
