GridSearchCV fit takes up to 19 times longer than the base classifier, even with only one possible set of parameters
I wrote a simple benchmark showing that calling GridSearchCV's fit in scikit-learn, with LogisticRegression as the base classifier and only one possible set of hyperparameters, takes at least 8 times and up to 19 times longer than calling the base classifier's fit directly. Any idea why this big difference occurs? Here's the code:
import time

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, ShuffleSplit

iris = load_iris()
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0, n_jobs=36)
distributions = dict(C=[1], penalty=['l1'])
# A single train/test split, passed to the search as its only CV fold
ls = [next(ShuffleSplit(n_splits=1, test_size=.25, random_state=0).split(iris.data))]
train_X = iris.data[ls[0][0]]
train_Y = iris.target[ls[0][0]]
n = 20

# Average time of GridSearchCV.fit over n runs
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = GridSearchCV(logistic, distributions, n_jobs=36, cv=ls)
    search = clf.fit(iris.data, iris.target)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

# Average time of the base classifier's fit on the training part only
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = logistic
    search = clf.fit(train_X, train_Y)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")
The averages for clf = GSCV and clf = logistic are 0.324 and 0.017 seconds, respectively (GSCV is 19 times slower). Note that for GSCV there is only one possible set of hyperparameters (C=1, penalty='l1'), so GSCV has to fit only one clf rather than several, and there is no real CV (only one split is given to it), yet it takes much more time!
If I make iris.data and iris.target 100 times larger:
iris.data = np.repeat(iris.data, 100, axis=0)
iris.target = np.repeat(iris.target, 100, axis=0)
I get these results: 0.528 and 0.064 (8 times slower).
With 1000 times larger iris.data and iris.target: 2.70 and 0.34 (8 times slower).
I also tested RandomizedSearchCV with the normal iris.data and iris.target:
clf = RandomizedSearchCV(logistic, distributions, random_state=2, n_jobs=36, cv=ls)
and got these results: 0.337 and 0.013 (26 times slower).
Comments (1)
I see that you are calling 36 threads for the logistic regression part and then, on top of this, trying to parallelize in GridSearchCV with n_jobs=36. This parallelizes twice and might slow down your processes. For logistic regression, n_jobs only has an effect if you have multiple classes, as stated in the documentation:
If you have 2 or 3 classes, like in iris, it doesn't quite matter, so you can do:
GridSearchCV does more than fit the model: it also calculates the model's score and summarizes the results across splits. You also need to set refit=False to ensure it doesn't refit the best model on the full dataset. Also, in this example, leaving n_jobs=None ensures it doesn't run parallel processes. We can also include a test dataset so that we perform the scoring as well. For example, if I do:
I get:
So the difference is not so huge, once you account for the other calculations made by GridSearchCV.
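A minimal sketch of this kind of like-for-like comparison is shown below. It is an illustration under stated assumptions, not the code that produced the numbers in this answer: n_jobs is left at its default everywhere, the search uses refit=False and one predefined split, and the baseline both fits and scores so that it does roughly the same work. The variable names (train_idx, test_idx, search_time, base_time) are made up for the example, and the timings will differ from machine to machine.

```python
import time

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit

X, y = load_iris(return_X_y=True)

# One fixed train/test split, reused by both the search and the baseline
splitter = ShuffleSplit(n_splits=1, test_size=.25, random_state=0)
train_idx, test_idx = next(splitter.split(X))

# Assumption: no n_jobs on the estimator, so there is no nested parallelism
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0)
param_grid = dict(C=[1], penalty=['l1'])

# Search: one candidate, one split, no refit, no worker processes
t0 = time.perf_counter()
search = GridSearchCV(logistic, param_grid, cv=[(train_idx, test_idx)],
                      refit=False, n_jobs=None)
search.fit(X, y)
search_time = time.perf_counter() - t0

# Baseline doing comparable work: fit on the train part, score on the test part
t0 = time.perf_counter()
logistic.fit(X[train_idx], y[train_idx])
logistic.score(X[test_idx], y[test_idx])
base_time = time.perf_counter() - t0

print(f"GridSearchCV: {search_time:.4f} s, fit + score: {base_time:.4f} s")

# cv_results_ records how long the search spent fitting and scoring
print("mean_fit_time:  ", search.cv_results_['mean_fit_time'][0])
print("mean_score_time:", search.cv_results_['mean_score_time'][0])
```

Reading mean_fit_time and mean_score_time from cv_results_ gives a rough idea of how much of the search's wall-clock time is the fit itself and how much is scoring and bookkeeping.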
In your case, I believe the GridSearchCV slowdown is caused by setting n_jobs=36 in both LogisticRegression and GridSearchCV. Most likely you only want to set it once, in GridSearchCV, when testing many hyperparameters.
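As a rough sketch of that last suggestion, the snippet below requests parallelism only at the search level and leaves the estimator without n_jobs. The grid values (C in 0.1, 1, 10) and n_jobs=2 are assumptions chosen for illustration; they do not come from the question, and a suitable worker count depends on the machine.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Estimator without n_jobs: each individual fit runs in a single worker
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0)

# Hypothetical grid with several candidates, so parallelizing the search
# across candidates and folds has something to distribute
param_grid = dict(C=[0.1, 1, 10], penalty=['l1'])

# Parallelism is requested once, on the search (assumed value: 2 workers)
search = GridSearchCV(logistic, param_grid, n_jobs=2)
search.fit(X, y)

print(search.best_params_)
```

With a single candidate and a single split, as in the question, there is nothing for the workers to share, so the cost of starting them can dominate the measurement.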