Gradient descent algorithm won't converge
I'm trying to write out a bit of code for the gradient descent algorithm explained in the Stanford Machine Learning lecture (lecture 2, at around 25:00). Below is the implementation I used at first; I think it's properly copied over from the lecture, but it doesn't converge when I add large numbers (>8) to the training set.

I'm inputting a number X, and the point (X,X) is added to the training set, so at the moment I'm only trying to get it to converge to y=ax+b, where a=1=theta[1] and b=0=theta[0]. The training set is the arrays x and y, where (x[i],y[i]) is a point.
void train()
{
    // One stochastic-gradient pass: update theta after every point.
    for (int i = 0; i < x.size(); i++)
    {
        double delta = y[i] - hypothesis(x[i]);
        theta[1] += alpha * delta * x[i];
        theta[0] += alpha * delta;  // gradient w.r.t. the intercept term is 1
    }
}
void C_Approx::display()
{
    std::cout << theta[1] << "x + " << theta[0] << " \t "
              << "f(x)=" << hypothesis(1) << std::endl;
}
Some of the results I'm getting. I input a number, it runs train() a few times, then display():
1
0.33616x + 0.33616 f(x)=0.67232
1
0.482408x + 0.482408 f(x)=0.964816
1
0.499381x + 0.499381 f(x)=0.998762
1
0.499993x + 0.499993 f(x)=0.999986
1
0.5x + 0.5 f(x)=1
An example of it diverging after it passed 8:
1
0.33616x + 0.33616 f(x)=0.67232
2
0.705508x + 0.509914 f(x)=1.21542
3
0.850024x + 0.449928 f(x)=1.29995
4
0.936062x + 0.330346 f(x)=1.26641
5
0.951346x + 0.231295 f(x)=1.18264
6
0.992876x + 0.137739 f(x)=1.13062
7
0.932206x + 0.127372 f(x)=1.05958
8
1.00077x + 0.000493063 f(x)=1.00126
9
-0.689325x + -0.0714712 f(x)=-0.760797
10
4.10321e+08x + 4.365e+07 f(x)=4.53971e+08
11
1.79968e+22x + 1.61125e+21 f(x)=1.9608e+22
12
-3.9452e+41x + -3.26957e+40 f(x)=-4.27216e+41
I tried the solution proposed here of scaling the step and ended up with similar results.
What am I doing wrong?
Your implementation is good. Generally, stochastic gradient descent may diverge when α is too large. With a large dataset, what you would do is take a reasonably sized random sample, find the α that gives you the best results, and then use it for the rest.
I have experienced the same problem (albeit in Java) because my learning rate was too big. In short, I was using α = 0.001 and had to push it down to 0.000001 to see actual convergence. Of course, these values are tied to your dataset.
When your cost function increases or cycles up and down, you usually have too large a value for alpha. What alpha are you using? Start out with alpha = 0.001 and see if it converges. If not, try various alphas (0.003, 0.01, 0.03, 0.1, 0.3, 1) and find one that converges quickly.

Scaling the data (normalization) won't help you with only 1 feature (your theta[1]), as normalization only applies to 2+ features (multivariate linear regression).

Also bear in mind that for a small number of features you can use the normal equation to get the correct answer.
Use backtracking line search to guarantee convergence. It is very simple to implement. See Stephen Boyd, Convex Optimization, for reference. You can choose some standard alpha and beta values for backtracking line search, for example 0.3 and 0.8.
It's not clear from your description what problem you're solving. Also, it's very dangerous to post links to external resources - you can be blocked on Stack Overflow.

In any case, the gradient descent method (and subgradient descent too) with a fixed step size (the ML community calls it a learning rate) does not necessarily converge.

p.s. The machine learning community is not interested in "convergence conditions" and "convergence to what" - they are interested in creating "something" that passes cross-validation with good results.

If you're curious about optimization, start looking into convex optimization. Unfortunately, it's hard to find a job in it, but it gives a clean view into what happens in various mathematical optimization problems.

Here is source code which demonstrates it for a simple quadratic objective:
If I understand you correctly, your training set only has a non-zero gradient at the edge of a line? Unless you start at the line (actually start exactly at one of your training points) you won't find the line. You are always at a local minimum.