Gradient descent algorithm won't converge
I'm trying to write out a bit of code for the gradient descent algorithm explained in the Stanford Machine Learning lecture (lecture 2, at around 25:00). Below is the implementation I used at first; I think it's properly copied over from the lecture, but it doesn't converge when I add large numbers (>8) to the training set.

I'm inputting a number X, and the point (X,X) is added to the training set, so at the moment I'm only trying to get it to converge to y=ax+b, where a=1=theta[1] and b=0=theta[0]. The training set is the arrays x and y, where (x[i],y[i]) is a point.
void train()
{
    // One stochastic-gradient pass: update theta after every point.
    for (int i = 0; i < x.size(); i++)
    {
        double delta = y[i] - hypothesis(x[i]);
        theta[1] += alpha * delta * x[i];
        theta[0] += alpha * delta;  // gradient w.r.t. the intercept term is 1
    }
}
void C_Approx::display()
{
    std::cout << theta[1] << "x + " << theta[0] << " \t "
              << "f(x)=" << hypothesis(1) << std::endl;
}
Some of the results I'm getting. I input a number, it runs train() a few times, then display():
1
0.33616x + 0.33616 f(x)=0.67232
1
0.482408x + 0.482408 f(x)=0.964816
1
0.499381x + 0.499381 f(x)=0.998762
1
0.499993x + 0.499993 f(x)=0.999986
1
0.5x + 0.5 f(x)=1
An example of it diverging after it passed 8:
1
0.33616x + 0.33616 f(x)=0.67232
2
0.705508x + 0.509914 f(x)=1.21542
3
0.850024x + 0.449928 f(x)=1.29995
4
0.936062x + 0.330346 f(x)=1.26641
5
0.951346x + 0.231295 f(x)=1.18264
6
0.992876x + 0.137739 f(x)=1.13062
7
0.932206x + 0.127372 f(x)=1.05958
8
1.00077x + 0.000493063 f(x)=1.00126
9
-0.689325x + -0.0714712 f(x)=-0.760797
10
4.10321e+08x + 4.365e+07 f(x)=4.53971e+08
11
1.79968e+22x + 1.61125e+21 f(x)=1.9608e+22
12
-3.9452e+41x + -3.26957e+40 f(x)=-4.27216e+41
I tried the solution proposed here of scaling the step and ended up with similar results.
What am I doing wrong?
Your implementation is good. Generally, stochastic gradient descent may diverge when α is too large. With a large dataset, what you would do is take a reasonably sized random sample, find the α that gives you the best results, and then use it for the rest.
I have experienced the same problem (albeit in Java) because my learning rate was too big. In short, I was using α = 0.001 and had to push it down to 0.000001 to see actual convergence. Of course, these values are tied to your dataset.
When your cost function increases or cycles up and down, you usually have too large a value for alpha. What alpha are you using? Start out with alpha = 0.001 and see if it converges. If not, try various alphas (0.003, 0.01, 0.03, 0.1, 0.3, 1) and find one that converges quickly.

Scaling the data (normalization) won't help you with only 1 feature (your theta[1]), as normalization only applies to 2+ features (multivariate linear regression).

Also bear in mind that for a small number of features you can use the normal equation to get the correct answer.
Use backtracking line search to guarantee convergence. It is very simple to implement. See Stephen Boyd, Convex Optimization, for reference. You can choose some standard alpha and beta values for backtracking line search, for example 0.3 and 0.8.
It's not clear from your description what problem you're solving. Also, it's very dangerous to post links to external resources - you can be blocked on Stack Overflow.

In any case, the gradient descent method (and subgradient descent too) with a fixed step size (the ML community calls it a learning rate) does not necessarily converge.

p.s. The machine learning community is not interested in "convergence conditions" and "convergence to what" - they are interested in creating "something" that passes cross-validation with good results.

If you're curious about optimization, start looking into convex optimization. Unfortunately, it's hard to find a job in it, but it gives a clean view into what happens in various mathematical optimization problems.

Here is source code which demonstrates it for a simple quadratic objective:
If I understand you correctly, your training set only has a non-zero gradient at the edge of a line? Unless you start at the line (actually start exactly at one of your training points) you won't find the line. You are always at a local minimum.