
Optimizers


Newton-Conjugate Gradient

First a note about the interpretations of Newton’s method in 1-D:

In the lecture on 1-D optimization, Newton’s method was presented as a method of finding zeros. That is what it is, but it may also be interpreted as a method of optimization. In the latter case, we are really looking for zeroes of the first derivative.

Let’s compare the formulas for clarification:

\[\begin{split}\begin{array}{|c|c|c|c|} \hline \text{Finding roots of } f & \text{Geometric Interpretation} & \text{Finding Extrema of } f & \text{Geometric Interpretation} \\ \hline x_{n+1} = x_n -\frac{f(x_n)}{f'(x_n)} &\text{Invert linear approximation to }f & x_{n+1} = x_n -\frac{f'(x_n)}{f''(x_n)}& \text{Use quadratic approximation of } f \\ \hline \end{array}\end{split}\]

These are two ways of looking at exactly the same problem. For instance, the linear approximation in the root finding problem is simply the derivative function of the quadratic approximation in the optimization problem.
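To make the comparison concrete, here is a minimal sketch of both iterations in Python (the example function, starting point, and tolerance are arbitrary choices for illustration):

```python
def newton_root(f, fprime, x0, tol=1e-10, maxiter=50):
    """Newton's method for finding a zero of f."""
    x = x0
    for _ in range(maxiter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

def newton_min(fprime, fsecond, x0, tol=1e-10, maxiter=50):
    """Newton's method for optimization: find a zero of f'."""
    x = x0
    for _ in range(maxiter):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = (x - 2)**2 + 1 has its minimum at x = 2,
# which is exactly the root of f'(x) = 2*(x - 2).
fp = lambda x: 2 * (x - 2)
fpp = lambda x: 2.0

print(newton_root(fp, fpp, x0=5.0))  # root of f'      -> 2.0
print(newton_min(fp, fpp, x0=5.0))   # minimizer of f  -> 2.0
```

The second routine is just the first applied to \(f'\), which is the point of the table above.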

Hessians, Gradients and Forms - Oh My!

Let’s review the theory of optimization for multivariate functions. Recall that in the single-variable case, extreme values (local extrema) occur at points where the first derivative is zero; however, the vanishing of the first derivative is not a sufficient condition for a local max or min. Generally, we apply the second derivative test to determine whether a candidate point is a max or min (sometimes it fails - if the second derivative either does not exist or is zero). In the multivariate case, the first and second derivatives are matrices. For a scalar-valued function on \(\mathbb{R}^n\), the first derivative is an \(n\times 1\) vector called the gradient (denoted \(\nabla f\)), and the second derivative is an \(n\times n\) matrix called the Hessian (denoted \(H\)).

Just to remind you, the gradient and Hessian are given by:

\[\begin{split}\nabla f(x) = \left(\begin{matrix}\frac{\partial f}{\partial x_1}\\ \vdots \\\frac{\partial f}{\partial x_n}\end{matrix}\right)\end{split}\] \[\begin{split}H = \left(\begin{matrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\[2.2ex] \dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\[2.2ex] \vdots & \vdots & \ddots & \vdots \\[2.2ex] \dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{matrix}\right)\end{split}\]

One of the first things to note about the Hessian - it’s symmetric. This structure leads to some useful properties in terms of interpreting critical points.

The multivariate analog of the test for a local max or min turns out to be a statement about the gradient and the Hessian matrix. Specifically, a function \(f:\mathbb{R}^n\rightarrow \mathbb{R}\) has a critical point at \(x\) if \(\nabla f(x) = 0\) (where zero is the zero vector!). Furthermore, the second derivative test at a critical point is as follows:

  • If \(H(x)\) is positive-definite, \(f\) has a local minimum at \(x\)
  • If \(H(x)\) is negative-definite, \(f\) has a local maximum at \(x\)
  • If \(H(x)\) has both positive and negative eigenvalues, \(f\) has a saddle point at \(x\).
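As a small numerical illustration, the test amounts to inspecting the eigenvalues of the Hessian at a critical point. The function \(f(x,y) = x^2 - y^2\), which has a saddle at the origin, is just an example choice:

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a critical point at the origin.
# Its Hessian there is [[2, 0], [0, -2]], with eigenvalues of both signs.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)  # eigvalsh assumes a symmetric matrix

if np.all(eigvals > 0):
    print("positive-definite Hessian -> local minimum")
elif np.all(eigvals < 0):
    print("negative-definite Hessian -> local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("mixed signs -> saddle point")
else:
    print("test is inconclusive (zero eigenvalue)")
```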

Newton CG Algorithm

Features:

  • Minimizes a ‘true’ quadratic on \(\mathbb{R}^n\) in \(n\) steps
  • Does NOT require storage or inversion of an \(n \times n\) matrix.

We begin with \(f:\mathbb{R}^n\rightarrow \mathbb{R}\). Take a quadratic approximation to \(f\):

\[f(x) \approx \frac12 x^T H x + b^Tx + c\]

Note that in the neighborhood of a minimum, \(H\) will be positive-definite (and symmetric). (If we are maximizing, just consider \(-H\)).

Setting the gradient of this quadratic approximation to zero reduces the optimization problem to solving the linear system

\[Hx = -b\]

This is a linear problem, which is nice. The dimension \(n\) may be very large - which is not so nice. Also, a priori it looks like we need to know \(H\). We actually don’t, but that will become clear only after a bit of explanation.
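This is reflected in how scipy exposes the method: `scipy.optimize.minimize` with `method='Newton-CG'` accepts a Hessian-vector product (`hessp`) in place of the full Hessian, so the \(n\times n\) matrix is never formed explicitly. A minimal usage sketch (the Rosenbrock test function and starting point are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess_prod

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Newton-CG needs the gradient; the Hessian is supplied implicitly
# through a Hessian-vector product, so no n x n matrix is stored.
res = minimize(rosen, x0, method='Newton-CG',
               jac=rosen_der, hessp=rosen_hess_prod,
               options={'xtol': 1e-8})

print(res.x)  # should be close to the minimizer [1, 1, 1, 1, 1]
```

The sections below explain why only Hessian-vector products are ever needed.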

General Inner Product

Recall the axiomatic definition of an inner product \(<,>_A\):

  • For any two vectors \(v,w\) we have \[\begin{split}<v,w>_A = <w,v>_A\end{split}\]
  • For any vector \(v\) \[\begin{split}<v,v>_A \;\geq 0\end{split}\]

    with equality \(\iff\) \(v=0\).

  • For \(c\in\mathbb{R}\) and \(u,v,w\in\mathbb{R}^n\), we have \[\begin{split}<cv+w,u>_A = c<v,u>_A + <w,u>_A\end{split}\]

These properties are known as symmetric, positive definite and bilinear, respectively.

Fact: If we denote the standard inner product on \(\mathbb{R}^n\) as \(<,>\) (this is the ‘dot product’), any symmetric, positive definite \(n\times n\) matrix \(A\) defines an inner product on \(\mathbb{R}^n\) via:

\[\begin{split}<v,w>_A \; = <v,Aw> = v^TAw\end{split}\]

Just as with the standard inner product, general inner products define a notion of ‘orthogonality’. Recall that with respect to the standard product, two vectors are orthogonal if their inner product vanishes. The same applies to \(<,>_A\):

\[\begin{split}<v,w>_A = 0\end{split}\]

means that \(v\) and \(w\) are orthogonal under the inner product induced by \(A\). Equivalently, if \(v,w\) are orthogonal under \(A\), we have:

\[v^TAw = 0\]

This is also called conjugate (thus the name of the method).

Conjugate Vectors

Suppose we have a set of \(n\) vectors \(p_1,...,p_n\) that are mutually conjugate. These vectors form a basis of \(\mathbb{R}^n\). Getting back to the problem at hand, this means that our solution vector \(x\) to the linear problem may be written as follows:

\[x = \sum\limits_{i=1}^n \alpha_i p_i\]

So, finding \(x\) reduces to finding a conjugate basis and the coefficients for \(x\) in that basis.

Note that, writing the linear system generically as \(Ax = b\) with \(A\) symmetric positive-definite:

\[{p}_k^{T} {b}={p}_k^{T} {A}{x}\]

and because \(x = \sum\limits_{i=1}^n \alpha_i p_i\), we have:

\[p_k^TAx = \sum\limits_{i=1}^n \alpha_i p_k^TA p_i\]

Since the \(p_i\) are mutually conjugate, every term with \(i\neq k\) vanishes, so we can solve for \(\alpha_k\):

\[\alpha_k = \frac{{p}_k^{T}{b}}{{p}_k^{T} {A}{p}_k} = \frac{\langle {p}_k, {b}\rangle}{\,\,\,\langle {p}_k, {p}_k\rangle_{A}} = \frac{\langle{p}_k, {b}\rangle}{\,\,\,\|{p}_k\|_{A}^2}.\]
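A quick numerical check of this formula (the random matrix, right-hand side, and the Gram-Schmidt construction of the conjugate basis below are all just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # a symmetric positive-definite matrix
b = rng.standard_normal(n)

# Build an A-conjugate basis from the standard basis by Gram-Schmidt
# in the inner product <v, w>_A = v.T @ A @ w.
ps = []
for e in np.eye(n):
    p = e.copy()
    for q in ps:
        p -= (q @ A @ p) / (q @ A @ q) * q
    ps.append(p)

# alpha_k = <p_k, b> / ||p_k||_A^2  and  x = sum_k alpha_k p_k
x = sum((p @ b) / (p @ A @ p) * p for p in ps)

print(np.allclose(A @ x, b))  # True: the expansion solves A x = b
```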

Now, all we need are the \(p_k\)'s.

A nice initial guess would be the gradient at some initial point \(x_1\). So, we set \(p_1 = \nabla f(x_1)\). Then set:

\[x_2 = x_1 + \alpha_1p_1\]

This should look familiar. In fact, it is gradient descent. For \(p_2\), we want \(p_1\) and \(p_2\) to be conjugate (under \(A\)). That just means orthogonal under the inner product induced by \(A\). We set

\[p_2 = \nabla f(x_1) - \frac{p_1^TA\nabla f(x_1)}{{p}_1^{T}{A}{p}_1} {p}_1\]

That is, we take the gradient at \(x_1\) and subtract its projection onto \(p_1\). This is the same as Gram-Schmidt orthogonalization.

In general, the next conjugate vector \(p_{k+1}\) is:

\[p_{k+1} = \nabla f(x_k) - \sum\limits_{i=1}^k\frac{p_i^T A \nabla f(x_k)}{p_i^TAp_i} p_i\]

The ‘trick’ is that in general, we do not need all \(n\) conjugate vectors.

The convergence rate depends on the sparsity and condition number of \(A\). The worst case is \(n^2\).
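In practice, the conjugate directions are generated one at a time by the standard conjugate gradient recurrence rather than by explicit Gram-Schmidt. Here is a minimal sketch of that recurrence for solving \(Ax = b\) (not scipy's implementation; the random test problem is just for illustration):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, maxiter=None):
    """Solve A x = b for symmetric positive-definite A by conjugate gradients."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x            # residual = negative gradient of the quadratic
    p = r.copy()             # first search direction
    maxiter = maxiter or n
    for _ in range(maxiter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # step length along p
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # makes the new direction A-conjugate to p
        p = r_new + beta * p
        r = r_new
    return x

# Example: a random symmetric positive-definite system.
rng = np.random.default_rng(1)
M = rng.standard_normal((10, 10))
A = M @ M.T + 10 * np.eye(10)
b = rng.standard_normal(10)
print(np.allclose(conjugate_gradient(A, b), np.linalg.solve(A, b)))  # True
```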

BFGS - Broyden–Fletcher–Goldfarb–Shanno

BFGS is a ‘quasi’ Newton method of optimization. Such methods are variants of the Newton method, where the Hessian \(H\) is replaced by some approximation. We wish to solve the equation:

\[B_k{p}_k = -\nabla f({x}_k)\]

for \(p_k\). This gives our search direction, and the next candidate point is given by:

\[x_{k+1} = x_k + \alpha_k p_k\]

where \(\alpha_k\) is a step size.

At each step, we require that the new approximate \(H\) meets the secant condition:

\[B_{k+1}(x_{k+1}-x_k) = \nabla f(x_{k+1}) -\nabla f(x_k)\]

There is a unique, rank one update that satisfies the above:

\[B_{k+1} = B_k + c_k v_kv_k^T\]

where

\[c_k = -\frac{1}{\left(B_k(x_{k+1}-x_k) - (\nabla f(x_{k+1})-\nabla f(x_k))\right)^T (x_{k+1}-x_k)}\]

and

\[v_k = B_k(x_{k+1}-x_k) - (\nabla f(x_{k+1})-\nabla f(x_k))\]

Note that the update does NOT preserve positive definiteness if \(c_k<0\). In this case, there are several options for the rank one correction, but we will not address them here. Instead, we will describe the BFGS method, which almost always guarantees a positive-definite correction. Specifically:

\[B_{k+1} = B_k + b_k g_k g_k^T + c_k B_k d_k d_k^TB_k\]

where we have introduced the shorthand:

\[g_k = \nabla f(x_{k+1}) - \nabla f(x_k) \;\;\;\;\;\;\; \mathrm{ and } \;\;\;\;\;\;\; d_k = x_{k+1} - x_k\]

If we set:

\[b_k = \frac{1}{g_k^Td_k} \;\;\;\;\; \mathrm{ and } \;\;\;\;\; c_k = -\frac{1}{d_k^TB_kd_k}\]

we satisfy the secant condition.
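Here is a minimal sketch of a single update of this form, checking the secant condition numerically. The quadratic test function, the unit step size, and the starting matrix \(B_0 = I\) are arbitrary choices, not scipy's implementation:

```python
import numpy as np

# A convex quadratic test function f(x) = 0.5 x^T A x, with gradient A x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
def grad(x):
    return A @ x

x_k = np.array([1.0, 1.0])
B_k = np.eye(2)                      # initial Hessian approximation

# One quasi-Newton step with step size alpha_k = 1.
p_k = np.linalg.solve(B_k, -grad(x_k))
x_next = x_k + p_k

# BFGS update: B_{k+1} = B_k + g g^T / (g^T d) - B_k d d^T B_k / (d^T B_k d),
# i.e. b_k = 1 / (g^T d) and c_k = -1 / (d^T B_k d) as above.
g = grad(x_next) - grad(x_k)         # g_k
d = x_next - x_k                     # d_k
B_next = (B_k
          + np.outer(g, g) / (g @ d)
          - (B_k @ np.outer(d, d) @ B_k) / (d @ B_k @ d))

# Secant condition: B_{k+1} d_k = g_k.
print(np.allclose(B_next @ d, g))    # True
```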

Nelder-Mead Simplex

While Newton’s method is considered a ‘second order method’ (requires the second derivative), and quasi-Newton methods are first order (require only first derivatives), Nelder-Mead is a zero-order method. I.e. NM requires only the function itself - no derivatives.

For \(f:\mathbb{R}^n\rightarrow \mathbb{R}\), the algorithm computes the values of the function on a simplex of dimension \(n\), constructed from \(n+1\) vertices. For a univariate function, the simplex is a line segment. In two dimensions, the simplex is a triangle, in 3D, a tetrahedral solid, and so on.

The algorithm begins with \(n+1\) starting points and then the following steps are repeated until convergence:

  • Compute the function at each of the points
  • Sort the function values so that \[f(x_1)\leq ...\leq f(x_{n+1})\]
  • Compute the centroid \(x_c\) of the n-dimensional region defined by \(x_1,...,x_n\)
  • Reflect \(x_{n+1}\) about the centroid to get \(x_r\) \[x_r = x_c + \alpha (x_c - x_{n+1})\]
  • Create a new simplex according to the following rules:
    • If \(f(x_1)\leq f(x_r) < f(x_n)\), replace \(x_{n+1}\) with \(x_r\)
    • If \(f(x_r)<f(x_1)\), expand the simplex through \(x_r\): \[x_e = x_c + \gamma (x_c - x_{n+1})\]

      If \(f(x_e)<f(x_r)\), replace \(x_{n+1}\) with \(x_e\), otherwise, replace \(x_{n+1}\) with \(x_r\)

    • If \(f({x}_{r}) \geq f({x}_{n})\), compute \(x_p = x_c + \rho(x_c - x_{n+1})\). If \(f({x}_{p}) < f({x}_{n+1})\), replace \(x_{n+1}\) with \(x_p\)
    • If all else fails, replace all points except \(x_1\) according to \[x_i = {x}_{1} + \sigma({x}_{i} - {x}_{1})\]

The default values of \(\alpha, \gamma,\rho\) and \(\sigma\) in scipy are not listed in the documentation, nor are they inputs to the function.
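A minimal usage sketch with scipy (the Rosenbrock test function, starting point, and tolerances are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Nelder-Mead needs only function values -- no gradients or Hessians.
res = minimize(rosen, x0, method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-8, 'maxiter': 10000})

print(res.x)     # close to the minimizer [1, 1, 1, 1, 1]
print(res.nfev)  # number of function evaluations used
```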

Powell’s Method

Powell’s method is another derivative-free optimization method that is similar to conjugate-gradient. The algorithm steps are as follows:

Begin with a point \(p_0\) (an initial guess) and a set of vectors \(\xi_1,...,\xi_n\), initially the standard basis of \(\mathbb{R}^n\).

  • For \(i=1,...,n\), find \(\lambda_i\) that minimizes \(f(p_{i-1} +\lambda_i \xi_i)\) and set \(p_i = p_{i-1} + \lambda_i\xi_i\)
  • For \(i=1,...,n-1\), replace \(\xi_{i}\) with \(\xi_{i+1}\) and then replace \(\xi_n\) with \(p_n - p_0\)
  • Choose \(\lambda\) so that \(f(p_0 + \lambda(p_n-p_0))\) is minimized and replace \(p_0\) with \(p_0 + \lambda(p_n-p_0)\)

Essentially, the algorithm performs line searches and tries to find fruitful directions to search.
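Again, a minimal usage sketch with scipy (same arbitrary test function and starting point as before):

```python
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Powell's method is also derivative-free; it performs successive
# line searches along an evolving set of directions.
res = minimize(rosen, x0, method='Powell',
               options={'xtol': 1e-8, 'ftol': 1e-8})

print(res.x)  # close to the minimizer [1, 1, 1, 1, 1]
```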
