
Optimizers


Newton-Conjugate Gradient

First a note about the interpretations of Newton’s method in 1-D:

In the lecture on 1-D optimization, Newton’s method was presented as a method of finding zeros. That is what it is, but it may also be interpreted as a method of optimization. In the latter case, we are really looking for zeroes of the first derivative.

Let’s compare the formulas for clarification:

\[\begin{split}\begin{array}{|c|c|c|c|} \hline \text{Finding roots of } f & \text{Geometric Interpretation} & \text{Finding Extrema of } f & \text{Geometric Interpretation} \\ \hline x_{n+1} = x_n -\frac{f(x_n)}{f'(x_n)} &\text{Invert linear approximation to }f & x_{n+1} = x_n -\frac{f'(x_n)}{f''(x_n)}& \text{Use quadratic approximation of } f \\ \hline \end{array}\end{split}\]

These are two ways of looking at exactly the same problem. For instance, the linear approximation in the root finding problem is simply the derivative function of the quadratic approximation in the optimization problem.
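To make the comparison concrete, here is a minimal sketch of both iterations in Python (the example function, starting point, and tolerance are arbitrary choices for illustration):

```python
def newton_root(f, fprime, x0, tol=1e-10, maxiter=50):
    """Newton's method for finding a zero of f."""
    x = x0
    for _ in range(maxiter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

def newton_min(fprime, fsecond, x0, tol=1e-10, maxiter=50):
    """Newton's method for optimization: find a zero of f'."""
    x = x0
    for _ in range(maxiter):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = (x - 2)**2 + 1 has its minimum at x = 2,
# which is exactly the root of f'(x) = 2*(x - 2).
fp = lambda x: 2 * (x - 2)
fpp = lambda x: 2.0

print(newton_root(fp, fpp, x0=5.0))  # root of f'      -> 2.0
print(newton_min(fp, fpp, x0=5.0))   # minimizer of f  -> 2.0
```

The second routine is just the first applied to \(f'\), which is the point of the table above.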

Hessians, Gradients and Forms - Oh My!

Let’s review the theory of optimization for multivariate functions. Recall that in the single-variable case, extreme values (local extrema) occur at points where the first derivative is zero; however, the vanishing of the first derivative is not a sufficient condition for a local max or min. Generally, we apply the second derivative test to determine whether a candidate point is a max or min (sometimes it fails - if the second derivative either does not exist or is zero). In the multivariate case, the first and second derivatives are matrices. For a scalar-valued function on \(\mathbb{R}^n\), the first derivative is an \(n\times 1\) vector called the gradient (denoted \(\nabla f\)), and the second derivative is an \(n\times n\) matrix called the Hessian (denoted \(H\)).

Just to remind you, the gradient and Hessian are given by:

\[\begin{split}\nabla f(x) = \left(\begin{matrix}\frac{\partial f}{\partial x_1}\\ \vdots \\\frac{\partial f}{\partial x_n}\end{matrix}\right)\end{split}\] \[\begin{split}H = \left(\begin{matrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\[2.2ex] \dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\[2.2ex] \vdots & \vdots & \ddots & \vdots \\[2.2ex] \dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{matrix}\right)\end{split}\]

One of the first things to note about the Hessian - it’s symmetric. This structure leads to some useful properties in terms of interpreting critical points.

The multivariate analog of the test for a local max or min turns out to be a statement about the gradient and the Hessian matrix. Specifically, a function \(f:\mathbb{R}^n\rightarrow \mathbb{R}\) has a critical point at \(x\) if \(\nabla f(x) = 0\) (where zero is the zero vector!). Furthermore, the second derivative test at a critical point is as follows:

  • If \(H(x)\) is positive-definite, \(f\) has a local minimum at \(x\)
  • If \(H(x)\) is negative-definite, \(f\) has a local maximum at \(x\)
  • If \(H(x)\) has both positive and negative eigenvalues, \(f\) has a saddle point at \(x\).
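As a small numerical illustration, the test amounts to inspecting the eigenvalues of the Hessian at a critical point. The function \(f(x,y) = x^2 - y^2\), which has a saddle at the origin, is just an example choice:

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a critical point at the origin.
# Its Hessian there is [[2, 0], [0, -2]], with eigenvalues of both signs.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)  # eigvalsh assumes a symmetric matrix

if np.all(eigvals > 0):
    print("positive-definite Hessian -> local minimum")
elif np.all(eigvals < 0):
    print("negative-definite Hessian -> local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("mixed signs -> saddle point")
else:
    print("test is inconclusive (zero eigenvalue)")
```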

Newton CG Algorithm

Features:

  • Minimizes a ‘true’ quadratic on \(\mathbb{R}^n\) in \(n\) steps
  • Does NOT require storage or inversion of an \(n \times n\) matrix.

We begin with \(f:\mathbb{R}^n\rightarrow \mathbb{R}\). Take a quadratic approximation to \(f\):

\[f(x) \approx \frac12 x^T H x + b^Tx + c\]

Note that in the neighborhood of a minimum, \(H\) will be positive-definite (and symmetric). (If we are maximizing, just consider \(-H\)).

Setting the gradient of this quadratic approximation to zero reduces the optimization problem to solving the linear system

\[Hx = -b\]

This is a linear problem, which is nice. The dimension \(n\) may be very large - which is not so nice. Also, a priori it looks like we need to know \(H\). We actually don’t, but that will become clear only after a bit of explanation.
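This is reflected in how scipy exposes the method: `scipy.optimize.minimize` with `method='Newton-CG'` accepts a Hessian-vector product (`hessp`) in place of the full Hessian, so the \(n\times n\) matrix is never formed explicitly. A minimal usage sketch (the Rosenbrock test function and starting point are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess_prod

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Newton-CG needs the gradient; the Hessian is supplied implicitly
# through a Hessian-vector product, so no n x n matrix is stored.
res = minimize(rosen, x0, method='Newton-CG',
               jac=rosen_der, hessp=rosen_hess_prod,
               options={'xtol': 1e-8})

print(res.x)  # should be close to the minimizer [1, 1, 1, 1, 1]
```

The sections below explain why only Hessian-vector products are ever needed.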

General Inner Product

Recall the axiomatic definition of an inner product \(<,>_A\):

  • For any two vectors \(v,w\) we have \[\begin{split}<v,w>_A = <w,v>_A\end{split}\]
  • For any vector \(v\) \[\begin{split}<v,v>_A \;\geq 0\end{split}\]

    with equality \(\iff\) \(v=0\).

  • For \(c\in\mathbb{R}\) and \(u,v,w\in\mathbb{R}^n\), we have \[\begin{split}<cv+w,u>_A = c<v,u>_A + <w,u>_A\end{split}\]

These properties are known as symmetric, positive definite and bilinear, respectively.

Fact: If we denote the standard inner product on \(\mathbb{R}^n\) as \(<,>\) (this is the ‘dot product’), any symmetric, positive definite \(n\times n\) matrix \(A\) defines an inner product on \(\mathbb{R}^n\) via:

\[\begin{split}<v,w>_A \; = <v,Aw> = v^TAw\end{split}\]

Just as with the standard inner product, general inner products define a notion of ‘orthogonality’. Recall that with respect to the standard product, two vectors are orthogonal if their inner product vanishes. The same applies to \(<,>_A\):

\[\begin{split}<v,w>_A = 0\end{split}\]

means that \(v\) and \(w\) are orthogonal under the inner product induced by \(A\). Equivalently, if \(v,w\) are orthogonal under \(A\), we have:

\[v^TAw = 0\]

This is also called conjugate (thus the name of the method).

Conjugate Vectors

Suppose we have a set of \(n\) vectors \(p_1,...,p_n\) that are mutually conjugate. These vectors form a basis of \(\mathbb{R}^n\). Getting back to the problem at hand, this means that our solution vector \(x\) to the linear problem may be written as follows:

\[x = \sum\limits_{i=1}^n \alpha_i p_i\]

So, finding \(x\) reduces to finding a conjugate basis and the coefficients for \(x\) in that basis.

Note that, writing the linear system generically as \(Ax = b\) with \(A\) symmetric positive-definite:

\[{p}_k^{T} {b}={p}_k^{T} {A}{x}\]

and because \(x = \sum\limits_{i=1}^n \alpha_i p_i\), we have:

\[p_k^TAx = \sum\limits_{i=1}^n \alpha_i p_k^TA p_i\]

Since the \(p_i\) are mutually conjugate, every term with \(i\neq k\) vanishes, so we can solve for \(\alpha_k\):

\[\alpha_k = \frac{{p}_k^{T}{b}}{{p}_k^{T} {A}{p}_k} = \frac{\langle {p}_k, {b}\rangle}{\,\,\,\langle {p}_k, {p}_k\rangle_{A}} = \frac{\langle{p}_k, {b}\rangle}{\,\,\,\|{p}_k\|_{A}^2}.\]
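A quick numerical check of this formula (the random matrix, right-hand side, and the Gram-Schmidt construction of the conjugate basis below are all just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # a symmetric positive-definite matrix
b = rng.standard_normal(n)

# Build an A-conjugate basis from the standard basis by Gram-Schmidt
# in the inner product <v, w>_A = v.T @ A @ w.
ps = []
for e in np.eye(n):
    p = e.copy()
    for q in ps:
        p -= (q @ A @ p) / (q @ A @ q) * q
    ps.append(p)

# alpha_k = <p_k, b> / ||p_k||_A^2  and  x = sum_k alpha_k p_k
x = sum((p @ b) / (p @ A @ p) * p for p in ps)

print(np.allclose(A @ x, b))  # True: the expansion solves A x = b
```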

Now, all we need are the \(p_k\)'s.

A nice initial guess would be the gradient at some initial point \(x_1\). So, we set \(p_1 = \nabla f(x_1)\). Then set:

\[x_2 = x_1 + \alpha_1p_1\]

This should look familiar. In fact, it is gradient descent. For \(p_2\), we want \(p_1\) and \(p_2\) to be conjugate (under \(A\)). That just means orthogonal under the inner product induced by \(A\). We set

\[p_2 = \nabla f(x_1) - \frac{p_1^TA\nabla f(x_1)}{{p}_1^{T}{A}{p}_1} {p}_1\]

That is, we take the gradient at \(x_1\) and subtract its projection onto \(p_1\). This is the same as Gram-Schmidt orthogonalization.

In general, the next conjugate vector \(p_{k+1}\) is:

\[p_{k+1} = \nabla f(x_k) - \sum\limits_{i=1}^k\frac{p_i^T A \nabla f(x_k)}{p_i^TAp_i} p_i\]

The ‘trick’ is that in general, we do not need all \(n\) conjugate vectors.

The convergence rate depends on the sparsity and condition number of \(A\). The worst case is \(n^2\).
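In practice, the conjugate directions are generated one at a time by the standard conjugate gradient recurrence rather than by explicit Gram-Schmidt. Here is a minimal sketch of that recurrence for solving \(Ax = b\) (not scipy's implementation; the random test problem is just for illustration):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, maxiter=None):
    """Solve A x = b for symmetric positive-definite A by conjugate gradients."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x            # residual = negative gradient of the quadratic
    p = r.copy()             # first search direction
    maxiter = maxiter or n
    for _ in range(maxiter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # step length along p
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # makes the new direction A-conjugate to p
        p = r_new + beta * p
        r = r_new
    return x

# Example: a random symmetric positive-definite system.
rng = np.random.default_rng(1)
M = rng.standard_normal((10, 10))
A = M @ M.T + 10 * np.eye(10)
b = rng.standard_normal(10)
print(np.allclose(conjugate_gradient(A, b), np.linalg.solve(A, b)))  # True
```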

BFGS - Broyden–Fletcher–Goldfarb–Shanno

BFGS is a ‘quasi’ Newton method of optimization. Such methods are variants of the Newton method, where the Hessian \(H\) is replaced by some approximation. We wish to solve the equation:

\[B_k{p}_k = -\nabla f({x}_k)\]

for \(p_k\). This gives our search direction, and the next candidate point is given by:

\[x_{k+1} = x_k + \alpha_k p_k\]

where \(\alpha_k\) is a step size.

At each step, we require that the new approximate \(H\) meets the secant condition:

\[B_{k+1}(x_{k+1}-x_k) = \nabla f(x_{k+1}) -\nabla f(x_k)\]

There is a unique, rank one update that satisfies the above:

\[B_{k+1} = B_k + c_k v_kv_k^T\]

where

\[c_k = -\frac{1}{\left(B_k(x_{k+1}-x_k) - (\nabla f(x_{k+1})-\nabla f(x_k))\right)^T (x_{k+1}-x_k)}\]

and

\[v_k = B_k(x_{k+1}-x_k) - (\nabla f(x_{k+1})-\nabla f(x_k))\]

Note that the update does NOT preserve positive definiteness if \(c_k<0\). In this case, there are several options for the rank one correction, but we will not address them here. Instead, we will describe the BFGS method, which almost always guarantees a positive-definite correction. Specifically:

\[B_{k+1} = B_k + b_k g_k g_k^T + c_k B_k d_k d_k^TB_k\]

where we have introduced the shorthand:

\[g_k = \nabla f(x_{k+1}) - \nabla f(x_k) \;\;\;\;\;\;\; \mathrm{ and } \;\;\;\;\;\;\; d_k = x_{k+1} - x_k\]

If we set:

\[b_k = \frac{1}{g_k^Td_k} \;\;\;\;\; \mathrm{ and } \;\;\;\;\; c_k = -\frac{1}{d_k^TB_kd_k}\]

we satisfy the secant condition.
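Here is a minimal sketch of a single update of this form, checking the secant condition numerically. The quadratic test function, the unit step size, and the starting matrix \(B_0 = I\) are arbitrary choices, not scipy's implementation:

```python
import numpy as np

# A convex quadratic test function f(x) = 0.5 x^T A x, with gradient A x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
def grad(x):
    return A @ x

x_k = np.array([1.0, 1.0])
B_k = np.eye(2)                      # initial Hessian approximation

# One quasi-Newton step with step size alpha_k = 1.
p_k = np.linalg.solve(B_k, -grad(x_k))
x_next = x_k + p_k

# BFGS update: B_{k+1} = B_k + g g^T / (g^T d) - B_k d d^T B_k / (d^T B_k d),
# i.e. b_k = 1 / (g^T d) and c_k = -1 / (d^T B_k d) as above.
g = grad(x_next) - grad(x_k)         # g_k
d = x_next - x_k                     # d_k
B_next = (B_k
          + np.outer(g, g) / (g @ d)
          - (B_k @ np.outer(d, d) @ B_k) / (d @ B_k @ d))

# Secant condition: B_{k+1} d_k = g_k.
print(np.allclose(B_next @ d, g))    # True
```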

Nelder-Mead Simplex

While Newton’s method is considered a ‘second order method’ (requires the second derivative), and quasi-Newton methods are first order (require only first derivatives), Nelder-Mead is a zero-order method. I.e. NM requires only the function itself - no derivatives.

For \(f:\mathbb{R}^n\rightarrow \mathbb{R}\), the algorithm computes the values of the function on a simplex of dimension \(n\), constructed from \(n+1\) vertices. For a univariate function, the simplex is a line segment. In two dimensions, the simplex is a triangle, in 3D, a tetrahedral solid, and so on.

The algorithm begins with \(n+1\) starting points and then the following steps are repeated until convergence:

  • Compute the function at each of the points
  • Sort the function values so that \[f(x_1)\leq ...\leq f(x_{n+1})\]
  • Compute the centroid \(x_c\) of the n-dimensional region defined by \(x_1,...,x_n\)
  • Reflect \(x_{n+1}\) about the centroid to get \(x_r\) \[x_r = x_c + \alpha (x_c - x_{n+1})\]
  • Create a new simplex according to the following rules:
    • If \(f(x_1)\leq f(x_r) < f(x_n)\), replace \(x_{n+1}\) with \(x_r\)
    • If \(f(x_r)<f(x_1)\), expand the simplex through \(x_r\): \[x_e = x_c + \gamma (x_c - x_{n+1})\]

      If \(f(x_e)<f(x_r)\), replace \(x_{n+1}\) with \(x_e\), otherwise, replace \(x_{n+1}\) with \(x_r\)

    • If \(f({x}_{r}) \geq f({x}_{n})\), compute \(x_p = x_c + \rho(x_c - x_{n+1})\). If \(f({x}_{p}) < f({x}_{n+1})\), replace \(x_{n+1}\) with \(x_p\)
    • If all else fails, replace all points except \(x_1\) according to \[x_i = {x}_{1} + \sigma({x}_{i} - {x}_{1})\]

The default values of \(\alpha, \gamma,\rho\) and \(\sigma\) in scipy are not listed in the documentation, nor are they inputs to the function.
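A minimal usage sketch with scipy (the Rosenbrock test function, starting point, and tolerances are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Nelder-Mead needs only function values -- no gradients or Hessians.
res = minimize(rosen, x0, method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-8, 'maxiter': 10000})

print(res.x)     # close to the minimizer [1, 1, 1, 1, 1]
print(res.nfev)  # number of function evaluations used
```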

Powell’s Method

Powell’s method is another derivative-free optimization method that is similar to conjugate-gradient. The algorithm steps are as follows:

Begin with a point \(p_0\) (an initial guess) and a set of vectors \(\xi_1,...,\xi_n\), initially the standard basis of \(\mathbb{R}^n\).

  • For \(i=1,...,n\), find \(\lambda_i\) that minimizes \(f(p_{i-1} +\lambda_i \xi_i)\) and set \(p_i = p_{i-1} + \lambda_i\xi_i\)
  • For \(i=1,...,n-1\), replace \(\xi_{i}\) with \(\xi_{i+1}\) and then replace \(\xi_n\) with \(p_n - p_0\)
  • Choose \(\lambda\) so that \(f(p_0 + \lambda(p_n-p_0))\) is minimized and replace \(p_0\) with \(p_0 + \lambda(p_n-p_0)\)

Essentially, the algorithm performs line searches and tries to find fruitful directions to search.
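Again, a minimal usage sketch with scipy (same arbitrary test function and starting point as before):

```python
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Powell's method is also derivative-free; it performs successive
# line searches along an evolving set of directions.
res = minimize(rosen, x0, method='Powell',
               options={'xtol': 1e-8, 'ftol': 1e-8})

print(res.x)  # close to the minimizer [1, 1, 1, 1, 1]
```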
