
Optimization of standard statistical models


When we solve standard statistical problems, an optimization procedure similar to the ones discussed here is performed. For example, consider multivariate logistic regression - typically, a Newton-like algorithm known as iteratively reweighted least squares (IRLS) is used to find the maximum likelihood estimate for the generalized linear model family. However, using one of the multivariate scalar minimization methods shown above will also work - for example, the BFGS minimization algorithm.

The take-home message is that there is nothing magic going on when Python or R fits a statistical model using a formula - all that happens is that the objective function is set to be the negative of the log likelihood, and the minimum is found using some first or second order optimization algorithm.
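To make the recipe concrete before turning to logistic regression, here is a minimal toy sketch (made-up counts, not part of the original example): estimating a Poisson rate by handing the negative log-likelihood to a generic scipy.optimize routine.

import numpy as np
import scipy.optimize as opt
from scipy.special import gammaln

counts = np.array([2, 1, 4, 0, 3, 2, 5, 1])   # made-up observed counts

def neg_log_lik(lam, y):
    """Negative Poisson log-likelihood for rate lam."""
    return -np.sum(y * np.log(lam) - lam - gammaln(y + 1))

res = opt.minimize_scalar(neg_log_lik, bounds=(1e-6, 20), args=(counts,), method='bounded')
print(res.x, counts.mean())   # both are about 2.25 - the Poisson MLE is the sample mean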

# pandas, numpy and scipy.optimize are used in the examples below
import numpy as np
import pandas as pd
import scipy.optimize as opt
import statsmodels.api as sm

Logistic regression as optimization

Suppose we have a binary outcome measure \(Y \in \{0,1\}\) that is conditional on some input variable (vector) \(x \in (-\infty, +\infty)\). Let the conditional probability be \(p(x) = P(Y=1 \mid X=x)\). Given some data, one simple probability model is \(p(x) = \beta_0 + x\cdot\beta\) - i.e. linear regression. This doesn’t really work, for the obvious reason that \(p(x)\) must be between 0 and 1 as \(x\) ranges across the real line. One simple way to fix this is to model the log odds instead, \(g(x) = \log\frac{p(x)}{1 - p(x)} = \beta_0 + x\cdot\beta\). Solving for \(p\), we get

\[p(x) = \frac{1}{1 + e^{-(\beta_0 + x\cdot\beta)}}\]
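The algebra behind that step, spelled out: exponentiating the log odds gives

\[\frac{p(x)}{1 - p(x)} = e^{\beta_0 + x\cdot\beta}\]

and solving for \(p(x)\) yields

\[p(x) = \frac{e^{\beta_0 + x\cdot\beta}}{1 + e^{\beta_0 + x\cdot\beta}} = \frac{1}{1 + e^{-(\beta_0 + x\cdot\beta)}}\]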

As you all know very well, this is logistic regression.

Suppose we have \(n\) data points \((x_i, y_i)\) where \(x_i\) is a vector of features and \(y_i\) is an observed class (0 or 1). For each event, we either have “success” (\(y = 1\)) or “failure” (\(y = 0\)), so the likelihood looks like the product of Bernoulli random variables. According to the logistic model, the contribution of observation \(i\) to the likelihood is \(p(x_i)\) if \(y_i = 1\) and \(1-p(x_i)\) if \(y_i = 0\). So the likelihood is

\[L(\beta_0, \beta) = \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\]

and the log-likelihood is

\[\ell(\beta_0, \beta) = \sum_{i=1}^n \left[ y_i \log p(x_i) + (1-y_i)\log\bigl(1-p(x_i)\bigr) \right]\]

Using the standard ‘trick’, if we augment the matrix \(X\) with a column of 1s, we can write \(\beta_0 + x_i\cdot\beta\) as just \(X\beta\).
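Substituting the logistic form of \(p(x_i)\) into the log-likelihood and simplifying (with \(x_i\) now a row of the augmented \(X\)), the terms collapse to

\[\ell(\beta) = \sum_{i=1}^n \left[ y_i \, x_i\cdot\beta - \log\left(1 + e^{x_i\cdot\beta}\right) \right]\]

which is, up to sign, exactly the function minimized in the home-brew example at the end of this section.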

df_ = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
df_.head()
   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4
# We will ignore the rank categorical value

cols_to_keep = ['admit', 'gre', 'gpa']
df = df_[cols_to_keep]
df.insert(1, 'dummy', 1)
df.head()
   admit  dummy  gre   gpa
0      0      1  380  3.61
1      1      1  660  3.67
2      1      1  800  4.00
3      1      1  640  3.19
4      0      1  520  2.93

Solving as a GLM with IRLS

This is very similar to what you would do in R, only using Python’s statsmodels package. The GLM solver uses a special variant of Newton’s method known as iteratively reweighted least squares (IRLS), which will be further described in the lecture on multivariate and constrained optimization.
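For intuition, here is a bare-bones sketch of what an IRLS pass looks like for logistic regression. The function name and the stripped-down loop are my own illustration, not statsmodels' actual implementation, which adds prior weights, offsets and convergence checks.

def irls_logistic(X, y, niter=10):
    """Illustrative IRLS for logistic regression: repeated weighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(niter):
        p = 1 / (1 + np.exp(-X @ beta))      # fitted probabilities at current beta
        w = p * (1 - p)                      # working weights (diagonal of W)
        z = X @ beta + (y - p) / w           # working (adjusted) response
        # weighted least squares update: solve (X^T W X) beta = X^T W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# e.g. irls_logistic(df[['dummy', 'gre', 'gpa']].values, df['admit'].values)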

model = sm.GLM.from_formula('admit ~ gre + gpa',
                            data=df, family=sm.families.Binomial())
fit = model.fit()
fit.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  400
Model:                            GLM   Df Residuals:                      397
Model Family:                Binomial   Df Model:                            2
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -240.17
Date:                Wed, 11 Feb 2015   Deviance:                       480.34
Time:                        17:29:26   Pearson chi2:                     398.
No. Iterations:                     5
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -4.9494      1.075     -4.604      0.000        -7.057    -2.842
gre            0.0027      0.001      2.544      0.011         0.001     0.005
gpa            0.7547      0.320      2.361      0.018         0.128     1.381
==============================================================================

Solving as a logistic model with BFGS

Note that you can choose any of the scipy.optimize algorithms to fit the maximum likelihood model. Because statsmodels supplies the analytic gradient (and Hessian) of the log likelihood, this will be more accurate than the home-brew version below, which relies on numerical approximations to the derivatives.

model2 = sm.Logit.from_formula('admit ~ %s' % '+'.join(df.columns[2:]), data=df)
fit2 = model2.fit(method='bfgs', maxiter=100)
fit2.summary()
Optimization terminated successfully.
         Current function value: 0.600430
         Iterations: 23
         Function evaluations: 65
         Gradient evaluations: 54
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  400
Model:                          Logit   Df Residuals:                      397
Method:                           MLE   Df Model:                            2
Date:                Wed, 11 Feb 2015   Pseudo R-squ.:                 0.03927
Time:                        17:31:19   Log-Likelihood:                -240.17
converged:                       True   LL-Null:                       -249.99
                                        LLR p-value:                 5.456e-05
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -4.9494      1.075     -4.604      0.000        -7.057    -2.842
gre            0.0027      0.001      2.544      0.011         0.001     0.005
gpa            0.7547      0.320      2.361      0.018         0.128     1.381
==============================================================================

Home-brew logistic regression using a generic minimization function

This is to show that there is no magic going on - you can write the function to minimize directly from the log-likelihood equation and run a minimizer. It will be more accurate if you also provide the derivative (and the Hessian for second order methods), but using just the function and numerical approximations to the derivative will also work. As usual, this is for illustration so you understand what is going on - when there is a library function available, you should probably use that instead.

def f(beta, y, x):
    """Minus log likelihood function for logistic regression."""
    eta = np.dot(x, beta)                       # linear predictor X beta
    return -((y * eta).sum() - np.log(1 + np.exp(eta)).sum())

beta0 = np.zeros(3)
opt.minimize(f, beta0, args=(df['admit'], df.loc[:, 'dummy':]), method='BFGS', options={'gtol': 1e-2})
  status: 0
 success: True
    njev: 16
    nfev: 80
hess_inv: array([[  1.1525e+00,  -2.7800e-04,  -2.8160e-01],
      [ -2.7800e-04,   1.1663e-06,  -1.2190e-04],
      [ -2.8160e-01,  -1.2190e-04,   1.0259e-01]])
     fun: 240.1719908951104
       x: array([ -4.9493e+00,   2.6903e-03,   7.5473e-01])
 message: 'Optimization terminated successfully.'
     jac: array([  9.1553e-05,  -3.2158e-03,   4.5776e-04])
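As noted above, supplying the derivative improves accuracy. A sketch of the analytic gradient for the same objective (the name grad_f is my own) and how it could be passed to the minimizer:

def grad_f(beta, y, x):
    """Gradient of the minus log likelihood: X^T (p - y)."""
    p = 1 / (1 + np.exp(-np.dot(x, beta)))      # fitted probabilities
    return np.dot(x.T, p - y)

opt.minimize(f, beta0, args=(df['admit'], df.loc[:, 'dummy':]),
             jac=grad_f, method='BFGS', options={'gtol': 1e-2})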

