- Introduction to Python
- Getting started with Python and the IPython notebook
- Functions are first class objects
- Data science is OSEMN
- Working with text
- Preprocessing text data
- Working with structured data
- Using SQLite3
- Using HDF5
- Using numpy
- Using Pandas
- Computational problems in statistics
- Computer numbers and mathematics
- Algorithmic complexity
- Linear Algebra and Linear Systems
- Linear Algebra and Matrix Decompositions
- Change of Basis
- Optimization and Non-linear Methods
- Practical Optimization Routines
- Finding roots
- Optimization Primer
- Using scipy.optimize
- Gradient descent
- Newton’s method and variants
- Constrained optimization
- Curve fitting
- Finding parameters for ODE models
- Optimization of graph node placement
- Optimization of standard statistical models
- Fitting ODEs with the Levenberg–Marquardt algorithm
- 1D example
- 2D example
- Algorithms for Optimization and Root Finding for Multivariate Problems
- Expectation Maximization (EM) Algorithm
- Monte Carlo Methods
- Resampling methods
- Resampling
- Simulations
- Setting the random seed
- Sampling with and without replacement
- Calculation of Cook’s distance
- Permutation resampling
- Design of simulation experiments
- Example: Simulations to estimate power
- Check with R
- Estimating the CDF
- Estimating the PDF
- Kernel density estimation
- Multivariate kernel density estimation
- Markov Chain Monte Carlo (MCMC)
- Using PyMC2
- Using PyMC3
- Using PyStan
- C Crash Course
- Code Optimization
- Using C code in Python
- Using functions from various compiled languages in Python
- Julia and Python
- Converting Python Code to C for speed
- Optimization bake-off
- Writing Parallel Code
- Massively parallel programming with GPUs
- Writing CUDA in C
- Distributed computing for Big Data
- Hadoop MapReduce on AWS EMR with mrjob
- Spark on a local machine using 4 nodes
- Modules and Packaging
- Tour of the Jupyter (IPython3) notebook
- Polyglot programming
- What you should know and learn more about
- Wrapping R libraries with Rpy
Estimating parameters of a linear regression model
We will show how to estimate regression parameters using a simple linear model
\[y \sim ax + b\]
We can restate the linear model
\[y = ax + b + \epsilon\]
as sampling from a probability distribution
\[y \sim \mathcal{N}(ax + b, \sigma^2)\]
Now we can use pymc to estimate the parameters \(a\), \(b\) and \(\sigma\) (PyMC2 uses the precision \(\tau\), which is \(1/\sigma^2\), so a simple transformation is needed there). We will assume the following priors
\[\begin{split}a \sim \mathcal{N}(0, 100) \\ b \sim \mathcal{N}(0, 100) \\ \tau \sim \text{Gamma}(0.1, 0.1)\end{split}\]
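The PyMC3 code below places a Uniform prior on \(\sigma\) directly; if we instead wanted to encode the Gamma prior on the precision \(\tau\), a minimal sketch (assuming PyMC3 with Theano available, and a separate model context named model_tau purely for illustration) would recover \(\sigma\) via \(\sigma = 1/\sqrt{\tau}\):

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model_tau:
    # Gamma prior on the precision tau, as in the PyMC2 parameterization
    tau = pm.Gamma('tau', alpha=0.1, beta=0.1)
    # transform back to the standard deviation: sigma = 1/sqrt(tau)
    sigma = pm.Deterministic('sigma', 1/tt.sqrt(tau))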
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt

# observed data
n = 11
_a = 6
_b = 2
x = np.linspace(0, 1, n)
y = _a*x + _b + np.random.randn(n)

niter = 1000

with pm.Model() as model:
    a = pm.Normal('a', mu=0, sd=20)
    b = pm.Normal('b', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)

    y_est = a*x + b  # simple auxiliary variable

    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)

    # inference
    start = pm.find_MAP()
    step = pm.NUTS()  # Hamiltonian MCMC with No U-Turn Sampler
    trace = pm.sample(niter, step, start, random_seed=123, progressbar=True)

    pm.traceplot(trace);
[-----------------100%-----------------] 1000 of 1000 complete in 8.9 sec
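After sampling, the posterior means of \(a\), \(b\) and \(\sigma\) can be read directly off the trace; a short sketch, assuming the trace object from the cell above and discarding the first half of the samples as burn-in:

burn = niter // 2
for name in ('a', 'b', 'sigma'):
    # posterior mean of each parameter after burn-in
    print(name, trace[name][burn:].mean())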
Alternative formulation using GLM formulas
data = dict(x=x, y=y)

with pm.Model() as model:
    pm.glm.glm('y ~ x', data)
    step = pm.NUTS()
    trace = pm.sample(2000, step, progressbar=True)
[-----------------100%-----------------] 2000 of 2000 complete in 8.1 sec
pm.traceplot(trace);
plt.figure(figsize=(7, 7))
plt.scatter(x, y, s=30, label='data')
pm.glm.plot_posterior_predictive(trace, samples=100,
                                 label='posterior predictive regression lines',
                                 c='blue', alpha=0.2)
plt.plot(x, _a*x + _b, label='true regression line', lw=3., c='red')
plt.legend(loc='best');
Simple Logistic model
We have observations of height and weight and want to use a logistic model to guess the sex.
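Concretely, the GLM fitted below is a standard logistic regression: writing \(\beta_0, \beta_h, \beta_w\) for the intercept and the height and weight coefficients, the probability of being male is

\[P(\text{male} = 1 \mid \text{height}, \text{weight}) = \frac{1}{1 + e^{-(\beta_0 + \beta_h \cdot \text{height} + \beta_w \cdot \text{weight})}}\]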
import pandas as pd

# observed data
df = pd.read_csv('HtWt.csv')
df.head()
| | male | height | weight |
|---|---|---|---|
| 0 | 0 | 63.2 | 168.7 |
| 1 | 0 | 68.7 | 169.8 |
| 2 | 0 | 64.8 | 176.6 |
| 3 | 0 | 67.9 | 246.8 |
| 4 | 1 | 68.9 | 151.6 |
niter = 1000

with pm.Model() as model:
    pm.glm.glm('male ~ height + weight', df, family=pm.glm.families.Binomial())
    trace = pm.sample(niter, step=pm.Slice(), random_seed=123, progressbar=True)
[-----------------100%-----------------] 1000 of 1000 complete in 3.2 sec
# note that height and weight in the trace refer to the coefficients
df_trace = pm.trace_to_dataframe(trace)
pd.scatter_matrix(df_trace[-1000:], diagonal='kde');
plt.figure(figsize=(12, 4))
plt.subplot(121)
plt.plot(df_trace.ix[-1000:, 'height'], linewidth=0.7)
plt.subplot(122)
plt.plot(df_trace.ix[-1000:, 'weight'], linewidth=0.7);
There is no convergence!
Because of the strong correlation between height and weight, a one-at-a-time sampler such as the slice sampler or Gibbs sampler will take a long time to converge. HMC does much better.
niter = 1000

with pm.Model() as model:
    pm.glm.glm('male ~ height + weight', df, family=pm.glm.families.Binomial())
    trace = pm.sample(niter, step=pm.NUTS(), random_seed=123, progressbar=True)
[-----------------100%-----------------] 1001 of 1000 complete in 27.0 sec
df_trace = pm.trace_to_dataframe(trace)
pd.scatter_matrix(df_trace[-1000:], diagonal='kde');
pm.summary(trace);
Intercept:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  -51.393          11.299           0.828            [-73.102, -29.353]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  -76.964        -58.534        -50.383        -43.856        -30.630

height:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  0.747            0.170            0.012            [0.422, 1.096]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  0.453          0.630          0.732          0.853          1.139

weight:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  0.011            0.012            0.001            [-0.012, 0.034]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  -0.012         0.002          0.010          0.019          0.034
import seaborn as sn

sn.kdeplot(trace['weight'], trace['height'])
plt.xlabel('Weight', fontsize=20)
plt.ylabel('Height', fontsize=20)
plt.style.use('ggplot')
We can use the logistic regression results to classify subjects as male or female based on their height and weight, using 0.5 as the cutoff, as shown in the plot below. Green dots are correctly classified males, yellow dots are correctly classified females, and a red halo marks a misclassified subject. The blue line shows the 0.5 cutoff.
intercept, height, weight = df_trace[-niter//2:].mean(0)

def predict(w, h, height=height, weight=weight):
    """Predict gender given weight (w) and height (h) values."""
    v = intercept + height*h + weight*w
    return np.exp(v)/(1 + np.exp(v))

# calculate predictions on grid
xs = np.linspace(df.weight.min(), df.weight.max(), 100)
ys = np.linspace(df.height.min(), df.height.max(), 100)
X, Y = np.meshgrid(xs, ys)
Z = predict(X, Y)

plt.figure(figsize=(6, 6))

# plot 0.5 contour line - classify as male if above this line
plt.contour(X, Y, Z, levels=[0.5])

# classify all subjects
colors = ['lime' if i else 'yellow' for i in df.male]
ps = predict(df.weight, df.height)
errs = ((ps < 0.5) & df.male) | ((ps >= 0.5) & (1 - df.male))
plt.scatter(df.weight[errs], df.height[errs], facecolors='red', s=150)
plt.scatter(df.weight, df.height, facecolors=colors, edgecolors='k', s=50, alpha=1);
plt.xlabel('Weight', fontsize=16)
plt.ylabel('Height', fontsize=16)
plt.title('Gender classification by weight and height', fontsize=16)
plt.tight_layout();
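As a quick numeric check, the in-sample misclassification rate is just the mean of the error indicator; a sketch reusing the ps and errs arrays computed in the previous cell:

# fraction of subjects falling on the wrong side of the 0.5 contour
print('misclassification rate: {:.2%}'.format(errs.mean()))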