- Introduction to Python
- Getting started with Python and the IPython notebook
- Functions are first class objects
- Data science is OSEMN
- Working with text
- Preprocessing text data
- Working with structured data
- Using SQLite3
- Using HDF5
- Using numpy
- Using Pandas
- Computational problems in statistics
- Computer numbers and mathematics
- Algorithmic complexity
- Linear Algebra and Linear Systems
- Linear Algebra and Matrix Decompositions
- Change of Basis
- Optimization and Non-linear Methods
- Practical Optimizatio Routines
- Finding roots
- Optimization Primer
- Using scipy.optimize
- Gradient deescent
- Newton’s method and variants
- Constrained optimization
- Curve fitting
- Finding paraemeters for ODE models
- Optimization of graph node placement
- Optimization of standard statistical models
- Fitting ODEs with the Levenberg–Marquardt algorithm
- 1D example
- 2D example
- Algorithms for Optimization and Root Finding for Multivariate Problems
- Expectation Maximizatio (EM) Algorithm
- Monte Carlo Methods
- Resampling methods
- Resampling
- Simulations
- Setting the random seed
- Sampling with and without replacement
- Calculation of Cook’s distance
- Permutation resampling
- Design of simulation experiments
- Example: Simulations to estimate power
- Check with R
- Estimating the CDF
- Estimating the PDF
- Kernel density estimation
- Multivariate kerndel density estimation
- Markov Chain Monte Carlo (MCMC)
- Using PyMC2
- Using PyMC3
- Using PyStan
- C Crash Course
- Code Optimization
- Using C code in Python
- Using functions from various compiled languages in Python
- Julia and Python
- Converting Python Code to C for speed
- Optimization bake-off
- Writing Parallel Code
- Massively parallel programming with GPUs
- Writing CUDA in C
- Distributed computing for Big Data
- Hadoop MapReduce on AWS EMR with mrjob
- Spark on a local mahcine using 4 nodes
- Modules and Packaging
- Tour of the Jupyter (IPython3) notebook
- Polyglot programming
- What you should know and learn more about
- Wrapping R libraries with Rpy
References
Coin toss
We’ll repeat the example of determining the bias of a coin from observed coin tosses. The likelihood is binomial, and we use a beta prior.
coin_code = """ data { int<lower=0> n; // number of tosses int<lower=0> y; // number of heads } transformed data {} parameters { real<lower=0, upper=1> p; } transformed parameters {} model { p ~ beta(2, 2); y ~ binomial(n, p); } generated quantities {} """ coin_dat = { 'n': 100, 'y': 61, } fit = pystan.stan(model_code=coin_code, data=coin_dat, iter=1000, chains=1)
Loading from a file
The string in coin_code can also be in a file - say coin_code.stan
- then we can use it like so
fit = pystan.stan(file='coin_code.stan', data=coin_dat, iter=1000, chains=1)
print(fit)
Inference for Stan model: anon_model_7f1947cd2d39ae427cd7b6bb6e6ffd77. 1 chains, each with iter=1000; warmup=500; thin=1; post-warmup draws per chain=500, total post-warmup draws=500. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat p 0.61 4.9e-3 0.05 0.51 0.57 0.61 0.64 0.69 91.0 1.0 lp__ -70.22 0.06 0.66 -71.79 -70.43 -69.97 -69.79 -69.74 134.0 1.0 Samples were drawn using NUTS(diag_e) at Wed Mar 18 08:54:14 2015. For each parameter, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat=1).
coin_dict = fit.extract() coin_dict.keys() # lp_ is the log posterior
[u'mu', u'sigma', u'lp__']
fit.plot('p'); plt.tight_layout()
Estimating mean and standard deviation of normal distribution
\[X \sim \mathcal{N}(\mu, \sigma^2)\]
norm_code = """ data { int<lower=0> n; real y[n]; } transformed data {} parameters { real<lower=0, upper=100> mu; real<lower=0, upper=10> sigma; } transformed parameters {} model { y ~ normal(mu, sigma); } generated quantities {} """ norm_dat = { 'n': 100, 'y': np.random.normal(10, 2, 100), } fit = pystan.stan(model_code=norm_code, data=norm_dat, iter=1000, chains=1)
print fit
Inference for Stan model: anon_model_3318343d5265d1b4ebc1e443f0228954. 1 chains, each with iter=1000; warmup=500; thin=1; post-warmup draws per chain=500, total post-warmup draws=500. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat mu 10.09 0.02 0.19 9.72 9.97 10.09 10.22 10.49 120.0 1.0 sigma 2.02 0.01 0.15 1.74 1.92 2.01 2.12 2.32 119.0 1.01 lp__ -117.2 0.11 1.08 -120.0 -117.5 -116.8 -116.4 -116.2 105.0 1.0 Samples were drawn using NUTS(diag_e) at Wed Mar 18 08:54:50 2015. For each parameter, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat=1).
trace = fit.extract()
plt.figure(figsize=(10,4)) plt.subplot(1,2,1); plt.hist(trace['mu'][:], 25, histtype='step'); plt.subplot(1,2,2); plt.hist(trace['sigma'][:], 25, histtype='step');
Optimization (finding MAP)
sm = pystan.StanModel(model_code=norm_code) op = sm.optimizing(data=norm_dat) op
OrderedDict([(u'mu', array(10.3016473417206)), (u'sigma', array(1.8823589782831152))])
Reusing fitted objects
new_dat = { 'n': 100, 'y': np.random.normal(10, 2, 100), }
fit2 = pystan.stan(fit=fit, data=new_dat, chains=1)
print fit2
Inference for Stan model: anon_model_3318343d5265d1b4ebc1e443f0228954. 1 chains, each with iter=2000; warmup=1000; thin=1; post-warmup draws per chain=1000, total post-warmup draws=1000. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat mu 9.89 0.01 0.19 9.54 9.76 9.9 10.02 10.27 250.0 1.0 sigma 1.99 9.3e-3 0.15 1.72 1.89 1.98 2.07 2.33 250.0 1.0 lp__ -115.4 0.08 1.01 -118.1 -115.8 -115.1 -114.7 -114.5 153.0 1.0 Samples were drawn using NUTS(diag_e) at Wed Mar 18 08:58:32 2015. For each parameter, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat=1).
Saving compiled models
We can also compile Stan models and save them to file, so as to reload them for later use without needing to recompile.
def save(obj, filename): """Save compiled models for reuse.""" import pickle with open(filename, 'w') as f: pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL) def load(filename): """Reload compiled models for reuse.""" import pickle return pickle.load(open(filename, 'r'))
model = pystan.StanModel(model_code=norm_code) save(model, 'norm_model.pic')
new_model = load('norm_model.pic') fit4 = new_model.sampling(new_dat, chains=1) print fit4
Inference for Stan model: anon_model_3318343d5265d1b4ebc1e443f0228954. 1 chains, each with iter=2000; warmup=1000; thin=1; post-warmup draws per chain=1000, total post-warmup draws=1000. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat mu 9.91 0.01 0.2 9.5 9.78 9.91 10.05 10.3 283.0 1.0 sigma 2.0 9.3e-3 0.15 1.73 1.9 1.99 2.09 2.31 244.0 1.0 lp__ -115.5 0.08 1.03 -118.2 -115.8 -115.1 -114.8 -114.5 153.0 1.01 Samples were drawn using NUTS(diag_e) at Wed Mar 18 09:18:30 2015. For each parameter, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat=1).
Estimating parameters of a linear regreession model
We will show how to estimate regression parameters using a simple linear modesl
\[y \sim ax + b\]
We can restate the linear model
\[y = ax + b + \epsilon\]
as sampling from a probability distribution
\[y \sim \mathcal{N}(ax + b, \sigma^2)\]
We will assume the following priors
\[\begin{split}a \sim \mathcal{N}(0, 100) \\ b \sim \mathcal{N}(0, 100) \\ \sigma \sim \mathcal{U}(0, 20)\end{split}\]
lin_reg_code = """ data { int<lower=0> n; real x[n]; real y[n]; } transformed data {} parameters { real a; real b; real sigma; } transformed parameters { real mu[n]; for (i in 1:n) { mu[i] <- a*x[i] + b; } } model { sigma ~ uniform(0, 20); y ~ normal(mu, sigma); } generated quantities {} """ n = 11 _a = 6 _b = 2 x = np.linspace(0, 1, n) y = _a*x + _b + np.random.randn(n) lin_reg_dat = { 'n': n, 'x': x, 'y': y } fit = pystan.stan(model_code=lin_reg_code, data=lin_reg_dat, iter=1000, chains=1)
print fit
fit.plot(['a', 'b']); plt.tight_layout()
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论