
Computational Statistics in Python


References

Published 2025-02-25 23:43:58

Coin toss

We’ll repeat the example of determining the bias of a coin from observed coin tosses. The likelihood is binomial, and we use a beta prior.

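import numpy as np
import matplotlib.pyplot as plt
import pystan  # PyStan 2.x API (pystan.stan, pystan.StanModel)

coin_code = """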
data {
    int<lower=0> n; // number of tosses
    int<lower=0> y; // number of heads
}
transformed data {}
parameters {
    real<lower=0, upper=1> p;
}
transformed parameters {}
model {
    p ~ beta(2, 2);
    y ~ binomial(n, p);
}
generated quantities {}
"""

coin_dat = {
             'n': 100,
             'y': 61,
            }

fit = pystan.stan(model_code=coin_code, data=coin_dat, iter=1000, chains=1)

Loading from a file

The Stan program in coin_code can also be stored in a file, say coin_code.stan, and passed to pystan.stan through the file argument:

fit = pystan.stan(file='coin_code.stan', data=coin_dat, iter=1000, chains=1)

print(fit)

Inference for Stan model: anon_model_7f1947cd2d39ae427cd7b6bb6e6ffd77.
1 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=500.

       mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
p      0.61  4.9e-3   0.05   0.51   0.57   0.61   0.64   0.69   91.0    1.0
lp__ -70.22    0.06   0.66 -71.79 -70.43 -69.97 -69.79 -69.74  134.0    1.0

Samples were drawn using NUTS(diag_e) at Wed Mar 18 08:54:14 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
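
Because the beta prior is conjugate to the binomial likelihood, this posterior is also available in closed form: a Beta(2, 2) prior combined with 61 heads in 100 tosses gives a Beta(63, 41) posterior. The sketch below uses only the standard library to compute its mean and standard deviation as a check on the sampler summary above. The helper name beta_mean_sd is ours and is not part of PyStan.

```python
from __future__ import division  # true division under Python 2 as well
from math import sqrt

def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)

n, y = 100, 61                  # data from coin_dat
a_prior, b_prior = 2, 2         # beta(2, 2) prior from the Stan model
a_post = a_prior + y            # prior "heads" plus observed heads
b_post = b_prior + (n - y)      # prior "tails" plus observed tails

post_mean, post_sd = beta_mean_sd(a_post, b_post)
print("Beta(%d, %d): mean=%.3f, sd=%.3f" % (a_post, b_post, post_mean, post_sd))
```

The analytic mean and standard deviation round to 0.61 and 0.05, in line with the mean and sd columns reported for p.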

coin_dict = fit.extract()
coin_dict.keys()
# lp__ is the log posterior (up to an additive constant)

[u'p', u'lp__']

fit.plot('p');
plt.tight_layout()
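
The dictionary returned by fit.extract() maps each parameter name to a NumPy array of post-warmup draws, so posterior summaries reduce to array operations. The following is a minimal sketch, assuming NumPy is available. To keep it runnable without Stan, the draws are simulated from the closed-form Beta(63, 41) posterior as a stand-in for coin_dict['p'], and summarize is a hypothetical helper.

```python
import numpy as np

def summarize(draws, level=0.95):
    """Posterior mean and central credible interval from 1-D draws."""
    draws = np.asarray(draws, dtype=float)
    tail = 100 * (1 - level) / 2.0
    lo, hi = np.percentile(draws, [tail, 100 - tail])
    return draws.mean(), lo, hi

# Stand-in for coin_dict['p']: draws from the analytic posterior.
rng = np.random.RandomState(0)
p_draws = rng.beta(63, 41, size=5000)

mean, lo, hi = summarize(p_draws)
print("mean=%.2f, 95%% interval=(%.2f, %.2f)" % (mean, lo, hi))
```

With real output, replacing p_draws by coin_dict['p'] should give numbers comparable to the mean, 2.5% and 97.5% columns printed by the fit.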

Estimating mean and standard deviation of normal distribution

\[X \sim \mathcal{N}(\mu, \sigma^2)\]

norm_code = """
data {
    int<lower=0> n;
    real y[n];
}
transformed data {}
parameters {
    real<lower=0, upper=100> mu;
    real<lower=0, upper=10> sigma;
}
transformed parameters {}
model {
    y ~ normal(mu, sigma);
}
generated quantities {}
"""

norm_dat = {
             'n': 100,
             'y': np.random.normal(10, 2, 100),
            }

fit = pystan.stan(model_code=norm_code, data=norm_dat, iter=1000, chains=1)

print(fit)

Inference for Stan model: anon_model_3318343d5265d1b4ebc1e443f0228954.
1 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=500.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu     10.09    0.02   0.19   9.72   9.97  10.09  10.22  10.49  120.0    1.0
sigma   2.02    0.01   0.15   1.74   1.92   2.01   2.12   2.32  119.0   1.01
lp__  -117.2    0.11   1.08 -120.0 -117.5 -116.8 -116.4 -116.2  105.0    1.0

Samples were drawn using NUTS(diag_e) at Wed Mar 18 08:54:50 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
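
The Rhat column can be reproduced approximately from the raw draws. The sketch below implements the classic split-chain potential scale reduction factor for a single chain by treating the two halves of the chain as separate chains. It illustrates the idea and is not Stan's exact implementation; split_rhat is a hypothetical helper, and NumPy is assumed. The draws must be in sampling order, which in PyStan 2 should correspond to fit.extract(permuted=False) rather than the default permuted output.

```python
import numpy as np

def split_rhat(chain):
    """Split-chain potential scale reduction factor for one 1-D chain."""
    chain = np.asarray(chain, dtype=float)
    half = len(chain) // 2
    # Treat the first and second half as two separate chains.
    halves = np.array([chain[:half], chain[half:2 * half]])
    n = halves.shape[1]
    within = halves.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n * halves.mean(axis=1).var(ddof=1)    # B: between-chain variance
    var_hat = (n - 1.0) / n * within + between / n   # pooled variance estimate
    return np.sqrt(var_hat / within)

rng = np.random.RandomState(1)
stationary = rng.normal(10, 0.2, size=500)           # well-mixed chain
drifting = stationary + np.linspace(0, 2, 500)       # chain with a trend

print("stationary: %.3f" % split_rhat(stationary))
print("drifting:   %.3f" % split_rhat(drifting))
```

A chain whose two halves agree gives a value near 1, while the drifting chain gives a value well above 1, which is the signal that sampling has not converged.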

trace = fit.extract()

plt.figure(figsize=(10,4))
plt.subplot(1,2,1);
plt.hist(trace['mu'][:], 25, histtype='step');
plt.subplot(1,2,2);
plt.hist(trace['sigma'][:], 25, histtype='step');

Optimization (finding MAP)

sm = pystan.StanModel(model_code=norm_code)
op = sm.optimizing(data=norm_dat)
op

OrderedDict([(u'mu', array(10.3016473417206)), (u'sigma', array(1.8823589782831152))])
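
With only the flat priors implied by the parameter bounds in norm_code, the posterior mode coincides with the maximum likelihood estimate whenever that estimate lies inside the bounds. The result of optimizing can therefore be checked against the closed-form MLE: the sample mean and the divide-by-n sample standard deviation. The following is a minimal NumPy sketch on freshly simulated data. The seed and helper names are our choices, so the numbers will differ from the output above.

```python
import numpy as np

def normal_mle(y):
    """Closed-form maximum likelihood estimates for a normal sample."""
    y = np.asarray(y, dtype=float)
    mu_hat = y.mean()
    sigma_hat = y.std(ddof=0)   # divide by n, not n - 1
    return mu_hat, sigma_hat

def normal_loglik(y, mu, sigma):
    """Normal log likelihood, used to confirm the MLE is a maximum."""
    y = np.asarray(y, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (y - mu) ** 2 / (2 * sigma ** 2))

rng = np.random.RandomState(42)
y = rng.normal(10, 2, 100)      # same generating process as norm_dat
mu_hat, sigma_hat = normal_mle(y)
print("mu=%.3f, sigma=%.3f" % (mu_hat, sigma_hat))

# Moving away from the MLE in either parameter lowers the log likelihood.
best = normal_loglik(y, mu_hat, sigma_hat)
print(best >= normal_loglik(y, mu_hat + 0.1, sigma_hat))
print(best >= normal_loglik(y, mu_hat, sigma_hat * 1.1))
```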

Reusing fitted objects

new_dat = {
             'n': 100,
             'y': np.random.normal(10, 2, 100),
            }

fit2 = pystan.stan(fit=fit, data=new_dat, chains=1)

print(fit2)

Inference for Stan model: anon_model_3318343d5265d1b4ebc1e443f0228954.
1 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=1000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu      9.89    0.01   0.19   9.54   9.76    9.9  10.02  10.27  250.0    1.0
sigma   1.99  9.3e-3   0.15   1.72   1.89   1.98   2.07   2.33  250.0    1.0
lp__  -115.4    0.08   1.01 -118.1 -115.8 -115.1 -114.7 -114.5  153.0    1.0

Samples were drawn using NUTS(diag_e) at Wed Mar 18 08:58:32 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Saving compiled models

We can also compile Stan models and save them to file, so as to reload them for later use without needing to recompile.

import pickle

def save(obj, filename):
    """Save a compiled model for reuse."""
    # Pickle data is binary, so the file must be opened in 'wb' mode.
    with open(filename, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load(filename):
    """Reload a previously saved compiled model."""
    with open(filename, 'rb') as f:
        return pickle.load(f)

model = pystan.StanModel(model_code=norm_code)
save(model, 'norm_model.pic')

new_model = load('norm_model.pic')
fit4 = new_model.sampling(new_dat, chains=1)
print(fit4)

Inference for Stan model: anon_model_3318343d5265d1b4ebc1e443f0228954.
1 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=1000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu      9.91    0.01    0.2    9.5   9.78   9.91  10.05   10.3  283.0    1.0
sigma    2.0  9.3e-3   0.15   1.73    1.9   1.99   2.09   2.31  244.0    1.0
lp__  -115.5    0.08   1.03 -118.2 -115.8 -115.1 -114.8 -114.5  153.0   1.01

Samples were drawn using NUTS(diag_e) at Wed Mar 18 09:18:30 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
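
A common refinement is to key the saved file on a hash of the model code, so that the model is recompiled only when the Stan program changes. The sketch below is one possible pattern and is not part of PyStan: cached_model and its compile_fn argument are hypothetical names. In practice compile_fn would be something like lambda code: pystan.StanModel(model_code=code); a dummy compiler stands in here so that the example runs without Stan.

```python
import hashlib
import os
import pickle
import tempfile

def cached_model(code, compile_fn, cache_dir):
    """Return compile_fn(code), reusing a pickled copy when one exists."""
    digest = hashlib.md5(code.encode('utf-8')).hexdigest()
    path = os.path.join(cache_dir, 'model_%s.pkl' % digest)
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    model = compile_fn(code)
    with open(path, 'wb') as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)
    return model

# Dummy "compiler" that records how often it is called.
calls = []
def fake_compile(code):
    calls.append(code)
    return {'compiled': code}

cache_dir = tempfile.mkdtemp()
first = cached_model("model {}", fake_compile, cache_dir)
second = cached_model("model {}", fake_compile, cache_dir)
print(len(calls))   # 1: the second call was served from the cache
```

Changing even one character of the model code changes the digest and hence the file name, so a stale compiled model is never reused for a different program.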

Estimating parameters of a linear regression model

We will show how to estimate regression parameters using a simple linear model

\[y \sim ax + b\]

We can restate the linear model

\[y = ax + b + \epsilon\]

as sampling from a probability distribution

\[y \sim \mathcal{N}(ax + b, \sigma^2)\]

We will assume the following priors

\[\begin{split}a \sim \mathcal{N}(0, 100) \\ b \sim \mathcal{N}(0, 100) \\ \sigma \sim \mathcal{U}(0, 20)\end{split}\]

lin_reg_code = """
data {
    int<lower=0> n;
    real x[n];
    real y[n];
}
transformed data {}
parameters {
    real a;
    real b;
    real<lower=0, upper=20> sigma;  // bounds must match the uniform(0, 20) prior
}
transformed parameters {
    real mu[n];
    for (i in 1:n) {
        mu[i] <- a*x[i] + b;
    }
}
model {
    a ~ normal(0, 10);      // N(0, 100) prior, reading 100 as a variance (sd = 10)
    b ~ normal(0, 10);
    sigma ~ uniform(0, 20);
    y ~ normal(mu, sigma);
}
generated quantities {}
"""

n = 11
_a = 6
_b = 2
x = np.linspace(0, 1, n)
y = _a*x + _b + np.random.randn(n)

lin_reg_dat = {
             'n': n,
             'x': x,
             'y': y
            }

fit = pystan.stan(model_code=lin_reg_code, data=lin_reg_dat, iter=1000, chains=1)

print(fit)

fit.plot(['a', 'b']);
plt.tight_layout()
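
Because the priors on a and b are diffuse relative to the data, their posterior means should land near the ordinary least squares estimates, which gives a quick check on the sampler. The following is a minimal NumPy sketch that uses the same data-generating code as above with a fixed seed (our choice, for reproducibility).

```python
import numpy as np

rng = np.random.RandomState(0)
n, _a, _b = 11, 6, 2
x = np.linspace(0, 1, n)
y = _a * x + _b + rng.randn(n)

# Degree-1 polynomial fit returns [slope, intercept].
a_hat, b_hat = np.polyfit(x, y, 1)

# Residual standard deviation with n - 2 degrees of freedom.
resid = y - (a_hat * x + b_hat)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

print("a=%.2f, b=%.2f, sigma=%.2f" % (a_hat, b_hat, sigma_hat))
```

With only 11 noisy points the estimates can sit some distance from the true values of 6 and 2, and the posterior intervals reported by Stan for a and b should be correspondingly wide.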
