- Introduction to Python
- Getting started with Python and the IPython notebook
- Functions are first class objects
- Data science is OSEMN
- Working with text
- Preprocessing text data
- Working with structured data
- Using SQLite3
- Using HDF5
- Using numpy
- Using Pandas
- Computational problems in statistics
- Computer numbers and mathematics
- Algorithmic complexity
- Linear Algebra and Linear Systems
- Linear Algebra and Matrix Decompositions
- Change of Basis
- Optimization and Non-linear Methods
- Practical Optimization Routines
- Finding roots
- Optimization Primer
- Using scipy.optimize
- Gradient descent
- Newton’s method and variants
- Constrained optimization
- Curve fitting
- Finding parameters for ODE models
- Optimization of graph node placement
- Optimization of standard statistical models
- Fitting ODEs with the Levenberg–Marquardt algorithm
- 1D example
- 2D example
- Algorithms for Optimization and Root Finding for Multivariate Problems
- Expectation Maximization (EM) Algorithm
- Monte Carlo Methods
- Resampling methods
- Resampling
- Simulations
- Setting the random seed
- Sampling with and without replacement
- Calculation of Cook’s distance
- Permutation resampling
- Design of simulation experiments
- Example: Simulations to estimate power
- Check with R
- Estimating the CDF
- Estimating the PDF
- Kernel density estimation
- Multivariate kernel density estimation
- Markov Chain Monte Carlo (MCMC)
- Using PyMC2
- Using PyMC3
- Using PyStan
- C Crash Course
- Code Optimization
- Using C code in Python
- Using functions from various compiled languages in Python
- Julia and Python
- Converting Python Code to C for speed
- Optimization bake-off
- Writing Parallel Code
- Massively parallel programming with GPUs
- Writing CUDA in C
- Distributed computing for Big Data
- Hadoop MapReduce on AWS EMR with mrjob
- Spark on a local machine using 4 nodes
- Modules and Packaging
- Tour of the Jupyter (IPython3) notebook
- Polyglot programming
- What you should know and learn more about
- Wrapping R libraries with Rpy
Using statsmodels
Many of the basic statistical tools available in R are replicated in the statsmodels
package. We will only show one example.
```python
import numpy as np
from pandas import DataFrame

# Simulate the genotype for 4 SNPs in a case-control study
# using an additive genetic model
n = 1000
status = np.random.choice([0, 1], n)
genotype = np.random.choice([0, 1, 2], (n, 4))
genotype[status == 0] = np.random.choice([0, 1, 2], (sum(status == 0), 4),
                                         p=[0.33, 0.33, 0.34])
genotype[status == 1] = np.random.choice([0, 1, 2], (sum(status == 1), 4),
                                         p=[0.2, 0.3, 0.5])
df = DataFrame(np.hstack([status[:, np.newaxis], genotype]),
               columns=['status', 'SNP1', 'SNP2', 'SNP3', 'SNP4'])
df.head(6)
```
| | status | SNP1 | SNP2 | SNP3 | SNP4 |
|---|---|---|---|---|---|
| 0 | 0 | 2 | 1 | 2 | 0 |
| 1 | 1 | 1 | 0 | 2 | 2 |
| 2 | 1 | 0 | 1 | 2 | 1 |
| 3 | 1 | 2 | 2 | 1 | 2 |
| 4 | 1 | 1 | 2 | 0 | 1 |
| 5 | 1 | 0 | 0 | 1 | 2 |
```python
import statsmodels.api as sm

# Use statsmodels to fit a logistic regression to the data
fit1 = sm.Logit.from_formula('status ~ %s' % '+'.join(df.columns[1:]),
                             data=df).fit()
fit1.summary()
```
```
Optimization terminated successfully.
         Current function value: 0.642824
         Iterations 5
```
| Dep. Variable: | status | No. Observations: | 1000 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 995 |
| Method: | MLE | Df Model: | 4 |
| Date: | Thu, 22 Jan 2015 | Pseudo R-squ.: | 0.07259 |
| Time: | 15:34:43 | Log-Likelihood: | -642.82 |
| converged: | True | LL-Null: | -693.14 |
| | | LLR p-value: | 7.222e-21 |
| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -1.7409 | 0.203 | -8.560 | 0.000 | -2.140  -1.342 |
| SNP1 | 0.4306 | 0.083 | 5.173 | 0.000 | 0.267  0.594 |
| SNP2 | 0.3155 | 0.081 | 3.882 | 0.000 | 0.156  0.475 |
| SNP3 | 0.2255 | 0.082 | 2.750 | 0.006 | 0.065  0.386 |
| SNP4 | 0.5341 | 0.083 | 6.404 | 0.000 | 0.371  0.698 |
```python
from scipy.stats import chisqprob

# Alternative using GLM - similar to R
fit2 = sm.GLM.from_formula('status ~ SNP1 + SNP2 + SNP3 + SNP4',
                           data=df, family=sm.families.Binomial()).fit()
print(fit2.summary())
# Likelihood-ratio test of the full model against the null model
print(chisqprob(fit2.null_deviance - fit2.deviance, fit2.df_model))
print((fit2.null_deviance - fit2.deviance, fit2.df_model))
```
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                 status   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      995
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -642.82
Date:                Thu, 22 Jan 2015   Deviance:                       1285.6
Time:                        15:34:43   Pearson chi2:                 1.01e+03
No. Iterations:                     5
==============================================================================
                 coef    std err          t      P>|t|   [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -1.7409      0.203     -8.560      0.000      -2.140    -1.342
SNP1           0.4306      0.083      5.173      0.000       0.267     0.594
SNP2           0.3155      0.081      3.882      0.000       0.156     0.475
SNP3           0.2255      0.082      2.750      0.006       0.065     0.386
SNP4           0.5341      0.083      6.404      0.000       0.371     0.698
==============================================================================
7.22229516479e-21
(100.63019840179481, 4)
```
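`scipy.stats.chisqprob`, used above, has been removed from recent SciPy releases; the chi-squared survival function `chi2.sf` computes the identical upper-tail probability. A minimal check, plugging in the deviance difference and degrees of freedom printed in the output above:

```python
from scipy.stats import chi2

# chisqprob(x, df) was simply the chi-squared survival function;
# chi2.sf reproduces the LR-test p-value reported above
lr_stat, df_model = 100.63019840179481, 4
p_value = chi2.sf(lr_stat, df_model)
print(p_value)  # ~7.222e-21, matching the chisqprob output
```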