- Introduction to Python
- Getting started with Python and the IPython notebook
- Functions are first class objects
- Data science is OSEMN
- Working with text
- Preprocessing text data
- Working with structured data
- Using SQLite3
- Using HDF5
- Using numpy
- Using Pandas
- Computational problems in statistics
- Computer numbers and mathematics
- Algorithmic complexity
- Linear Algebra and Linear Systems
- Linear Algebra and Matrix Decompositions
- Change of Basis
- Optimization and Non-linear Methods
- Practical Optimization Routines
- Finding roots
- Optimization Primer
- Using scipy.optimize
- Gradient descent
- Newton’s method and variants
- Constrained optimization
- Curve fitting
- Finding parameters for ODE models
- Optimization of graph node placement
- Optimization of standard statistical models
- Fitting ODEs with the Levenberg–Marquardt algorithm
- 1D example
- 2D example
- Algorithms for Optimization and Root Finding for Multivariate Problems
- Expectation Maximization (EM) Algorithm
- Monte Carlo Methods
- Resampling methods
- Resampling
- Simulations
- Setting the random seed
- Sampling with and without replacement
- Calculation of Cook’s distance
- Permutation resampling
- Design of simulation experiments
- Example: Simulations to estimate power
- Check with R
- Estimating the CDF
- Estimating the PDF
- Kernel density estimation
- Multivariate kernel density estimation
- Markov Chain Monte Carlo (MCMC)
- Using PyMC2
- Using PyMC3
- Using PyStan
- C Crash Course
- Code Optimization
- Using C code in Python
- Using functions from various compiled languages in Python
- Julia and Python
- Converting Python Code to C for speed
- Optimization bake-off
- Writing Parallel Code
- Massively parallel programming with GPUs
- Writing CUDA in C
- Distributed computing for Big Data
- Hadoop MapReduce on AWS EMR with mrjob
- Spark on a local machine using 4 nodes
- Modules and Packaging
- Tour of the Jupyter (IPython3) notebook
- Polyglot programming
- What you should know and learn more about
- Wrapping R libraries with Rpy
Permutation resampling
Permutation resampling is used to generate the null distribution of labeled data by switching the labels. Because the number of permutations grows so fast, it is typically only feasible to use a Monte Carlo sample of the possible set of permutations in computation.
# Growth of the factorial function (number of permutations) using Stirling's approximation
from __future__ import division   # true division, needed for the p-value calculations below
import numpy as np

def stirling(n):
    """Stirling's approximation to the factorial."""
    return np.sqrt(2*np.pi*n)*(n/np.e)**n

n = np.arange(1, 51)
zip(n, stirling(n))
[(1, 0.9221), (2, 1.9190), (3, 5.8362), (4, 23.5062), (5, 118.0192), (6, 710.0782), (7, 4980.3958), (8, 39902.3955), (9, 359536.8728), (10, 3598695.6187), (11, 39615625.0506), (12, 475687486.4728), (13, 6187239475.1927), (14, 86661001740.5988), (15, 1300430722199.4680), (16, 20814114415223.1367), (17, 353948328666101.1250), (18, 6372804626194313.0000), (19, 121112786592294192.0000), (20, 2422786846761135104.0000), (21, 50888617325509746688.0000), (22, 1119751494628237770752.0000), (23, 25758525370529310834688.0000), (24, 618297927022794799841280.0000), (25, 15459594834691181359661056.0000), (26, 402000993060955330726330368.0000), (27, 10855315170319531497075245056.0000), (28, 303982326243341862218743414784.0000), (29, 8816392105377489957715009601536.0000), (30, 264517095922965156800687262138368.0000), (31, 8200764697241122458512884083195904.0000), (32, 262446514081933026899914856968749056.0000), (33, 8661418381417958431306228879169945600.0000), (34, 294510096099824346859521185203942850560.0000), (35, 10308575166584033336103974733365808988160.0000), (36, 371133249087415837775601850534254065221632.0000), (37, 13732789283357647537712986585599118967570432.0000), (38, 521876921190057472102855717030406837275459584.0000), (39, 20354344348300692639357061611803222912972881920.0000), (40, 814217264494623640116433847565750863601126604800.0000), (41, 33384604069916415591100983328410827967169870954496.0000), (42, 1402221223524367565643573793611811471275760323395584.0000), (43, 60298294706657120904833322706173396147148627061506048.0000), (44, 2653241820650555398466033792063490393009887673090834432.0000), (45, 119400906860443613913174999733586801766394224950600794112.0000), (46, 5492662822140225801570025546069546965027287393900087476224.0000), (47, 258165102848257451938594016029422319832480942931582618959872.0000), (48, 12392382664425307299255962057935500269489259106719931528380416.0000), (49, 607248264576510288229888440783681300390904221083157786483752960.0000), (50, 30363445939381680077841740787498028998394965264769813642857152512.0000)]
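Even n = 20 observations already have roughly 2.4 x 10^18 orderings, so enumerating every relabeling is only realistic for very small samples. The sketch below (the toy data and the statistic are made up for illustration, not part of the original notes) contrasts exact enumeration of a permutation null with a Monte Carlo sample of random permutations.

# Illustrative sketch: exact enumeration vs. Monte Carlo sampling of permutations
import itertools
import numpy as np

small = np.array([1.1, 2.3, 0.7, 1.9])     # 4 observations, 2 per group
exact_null = [p[0] + p[1] - p[2] - p[3]     # all 4! = 24 relabelings
              for p in itertools.permutations(small)]

large = np.random.normal(0, 1, 30)          # 30 observations: 30! permutations is infeasible
reps = 1000
mc_null = np.empty(reps)
for i in range(reps):
    perm = large[np.random.permutation(len(large))]
    mc_null[i] = perm[:15].mean() - perm[15:].mean()   # Monte Carlo sample of the null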
Suppose you have 2 data sets from unknown distributions and you want to test if some arbitrary statistic (e.g. the 7th percentile) is the same in the 2 data sets - what can you do?
An appropriate test statistic is the difference between the 7th percentiles, and if we knew the null distribution of this statistic, we could test the null hypothesis that the statistic = 0. Permuting the labels of the 2 data sets allows us to create the empirical null distribution.
x = np.concatenate([np.random.exponential(size=200),
                    np.random.normal(0, 1, size=100)])
y = np.concatenate([np.random.exponential(size=250),
                    np.random.normal(0, 1, size=50)])
n1, n2 = map(len, (x, y))
reps = 10000
data = np.concatenate([x, y])
ps = np.array([np.random.permutation(n1+n2) for i in range(reps)])
xp = data[ps[:, :n1]]
yp = data[ps[:, n1:]]
samples = np.percentile(xp, 7, axis=1) - np.percentile(yp, 7, axis=1)
import matplotlib.pyplot as plt

plt.hist(samples, 25, histtype='step')
test_stat = np.percentile(x, 7) - np.percentile(y, 7)
plt.axvline(test_stat)
plt.axvline(np.percentile(samples, 2.5), linestyle='--')
plt.axvline(np.percentile(samples, 97.5), linestyle='--')
print "p-value =", 2*np.sum(samples >= np.abs(test_stat))/reps
p-value = 0.0124
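The same recipe works for any statistic, so it can be convenient to wrap it in a small helper. The function below is a sketch along those lines (the name permutation_test and its interface are ours, not from the notes); it returns a two-sided Monte Carlo p-value by comparing |null| against |observed| rather than doubling the one-sided tail as above.

import numpy as np

def permutation_test(x, y, statistic, reps=10000):
    """Monte Carlo permutation test for statistic(x) - statistic(y).

    statistic maps a 1-d array to a scalar, e.g. lambda a: np.percentile(a, 7).
    Returns a two-sided p-value based on |null| >= |observed|.
    """
    n1 = len(x)
    data = np.concatenate([x, y])
    observed = statistic(x) - statistic(y)
    null = np.empty(reps)
    for i in range(reps):
        perm = data[np.random.permutation(len(data))]
        null[i] = statistic(perm[:n1]) - statistic(perm[n1:])
    return np.mean(np.abs(null) >= np.abs(observed))

# usage: permutation_test(x, y, lambda a: np.percentile(a, 7))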
We will make up some data - a typical example is trying to identify genes that are differentially expressed in two groups of people, perhaps those who are healthy and those who are sick. For each gene, we can perform a t-test to see if the gene is differentially expressed across the two groups at some nominal significance level, typically 0.05. When we test many genes, this is unsatisfactory, since about 5% of the genes will be found to be differentially expressed just by chance, as the quick check below illustrates.
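With 100 independent tests at the 0.05 level we expect around 5 false positives, and it is nearly certain that at least one gene will be flagged even when nothing is differentially expressed. A back-of-the-envelope check (illustrative numbers only, not part of the original notes):

ntests, alpha = 100, 0.05
expected_false_positives = ntests * alpha    # about 5 genes flagged by chance alone
prob_at_least_one = 1 - (1 - alpha)**ntests  # ~0.994 if the tests were independent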
One possible solution is to use the family-wise error rate instead - most simply via the Bonferroni-adjusted p-value. An alternative is the non-parametric method originally proposed by Westfall and Young, which uses permutation resampling to estimate the adjusted p-values while taking the correlation structure of the tests into account, unlike the Bonferroni adjustment.
See http://www3.stat.sinica.edu.tw/statistica/oldpdf/A12n16.pdf for an overview of statistical procedures in the context of gene expression array analysis, including descriptions of the p-value adjustments shown here.
np.random.seed(52)
ngenes = 100
ncases = 500
nctrls = 500
nsamples = ncases + nctrls
x = np.random.normal(0, 1, (ngenes, nsamples))
import scipy.stats as st
t, p0 = st.ttest_ind(x[:, :ncases], x[:, ncases:], axis=1)
idx = p0 < 0.05
zip(np.nonzero(idx)[0], p0[idx])
[(0, 0.0119), (10, 0.0368), (33, 0.0117), (36, 0.0144), (39, 0.0247), (44, 0.0051), (68, 0.0253), (97, 0.0366)]
vmin = x.min()
vmax = x.max()
plt.subplot(121)
plt.imshow(x[:, :ncases], extent=[0, 1, 0, 2], interpolation='nearest',
           vmin=vmin, vmax=vmax, cmap='jet')
plt.xticks([])
plt.yticks([])
plt.title('Controls')
plt.subplot(122)
plt.imshow(x[:, ncases:], extent=[0, 1, 0, 2], interpolation='nearest',
           vmin=vmin, vmax=vmax, cmap='jet')
plt.xticks([])
plt.yticks([])
plt.title('Cases')
plt.colorbar();
# Bonferroni adjustment: multiply by the number of tests and clip at 1
p1 = np.clip(ngenes * p0, 0, 1)
idx = p1 < 0.05
zip(np.nonzero(idx)[0], p1[idx])
[]
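As a cross-check, standard family-wise adjustments (Bonferroni, Holm, and others) are also available in statsmodels. Assuming statsmodels is installed and using p0 and ngenes from the cells above, something along these lines should agree with the manual clipped ngenes * p0 adjustment:

# Cross-check of the manual Bonferroni adjustment (assumes statsmodels is installed)
from statsmodels.stats.multitest import multipletests

reject, p_bonf, _, _ = multipletests(p0, alpha=0.05, method='bonferroni')
np.allclose(p_bonf, np.clip(ngenes * p0, 0, 1))   # expected to evaluate to True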
The basic idea of resampling-based p-value adjustment is quite simple to understand. Suppose we want to know the adjusted p-value for the lowest observed p-value.
Repeat B times:
- Permute the case-control labels
- Calculate the lowest p-value for the permuted data
The adjusted p-value is simply the number of permutation samples in which the lowest permuted p-value is smaller than the observed lowest p-value, divided by the number of permutations.
The next lowest adjusted p-value is more complicated, since we need to maintain the ordering (i.e. the second-smallest adjusted p-value must be at least as large as the smallest one), and one of several related algorithms developed by Westfall and Young is usually used. We will not cover these in this class, since the goal is to understand how permutation resampling works rather than the complexities of correcting for multiple testing, but an implementation is provided for those interested.
# Let's see if the smallest adjusted p-value is significant
k = 0
p0s = np.array(sorted(p0))
print "Gene\tUnadjusted p"
print np.argsort(p0)[k], '\t', p0s[k]
Gene	Unadjusted p
44 	0.00509946274404
# Do many permutations
nperms = 10000
ps = np.zeros(nperms)
for i in range(nperms):
    sidx = np.random.permutation(nsamples)
    y = x[:, sidx]
    pvals = st.ttest_ind(y[:, :ncases], y[:, ncases:], axis=1)[1]
    pvals.sort()
    ps[i] = pvals[k]
print "Gene\tUnadjusted p\t\tAdjusted p"
print np.argsort(p0)[k], '\t', p0s[k], '\t', np.sum(ps < p0s[k])/nperms
Gene	Unadjusted p		Adjusted p
44 	0.00509946274404 	0.4016
# This is the maxT step-down method
# Assumes that the distribution of T-statistics is the same for all genes
nperms = 10000
k = ngenes
counts = np.zeros((nperms, k))
ranks = np.argsort(np.abs(t))[::-1]
for i in range(nperms):
    u = np.zeros(k)
    sidx = np.random.permutation(nsamples)
    y = x[:, sidx]
    tb, pb = st.ttest_ind(y[:, :ncases], y[:, ncases:], axis=1)
    u[k-1] = np.abs(tb[ranks[k-1]])
    for j in range(k-2, -1, -1):
        u[j] = max(u[j+1], np.abs(tb[ranks[j]]))
    counts[i] = (u >= np.abs(t[ranks]))
p2 = np.sum(counts, axis=0)/nperms
for i in range(1, k):
    p2[i] = max(p2[i], p2[i-1])
idx = p2 < 0.05
zip(ranks, p2[idx])
[]
plt.plot(sorted(p0), label='No correction')
plt.plot(sorted(p1), label='Bonferroni')
plt.plot(sorted(p2), label='Westfall-Young')
plt.ylim([0,1])
plt.legend(loc='best');
The Bonferroni correction effectively treats the tests as if they were independent. However, test results are often strongly correlated (e.g. genes in the same pathway behave similarly), and in that case the Bonferroni correction is far too conservative. The permutation-resampling method, in contrast, still works in the presence of correlations.
np.random.seed(52)
ngenes = 100
ncases = 500
nctrls = 500
nsamples = ncases + nctrls
x = np.repeat(np.random.normal(0, 1, (1, nsamples)), ngenes, axis=0)
# In this extreme case, we measure the same gene 100 times
x[:5, :5]
array([[ 0.519, -1.269,  0.240, -0.804,  0.017],
       [ 0.519, -1.269,  0.240, -0.804,  0.017],
       [ 0.519, -1.269,  0.240, -0.804,  0.017],
       [ 0.519, -1.269,  0.240, -0.804,  0.017],
       [ 0.519, -1.269,  0.240, -0.804,  0.017]])
t, p0 = st.ttest_ind(x[:, :ncases], x[:, ncases:], axis=1)
idx = p0 < 0.05
print 'Minimum p-value', p0.min(), '# significant', idx.sum()
Minimum p-value 0.0119317780363 # significant 100
Bonferroni tells us that none of the adjusted p-values is significant. Since all 100 "genes" are really the same measurement, the effective number of tests is 1 and the adjusted p-value should equal the unadjusted one, so we know this is the wrong answer.
p1 = np.clip(len(p0) * p0, 0, 1)
idx = p1 < 0.05
print 'Minimum p-value', p1.min(), '# significant', idx.sum()
Minimum p-value 1.0 # significant 0
The Westfall-Young permutation adjustment, in contrast, tells us that every gene is significant, with adjusted p-values essentially equal to the unadjusted ones - which is the correct answer.
# Westfall-Young maxT step-down adjustment (as above), now applied to the correlated data
# k (= ngenes) is reused from the maxT cell above
nperms = 10000
counts = np.zeros((nperms, k))
ranks = np.argsort(np.abs(t))[::-1]
for i in range(nperms):
    u = np.zeros(k)
    sidx = np.random.permutation(nsamples)
    y = x[:, sidx]
    tb, pb = st.ttest_ind(y[:, :ncases], y[:, ncases:], axis=1)
    u[k-1] = np.abs(tb[ranks[k-1]])
    for j in range(k-2, -1, -1):
        u[j] = max(u[j+1], np.abs(tb[ranks[j]]))
    counts[i] = (u >= np.abs(t[ranks]))
p2 = np.sum(counts, axis=0)/nperms
for i in range(1, k):
    p2[i] = max(p2[i], p2[i-1])
idx = p2 < 0.05
print 'Minimum p-value', p2.min(), '# significant', idx.sum()
Minimum p-value 0.0118 # significant 100
plt.plot(sorted(p1), label='Bonferroni')
plt.plot(sorted(p2), label='Westfall-Young')
plt.ylim([-0.05,1.05])
plt.legend(loc='best');