
Permutation resampling

Published 2025-02-25 23:43:56

Permutation resampling is used to generate the null distribution of labeled data by shuffling the labels. Because the number of permutations grows so fast, it is typically only feasible to work with a Monte Carlo sample of the set of possible permutations.

# Growth of the factorial function (number of permutations) using Stirling's approximation
import numpy as np

def stirling(n):
    """Stirling's approximation to the factorial."""
    return np.sqrt(2*np.pi*n)*(n/np.e)**n

n = np.arange(1, 51)
list(zip(n, stirling(n)))
[(1, 0.9221),
 (2, 1.9190),
 (3, 5.8362),
 (4, 23.5062),
 (5, 118.0192),
 (6, 710.0782),
 (7, 4980.3958),
 (8, 39902.3955),
 (9, 359536.8728),
 (10, 3598695.6187),
 (11, 39615625.0506),
 (12, 475687486.4728),
 (13, 6187239475.1927),
 (14, 86661001740.5988),
 (15, 1300430722199.4680),
 (16, 20814114415223.1367),
 (17, 353948328666101.1250),
 (18, 6372804626194313.0000),
 (19, 121112786592294192.0000),
 (20, 2422786846761135104.0000),
 (21, 50888617325509746688.0000),
 (22, 1119751494628237770752.0000),
 (23, 25758525370529310834688.0000),
 (24, 618297927022794799841280.0000),
 (25, 15459594834691181359661056.0000),
 (26, 402000993060955330726330368.0000),
 (27, 10855315170319531497075245056.0000),
 (28, 303982326243341862218743414784.0000),
 (29, 8816392105377489957715009601536.0000),
 (30, 264517095922965156800687262138368.0000),
 (31, 8200764697241122458512884083195904.0000),
 (32, 262446514081933026899914856968749056.0000),
 (33, 8661418381417958431306228879169945600.0000),
 (34, 294510096099824346859521185203942850560.0000),
 (35, 10308575166584033336103974733365808988160.0000),
 (36, 371133249087415837775601850534254065221632.0000),
 (37, 13732789283357647537712986585599118967570432.0000),
 (38, 521876921190057472102855717030406837275459584.0000),
 (39, 20354344348300692639357061611803222912972881920.0000),
 (40, 814217264494623640116433847565750863601126604800.0000),
 (41, 33384604069916415591100983328410827967169870954496.0000),
 (42, 1402221223524367565643573793611811471275760323395584.0000),
 (43, 60298294706657120904833322706173396147148627061506048.0000),
 (44, 2653241820650555398466033792063490393009887673090834432.0000),
 (45, 119400906860443613913174999733586801766394224950600794112.0000),
 (46, 5492662822140225801570025546069546965027287393900087476224.0000),
 (47, 258165102848257451938594016029422319832480942931582618959872.0000),
 (48, 12392382664425307299255962057935500269489259106719931528380416.0000),
 (49, 607248264576510288229888440783681300390904221083157786483752960.0000),
 (50, 30363445939381680077841740787498028998394965264769813642857152512.0000)]
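As a quick sanity check (not part of the original notes), we can compare Stirling's approximation against the exact factorial; the relative error is already below 1% by n = 10, so the table above is a reliable guide to how quickly the number of permutations explodes.

```python
# Sanity check: Stirling's approximation against the exact factorial.
import math

import numpy as np

def stirling(n):
    """Stirling's approximation to the factorial."""
    return np.sqrt(2 * np.pi * n) * (n / np.e) ** n

for m in (5, 10, 20):
    exact = math.factorial(m)
    rel_err = abs(stirling(m) - exact) / exact
    print(m, exact, "relative error: %.4f" % rel_err)
```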

Suppose you have 2 data sets from unknown distributions and you want to test if some arbitrary statistic (e.g. the 7th percentile) is the same in the 2 data sets - what can you do?

An appropriate test statistic is the difference between the 7th percentiles, and if we knew the null distribution of this statistic, we could test the null hypothesis that the statistic = 0. Permuting the labels of the 2 data sets allows us to create the empirical null distribution.

import matplotlib.pyplot as plt

x = np.concatenate([np.random.exponential(size=200),
                    np.random.normal(0, 1, size=100)])
y = np.concatenate([np.random.exponential(size=250),
                    np.random.normal(0, 1, size=50)])
n1, n2 = map(len, (x, y))
reps = 10000

data = np.concatenate([x, y])
ps = np.array([np.random.permutation(n1+n2) for i in range(reps)])
xp = data[ps[:, :n1]]
yp = data[ps[:, n1:]]
samples = np.percentile(xp, 7, axis=1) - np.percentile(yp, 7, axis=1)
plt.hist(samples, 25, histtype='step')
test_stat = np.percentile(x, 7) - np.percentile(y, 7)
plt.axvline(test_stat)
plt.axvline(np.percentile(samples, 2.5), linestyle='--')
plt.axvline(np.percentile(samples, 97.5), linestyle='--')
print("p-value =", 2*np.sum(samples >= np.abs(test_stat))/reps)
p-value = 0.0124
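For comparison only (this is newer than these notes and not the method used above), recent SciPy versions (>= 1.7) package the same shuffle-and-recompute logic as scipy.stats.permutation_test:

```python
# Cross-check with SciPy's built-in permutation test (requires SciPy >= 1.7).
import numpy as np
from scipy.stats import permutation_test

rng = np.random.default_rng(42)
g1 = np.concatenate([rng.exponential(size=200), rng.normal(0, 1, size=100)])
g2 = np.concatenate([rng.exponential(size=250), rng.normal(0, 1, size=50)])

def diff_7th_percentile(a, b):
    # Test statistic: difference of the 7th percentiles of the two samples.
    return np.percentile(a, 7) - np.percentile(b, 7)

res = permutation_test((g1, g2), diff_7th_percentile,
                       permutation_type='independent',
                       n_resamples=2000, alternative='two-sided')
print("p-value =", res.pvalue)
```

With permutation_type='independent' the two samples are pooled and the labels reshuffled, exactly as in the hand-rolled loop above.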

We will make up some data - a typical example is trying to identify genes that are differentially expressed in two groups of people, perhaps those who are healthy and those who are sick. For each gene, we can perform a t-test to see if the gene is differentially expressed across the two groups at some nominal significance level, typically 0.05. When we have many genes, this is unsatisfactory since 5% of the genes will be found to be differentially expressed just by chance.

One possible solution is to control the family-wise error rate instead - most simply using the Bonferroni-adjusted p-value. An alternative is the non-parametric method originally proposed by Westfall and Young, which uses permutation resampling to estimate the adjusted p-values; unlike the Bonferroni correction, it exploits the dependence structure of the tests and so is not overly conservative when tests are correlated.

See http://www3.stat.sinica.edu.tw/statistica/oldpdf/A12n16.pdf for an overview of statistical procedures in the context of gene expression array analysis, including descriptions of the p-value adjustments shown here.

np.random.seed(52)

ngenes = 100
ncases = 500
nctrls = 500
nsamples = ncases + nctrls
x = np.random.normal(0, 1, (ngenes, nsamples))
import scipy.stats as st
t, p0 = st.ttest_ind(x[:, :ncases], x[:, ncases:], axis=1)
idx = p0 < 0.05
list(zip(np.nonzero(idx)[0], p0[idx]))
[(0, 0.0119),
 (10, 0.0368),
 (33, 0.0117),
 (36, 0.0144),
 (39, 0.0247),
 (44, 0.0051),
 (68, 0.0253),
 (97, 0.0366)]
vmin = x.min()
vmax = x.max()

plt.subplot(121)
plt.imshow(x[:, :ncases], extent=[0, 1, 0, 2], interpolation='nearest',
           vmin=vmin, vmax=vmax, cmap='jet')
plt.xticks([])
plt.yticks([])
plt.title('Controls')
plt.subplot(122)
plt.imshow(x[:, ncases:], extent=[0, 1, 0, 2], interpolation='nearest',
           vmin=vmin, vmax=vmax, cmap='jet')
plt.xticks([])
plt.yticks([])
plt.title('Cases')
plt.colorbar();

p1 = np.clip(ngenes * p0, 0, 1)
idx = p1 < 0.05
list(zip(np.nonzero(idx)[0], p1[idx]))
[]

The basic idea of resampling-based p-value adjustment is quite simple to understand. Suppose we want to know the adjusted p-value for the lowest observed p-value.

Repeat B times

  • Permute the case-control labels
  • Calculate the lowest p-value for the permuted data

The adjusted p-value is simply the number of permutation samples in which the lowest permuted p-value is smaller than the observed lowest p-value, divided by the number of permutations.
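The recipe above can be sketched self-contained on toy data (the sizes here are hypothetical, and the function name is ours; the notes' own implementation using the gene matrix follows below):

```python
# Single-step minP adjustment for the smallest observed p-value,
# wrapped in a function so it does not clobber variables used elsewhere.
import numpy as np
import scipy.stats as st

def minp_smallest(data, ncases, B=500, seed=1):
    """Adjusted p-value for the smallest raw p-value via label permutation."""
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    obs_min = st.ttest_ind(data[:, :ncases], data[:, ncases:], axis=1)[1].min()
    min_ps = np.empty(B)
    for i in range(B):
        perm = rng.permutation(n)  # shuffle the case/control labels
        z = data[:, perm]
        min_ps[i] = st.ttest_ind(z[:, :ncases], z[:, ncases:], axis=1)[1].min()
    # Fraction of permutations whose smallest p-value beats the observed one
    return np.mean(min_ps <= obs_min)

toy = np.random.default_rng(1).normal(0, 1, (20, 100))  # 20 genes, 50 vs 50
print("adjusted p-value =", minp_smallest(toy, ncases=50))
```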

The next-lowest adjusted p-value is more complicated, since we need to maintain the ordering (i.e. the second adjusted p-value must be larger than the smallest adjusted p-value), and one of several related algorithms developed by Westfall and Young is usually used. We will not cover this in this class, as the goal is to understand how permutation resampling works rather than the complexities of correcting for multiple testing, but an implementation is provided for those interested.

# Let's see if the smallest adjusted p-value is significant
k = 0
p0s = np.array(sorted(p0))
print("Gene\tUnadjusted p")
print(np.argsort(p0)[k], '\t', p0s[k])
Gene        Unadjusted p
44  0.00509946274404
# Do many permutations
nperms = 10000

ps = np.zeros(nperms)
for i in range(nperms):
    sidx = np.random.permutation(nsamples)
    y = x[:, sidx]
    pvals = st.ttest_ind(y[:, :ncases], y[:, ncases:], axis=1)[1]
    pvals.sort()
    ps[i] = pvals[k]

print("Gene\tUnadjusted p\t\tAdjusted p")
print(np.argsort(p0)[k], '\t', p0s[k], '\t', np.sum(ps < p0s[k])/nperms)
Gene        Unadjusted p            Adjusted p
44  0.00509946274404        0.4016
# This is the maxT step-down method
# Assumes that the distribution of T-statistics is the same for all genes

nperms = 10000
k = ngenes

counts = np.zeros((nperms, k))
ranks = np.argsort(np.abs(t))[::-1]
for i in range(nperms):
    u = np.zeros(k)
    sidx = np.random.permutation(nsamples)
    y = x[:, sidx]
    tb, pb = st.ttest_ind(y[:, :ncases], y[:, ncases:], axis=1)
    u[k-1] = np.abs(tb[ranks[k-1]])
    for j in range(k-2, -1, -1):
        u[j] = max(u[j+1], np.abs(tb[ranks[j]]))
    counts[i] = (u >= np.abs(t[ranks]))

p2 = np.sum(counts, axis=0)/nperms
for i in range(1, k):
    p2[i] = max(p2[i],p2[i-1])
idx = p2 < 0.05
list(zip(ranks[idx], p2[idx]))
[]
plt.plot(sorted(p0), label='No correction')
plt.plot(sorted(p1), label='Bonferroni')
plt.plot(sorted(p2), label='Westfall-Young')
plt.ylim([0,1])
plt.legend(loc='best');

The Bonferroni correction is only tight when the tests are independent. However, test results are often strongly correlated (e.g. genes in the same pathway behave similarly), and the Bonferroni correction is then far too conservative. The permutation-resampling method, by contrast, still works in the presence of correlations.

np.random.seed(52)

ngenes = 100
ncases = 500
nctrls = 500
nsamples = ncases + nctrls

x = np.repeat(np.random.normal(0, 1, (1, nsamples)), ngenes, axis=0)
# In this extreme case, we measure the same gene 100 times
x[:5, :5]
array([[0.519, -1.269, 0.240, -0.804, 0.017],
       [0.519, -1.269, 0.240, -0.804, 0.017],
       [0.519, -1.269, 0.240, -0.804, 0.017],
       [0.519, -1.269, 0.240, -0.804, 0.017],
       [0.519, -1.269, 0.240, -0.804, 0.017]])
t, p0 = st.ttest_ind(x[:, :ncases], x[:, ncases:], axis=1)
idx = p0 < 0.05
print('Minimum p-value', p0.min(), '# significant', idx.sum())
Minimum p-value 0.0119317780363 # significant 100

Bonferroni tells us none of the adjusted p-values are significant, which we know is the wrong answer.

p1 = np.clip(len(p0) * p0, 0, 1)
idx = p1 < 0.05
print('Minimum p-value', p1.min(), '# significant', idx.sum())
Minimum p-value 1.0 # significant 0

The permutation-resampling method, in contrast, tells us that every gene is significant, which is the correct answer.

nperms = 10000

counts = np.zeros((nperms, k))
ranks = np.argsort(np.abs(t))[::-1]
for i in range(nperms):
    u = np.zeros(k)
    sidx = np.random.permutation(nsamples)
    y = x[:, sidx]
    tb, pb = st.ttest_ind(y[:, :ncases], y[:, ncases:], axis=1)
    u[k-1] = np.abs(tb[ranks[k-1]])
    for j in range(k-2, -1, -1):
        u[j] = max(u[j+1], np.abs(tb[ranks[j]]))
    counts[i] = (u >= np.abs(t[ranks]))

p2 = np.sum(counts, axis=0)/nperms
for i in range(1, k):
    p2[i] = max(p2[i],p2[i-1])
idx = p2 < 0.05

print('Minimum p-value', p2.min(), '# significant', idx.sum())
Minimum p-value 0.0118 # significant 100
plt.plot(sorted(p1), label='Bonferroni')
plt.plot(sorted(p2), label='Westfall-Young')
plt.ylim([-0.05,1.05])
plt.legend(loc='best');

