Why can't I get a p-value smaller than 2.2e-16?
I've found this issue with t-tests and chi-squared tests in R, but I assume it applies generally to other tests. If I do:

a <- 1:10
b <- 100:110
t.test(a,b)

I get: t = -64.6472, df = 18.998, p-value < 2.2e-16. I know from the comments that 2.2e-16 is the value of .Machine$double.eps, the smallest floating point number x such that 1 + x != 1, but of course R can represent numbers much smaller than that. I also know from the R FAQ that R has to round floats to 53 binary digits of accuracy: R FAQ.

A few questions: (1) Am I correct in reading that as 53 binary digits of precision, or are values in R smaller than .Machine$double.eps not calculated accurately? (2) Why, when doing such calculations, does R not provide a means to display a smaller value for the p-value, even with some loss of precision? (3) Is there a way to display a smaller p-value, even if I lose some precision? For a single test, 2 significant decimal figures would be fine; for values I am going to Bonferroni-correct, I'll need more. When I say "lose some precision" I mean fewer than 53 binary digits, but (4) am I completely mistaken, and is any p-value below .Machine$double.eps wildly inaccurate? (5) Is R just being honest where other stats packages are not?

In my field very small p-values are the norm; some examples: http://www.ncbi.nlm.nih.gov/pubmed/20154341, http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1002215. This is why I want to represent such small p-values.

Thanks for your help, and sorry for such a tortuous question.
Try something like

t.test(a,b)$p.value

and see if that gives you the accuracy you need. I believe it has more to do with the printing of the result than with the actual stored computed value, which should have the necessary precision.
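A quick check of that claim, reusing the OP's own vectors: the stored double is far below the printed "< 2.2e-16" cutoff.

```r
a <- 1:10
b <- 100:110
tt <- t.test(a, b)
# print(tt) shows "p-value < 2.2e-16", but the stored value is much smaller
tt$p.value
```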
Some R packages solve this issue. The best way is through package pspearman.
[1] 3.819961e-294
Recently had same problem. Fellow statistician recommends:
This is a popular question, but surprisingly no answer has mentioned using a logarithm representation as a solution.

In some research areas, notably bioinformatics (especially genomics, but increasingly other -omic fields), exact log10(p-values) are used to compare evidence against the null. Logs of p-values can be obtained in R for the common tests by passing log.p=TRUE to the appropriate cumulative distribution function.

t-test

For the t-test, you can evaluate the log-scale result against a naive computation of log10(p):
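A sketch of that idea, assuming a two-sided test and reusing the OP's vectors: pt(..., log.p = TRUE) gives the log of the one-tail probability, and adding log(2) doubles it for the two-sided p-value, all without ever underflowing.

```r
a <- 1:10
b <- 100:110
tt <- t.test(a, b)

# log10 of the two-sided p-value, computed entirely on the log scale
log10_p <- (pt(abs(tt$statistic), df = tt$parameter,
               lower.tail = FALSE, log.p = TRUE) + log(2)) / log(10)
unname(log10_p)
```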
Correlation

Pearson

Conveniently, this also uses a t-statistic, and the statistic, along with the degrees-of-freedom parameter, can be extracted directly from the cor.test result. Comparison:
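A sketch with simulated data (the data and variable names are mine): extract the t-statistic and df from cor.test, then compare against the naive log10 of the stored p-value.

```r
set.seed(1)
x <- rnorm(1000)
y <- 0.2 * x + rnorm(1000)
ct <- cor.test(x, y)  # Pearson by default; uses a t-statistic

# log10 p-value (two-sided) from the extracted statistic and df
log10_p <- (pt(abs(ct$statistic), df = ct$parameter,
               lower.tail = FALSE, log.p = TRUE) + log(2)) / log(10)

# comparison against the naive computation
c(unname(log10_p), log10(ct$p.value))
```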
Spearman

This one requires more manual work, since we need to compute the degrees of freedom (n - 2) and the statistic manually. If you are happy with the t-distribution approximation, you can compute the test statistic using the formula r * sqrt((n - 2) / (1 - r^2)) and re-use the same pt function. Comparison:
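A sketch of the Spearman case under the t-approximation (again with made-up data): compute the rank correlation, the statistic from the formula above, and the log10 p-value with the same pt call.

```r
set.seed(1)
x <- rnorm(200)
y <- x + rnorm(200, sd = 2)
n <- length(x)
r <- cor(x, y, method = "spearman")

# t-approximation to the Spearman test statistic
stat <- r * sqrt((n - 2) / (1 - r^2))
log10_p <- (pt(abs(stat), df = n - 2,
               lower.tail = FALSE, log.p = TRUE) + log(2)) / log(10)
log10_p
```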
I'm puzzled by several things in the exchange of answers and comments here.

First of all, when I try the OP's original example I don't get a p-value as small as the ones that are being debated here (several different 2.13.x versions and R-devel).

Second, when I make the difference between groups much bigger, I do in fact get the results suggested by @eWizardII.

The behavior of the printed output in t.test is driven by its call to stats:::print.htest (which is also called by other statistical testing functions such as chisq.test, as noted by the OP), which in turn calls format.pval, which presents p-values less than its eps value (.Machine$double.eps by default) as < eps. I'm surprised to find myself disagreeing with such generally astute commenters ...

Finally, although it seems silly to worry about the precise value of a very small p-value, the OP is correct that these values are often used as indices of strength of evidence in the bioinformatics literature -- for example, one might test 100,000 candidate genes and look at the distribution of resulting p-values (search for "volcano plot" for one example of this sort of procedure).
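A small illustration of that cutoff (the example p-value is mine): format.pval truncates anything below its eps, and the cutoff can be disabled by overriding eps.

```r
p <- 1e-100
format.pval(p)           # below the default eps, so printed as "< ..."
format.pval(p, eps = 0)  # disable the cutoff and show the value itself
```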
Two questions:

1) What possible difference in statistical implication would there be between p-values of 1e-16 and 1e-32? If you truly can justify it, then using the logged values is the way to go.

2) Why do you use Wikipedia when your interest is in the numerical accuracy of R?

The R FAQ says "Other [meaning non-integer] numbers have to be rounded to (typically) 53 binary digits accuracy." 16 digits is about the limit. This is how to get the limits of accuracy when at the console:
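A minimal sketch of that console check (the comments are mine; exact digits may vary by platform):

```r
.Machine$double.eps     # smallest x with 1 + x != 1, about 2.2e-16
.Machine$double.xmin    # smallest normalized positive double, about 2.2e-308
.Machine$double.digits  # 53 binary digits in the significand
```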
That number is effectively zero when interpreted on a range of [0,1].
The Wikipedia page you linked to was for the Decimal64 type, which R does not use -- it uses standard-issue doubles.
First, some definitions from the .Machine help page.

So you can represent numbers smaller than 2.2e-16, but their accuracy is diminished, and this causes problems with calculations. Try some examples with numbers close to the smallest representable value.
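A sketch of such examples (the divisor is arbitrary): values below .Machine$double.xmin are "subnormal" -- still representable, but carrying fewer significant bits, so arithmetic on them silently loses precision.

```r
.Machine$double.xmin             # ~2.2e-308, smallest full-precision double
x <- .Machine$double.xmin / 1e6  # subnormal: representable, reduced precision
x > 0                            # still distinguishable from zero
(x * 1e6) / .Machine$double.xmin # usually not exactly 1: precision was lost
```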
You mentioned in a comment that you wanted to do Bonferroni corrections. Rather than rolling your own code for this, I suggest that you use p.adjust(your_p_value, method = "bonferroni") instead. pairwise.t.test uses this.
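A minimal sketch, with a made-up p-value and family size:

```r
p <- 1e-20  # hypothetical single-test p-value
m <- 100    # hypothetical number of tests in the family
p.adjust(p, method = "bonferroni", n = m)  # min(1, p * m)
```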