Best way to write a Python function that integrates a Gaussian?
In attempting to use scipy's quad function to integrate a Gaussian (let's say there's a Gaussian function named gauss), I was having problems passing the needed parameters to gauss and leaving quad to do the integration over the correct variable. Does anyone have a good example of how to use quad with a multidimensional function?
But this led me to a grander question about the best way to integrate a Gaussian in general. I didn't find a Gaussian integral in scipy (to my surprise). My plan was to write a simple Gaussian function and pass it to quad (or maybe now a fixed-width integrator). What would you do?
Edit: Fixed-width meaning something like trapz that uses a fixed dx to calculate areas under a curve.
What I've come to so far is a function make_gauss that returns a lambda function that can then go into quad. This way I can make a normal function with the mean and variance I need before integrating.
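    import numpy
    from numpy import inf
    from scipy.integrate import quad

    def make_gauss(N, sigma, mu):
        # Returns a one-argument function of x with N, sigma and mu baked in.
        return (lambda x: N/(sigma * (2*numpy.pi)**.5) *
                numpy.e ** (-(x-mu)**2/(2 * sigma**2)))

    quad(make_gauss(N=10, sigma=2, mu=0), -inf, inf)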
When I tried passing a general gaussian function (that needs to be called with x, N, mu, and sigma) and filling in some of the values using quad like
quad(gen_gauss, -inf, inf, (10,2,0))
the parameters 10, 2, and 0 did NOT necessarily match N=10, sigma=2, mu=0, which prompted the more extended definition.
The erf(z) in scipy.special would require me to define exactly what t is initially, but it is nice to know it is there.
Okay, you appear to be pretty confused about several things. Let's start at the beginning: you mentioned a "multidimensional function", but then go on to discuss the usual one-variable Gaussian curve. This is not a multidimensional function: when you integrate it, you only integrate one variable (x). The distinction is important to make, because there is a monster called a "multivariate Gaussian distribution" which is a true multidimensional function and, if integrated, requires integrating over two or more variables (which uses the expensive Monte Carlo technique I mentioned before). But you seem to just be talking about the regular one-variable Gaussian, which is much easier to work with, integrate, and all that.
The one-variable Gaussian distribution has two parameters, sigma and mu, and is a function of a single variable we'll denote x. You also appear to be carrying around a normalization parameter n (which is useful in a couple of applications). Normalization parameters are usually not included in calculations, since you can just tack them back on at the end (remember, integration is a linear operator: int(n*f(x), x) = n*int(f(x), x)). But we can carry it around if you like; the notation I like for a normal distribution is then
N(x | mu, sigma, n) := (n/(sigma*sqrt(2*pi))) * exp((-(x-mu)^2)/(2*sigma^2))
(read that as "the normal distribution of x given mu, sigma, and n is given by...") So far, so good; this matches the function you've got. Notice that the only true variable here is x: the other three parameters are fixed for any particular Gaussian.
Now for a mathematical fact: it is provably true that all Gaussian curves have the same shape, they're just shifted and rescaled. So we can work with N(x|0,1,1), called the "standard normal distribution", and just translate our results back to the general Gaussian curve. So if you have the integral of N(x|0,1,1), you can trivially calculate the integral of any Gaussian. This integral appears so frequently that it has a special name: the error function erf. Because of some old conventions, it's not exactly erf; there are a couple of additive and multiplicative factors also being carried around.
If Phi(z) = integral(N(x|0,1,1), -inf, z); that is, Phi(z) is the integral of the standard normal distribution from minus infinity up to z, then it's true by the definition of the error function that
Phi(z) = 0.5 + 0.5 * erf(z / sqrt(2)).
Likewise, if Phi(z | mu, sigma, n) = integral(N(x | mu, sigma, n), -inf, z); that is, Phi(z | mu, sigma, n) is the integral of the normal distribution given parameters mu, sigma, and n from minus infinity up to z, then it's true by the definition of the error function that
Phi(z | mu, sigma, n) = (n/2) * (1 + erf((z - mu) / (sigma * sqrt(2)))).
Take a look at the Wikipedia article on the normal CDF if you want more detail or a proof of this fact.
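As a sanity check on the closed form above, here is a minimal sketch that compares it against a direct numerical integration with quad. The function names gauss and gauss_cdf and the sample parameter values are illustrative assumptions, not part of the original post.

```python
import math
from scipy.integrate import quad
from scipy.special import erf

def gauss(x, mu, sigma, n):
    """N(x | mu, sigma, n) as defined in the text."""
    return n / (sigma * math.sqrt(2 * math.pi)) * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

def gauss_cdf(z, mu, sigma, n):
    """Phi(z | mu, sigma, n): closed form via the error function."""
    return (n / 2) * (1 + erf((z - mu) / (sigma * math.sqrt(2))))

# Arbitrary sample values (assumptions for illustration only).
mu, sigma, n, z = 1.0, 2.0, 10.0, 2.5

closed_form = float(gauss_cdf(z, mu, sigma, n))
# quad returns (estimate, estimated absolute error).
numeric, abs_err = quad(gauss, -math.inf, z, args=(mu, sigma, n))
print(closed_form, numeric)
```

The two numbers should agree to within quad's reported error estimate, and the closed form needs only a single erf evaluation.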
Okay, that should be enough background explanation. Back to your (edited) post. You say "The erf(z) in scipy.special would require me to define exactly what t is initially". I have no idea what you mean by this; where does t (time?) enter into this at all? Hopefully the explanation above has demystified the error function a bit and it's clearer now as to why the error function is the right function for the job.
Your Python code is OK, but I would prefer a closure over a lambda:
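A minimal sketch of such a closure, written to match the description that follows it (constants k and s precomputed, no ** operator, no division inside the returned function). The exact body is an assumption reconstructed from that description.

```python
import math

def make_gauss(N, sigma, mu):
    # Both constants are computed once, when the closure is created.
    k = N / (sigma * math.sqrt(2 * math.pi))  # normalization factor
    s = -1.0 / (2 * sigma * sigma)            # scale applied to the squared distance

    def f(x):
        d = x - mu
        # Only a subtraction, two multiplies and one exp() per call.
        return k * math.exp(s * d * d)

    return f
```

The returned f takes a single argument, so it can be handed straight to an integrator such as quad.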
Using a closure enables precomputation of constants k and s, so the returned function will need to do less work each time it's called (which can be important if you're integrating it, which means it'll be called many times). Also, I have avoided any use of the exponentiation operator **, which is slower than just writing the squaring out, and hoisted the divide out of the inner loop and replaced it with a multiply. I haven't looked at all at their implementation in Python, but from my last time tuning an inner loop for pure speed using raw x87 assembly, I seem to remember that adds, subtracts, or multiplies take about 4 CPU cycles each, divides about 36, and exponentiation about 200. That was a couple of years ago, so take those numbers with a grain of salt; still, it illustrates their relative complexity. As well, calculating exp(x) the brute-force way is a very bad idea; there are tricks you can take when writing a good implementation of exp(x) that make it significantly faster and more accurate than a general a**b style exponentiation.
I've never used the numpy version of the constants pi and e; I've always stuck with the plain old math module's versions. I don't know why you might prefer either one.
I'm not sure what you're going for with the quad() call. quad(gen_gauss, -inf, inf, (10,2,0)) ought to integrate a renormalized Gaussian from minus infinity to plus infinity, and should always spit out 10 (your normalization factor), since the Gaussian integrates to 1 over the real line. Any answer far from 10 (I wouldn't expect exactly 10 since quad() is only an approximation, after all) means something is screwed up somewhere... hard to say what's screwed up without knowing the actual return value and possibly the inner workings of quad().
Hopefully that has demystified some of the confusion, and explained why the error function is the right answer to your problem, as well as how to do it all yourself if you're curious. If any of my explanation wasn't clear, I suggest taking a quick look at Wikipedia first; if you still have questions, don't hesitate to ask.
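One likely source of the parameter mismatch described in the question: quad appends the args tuple positionally after x, in the order of the function's signature. The sketch below assumes gen_gauss is declared as gen_gauss(x, N, mu, sigma), which is how the question describes calling it; the function body is a guess. With that signature, (10, 2, 0) would bind mu=2 and sigma=0 rather than sigma=2 and mu=0.

```python
import math
from functools import partial
from scipy.integrate import quad

def gen_gauss(x, N, mu, sigma):
    # Assumed signature: x first, then N, mu, sigma (hypothetical body).
    return N / (sigma * math.sqrt(2 * math.pi)) * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# args follow the signature order after x: here N=10, mu=0, sigma=2.
total, abs_err = quad(gen_gauss, -math.inf, math.inf, args=(10, 0, 2))

# Binding by keyword avoids depending on the argument order at all.
total_kw, _ = quad(partial(gen_gauss, N=10, mu=0, sigma=2), -math.inf, math.inf)

print(total, total_kw)  # both should be close to the normalization factor 10
```

Either the tuple order has to match the signature, or the parameters should be bound by name before the function is passed to quad.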
scipy ships with the "error function", aka Gaussian integral:
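A minimal sketch of using it, assuming the standard definition erf(z) = 2/sqrt(pi) * integral of exp(-t^2) from 0 to z, where t is only the dummy integration variable and never has to be supplied. The helper name standard_normal_cdf is illustrative.

```python
import math
from scipy.special import erf

def standard_normal_cdf(z):
    # Integral of the standard normal density from -inf up to z.
    return 0.5 * (1.0 + erf(z / math.sqrt(2)))

half = standard_normal_cdf(0.0)
# Area within one standard deviation of the mean, roughly 0.6827.
within_one_sigma = standard_normal_cdf(1.0) - standard_normal_cdf(-1.0)
print(half, within_one_sigma)
```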
The Gaussian distribution is also called the normal distribution. The cdf function of scipy.stats.norm does what you want.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm
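For instance, a short sketch of integrating a scaled Gaussian between two finite limits with norm.cdf. The parameter values and the limits a and b are arbitrary assumptions for illustration.

```python
from scipy.stats import norm

mu, sigma, N = 0.0, 2.0, 10.0   # mean, standard deviation, normalization factor
a, b = -1.0, 3.0                # integration limits

# Area under N * pdf between a and b is N * (CDF(b) - CDF(a)).
area = N * (norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma))
print(area)
```

Here loc is the mean and scale is the standard deviation, so no hand-written density is needed.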
Why not just always do your integration from -infinity to +infinity, so that you always know the answer? (joking!)
My guess is that the only reason that there's not already a canned Gaussian function in SciPy is that it's a trivial function to write. Your suggestion about writing your own function and passing it to quad to integrate sounds excellent. It uses the accepted SciPy tool for doing this, it's minimal code effort for you, and it's very readable for other people even if they've never seen SciPy.
What exactly do you mean by a fixed-width integrator? Do you mean using a different algorithm than whatever QUADPACK is using?
Edit: For completeness, here's something like what I'd try for a Gaussian with the mean of 0 and standard deviation of 1 from 0 to +infinity:
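A sketch of what that could look like, with the density written inline as a lambda. The variable names are assumptions.

```python
from math import exp, inf, pi
from scipy.integrate import quad

mean = 0
sd = 1

# Integrate the normal density from 0 to +infinity; quad returns
# (estimate, estimated absolute error).
value, abs_err = quad(
    lambda x: 1 / (sd * (2 * pi) ** 0.5) * exp(-(x - mean) ** 2 / (2 * sd ** 2)),
    0, inf)
print(value)  # half of the total area, so close to 0.5
```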
That's a little ugly because the Gaussian function is a little long, but still pretty trivial to write.
I assume you're handling multivariate Gaussians; if so, SciPy already has the function you're looking for: it's called MVNDIST ("MultiVariate Normal DISTribution"). The SciPy documentation is, as ever, terrible, so I can't even find where the function is buried, but it's in there somewhere. The documentation is easily the worst part of SciPy, and has frustrated me to no end in the past.
Single-variable Gaussians just use the good old error function, of which many implementations are available.
As for attacking the problem in general, yes, as James Thompson mentions, you just want to write your own gaussian distribution function and feed it to quad(). If you can avoid the generalized integration, though, it's a good idea to do so -- specialized integration techniques for a particular function (like MVNDIST uses) are going to be much faster than a standard Monte Carlo multidimensional integration, which can be extremely slow for high accuracy.
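For the multivariate case, one way to reach a specialized routine in current SciPy releases is scipy.stats.multivariate_normal.cdf. This is offered as a hedged sketch, not necessarily the same entry point the answer had in mind, and the mean and covariance values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

mean = np.zeros(2)   # two independent standard normal components
cov = np.eye(2)

# Probability that both components are <= 0; for independent standard
# normals this is 0.5 * 0.5 = 0.25.
p = multivariate_normal.cdf(np.array([0.0, 0.0]), mean=mean, cov=cov)
print(p)
```

The result comes from numerical integration, so it is approximate rather than exact.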