Gaussian Process regression with scikit-learn
Context:
In Gaussian Process (GP) regression we can use two approaches (made concrete in the formulas after the list):
(I) Fit the kernel parameters via Maximum Likelihood (maximize data likelihood) and use the GP defined by these
parameters for prediction.
(II) Bayesian approach: put a parametric prior distribution on the kernel parameters.
The parameters of this prior distribution are called the hyperparameters.
Condition on the data to obtain a posterior distribution for the kernel parameters and now either
(IIa) fit the kernel parameters by maximizing their posterior density (the MAP parameters)
and use the GP defined by the MAP-parameters for prediction, or
(IIb) (the full Bayesian approach): predict using the mixture model which integrates over all the GPs defined by
the admissible kernel parameters, weighted by the posterior distribution of the kernel parameters.
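To make the contrast concrete, writing $\theta$ for the kernel parameters, $(X, y)$ for the training data and $f_*$ for a prediction:

$$\text{(I)}\quad \hat{\theta} = \arg\max_\theta \, p(y \mid X, \theta), \qquad \text{predict with } p(f_* \mid X, y, \hat{\theta}),$$

$$\text{(IIb)}\quad p(f_* \mid X, y) = \int p(f_* \mid X, y, \theta)\, p(\theta \mid X, y)\, d\theta, \qquad p(\theta \mid X, y) \propto p(y \mid X, \theta)\, p(\theta),$$

where the parameters of the prior $p(\theta)$ are the hyperparameters.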
(IIb) is the principal approach advocated in the reference [RW2006] cited in the package.
The point is that hyperparameters exist only in the Bayesian approach and are the parameters of the prior
distribution on kernel parameters.
Therefore I am confused about the use of the term "hyperparameters" in the documentation, e.g.
here
where it is stated that
"Kernels are parameterized by a vector of hyperparameters".
This must be interpreted as a sort of indirect parameterization via conditioning on the data as the hyperparameters
do not directly determine the kernel parameters.
Then an example is given of the exponential kernel and its length-scale parameter.
This is definitely not a hyperparameter as this term is generally used.
No distinction seems to be drawn between kernel-parameters and hyperparameters.
This is confusing and it is now unclear if the package uses the Bayesian approach at all.
For example where do we specify the parametric family of prior distributions on kernel parameters?
Question: does scikit-learn use approach (I) or (II)?
Here is my own tentative answer:
the confusion comes from the fact that a Gaussian Process is often called a "prior on functions", indicating some sort of Bayesianism. Worse still, the process is infinite-dimensional, so restricting to the finite data dimensions is some sort of "marginalization".
This is also confusing since in general you have marginalization only in the Bayesian approach where you have a joint distribution of data and parameters,
so you often marginalize out one or the other.
The correct view here, however, is the following: the Gaussian Process is the model and the kernel parameters are the model parameters. In scikit-learn there are no hyperparameters since there is no prior distribution on the kernel parameters; the so-called LML (log marginal likelihood) is the ordinary data likelihood given the model parameters, and the parameter fit is ordinary maximum data-likelihood. In short, the approach is (I) and not (II).
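If this is right, then a plain call to fit() is exactly approach (I). Here is a minimal sketch of what I understand happens (the toy data is made up just to have something to fit):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Made-up toy data, only so the example runs end to end.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)

# length_scale is the kernel parameter to be fitted; no prior is placed on it anywhere.
kernel = RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gp.fit(X, y)  # maximizes the log-marginal likelihood over the kernel parameters

print(gp.kernel_)                                    # kernel with the fitted length-scale
print(gp.log_marginal_likelihood(gp.kernel_.theta))  # LML at the fitted parameters
```

After fit(), kernel_ holds the kernel with the maximum-LML parameters, and there is indeed no place to pass a prior over them.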
Answer:
If you read the scikit-learn documentation on GP regression, you clearly see that the kernel (hyper)parameters are optimized. Take a look for example at the description of the argument
n_restarts_optimizer
: "The number of restarts of the optimizer for finding the kernel’s parameters which maximize the log-marginal likelihood." In your question that is approach (i).I would note two more things though:
GaussianProcessRegressor
class "exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo." So, technically it is possible to make it "fully Bayesian" (your approach (ii)) but you must provide the inference method.