The meat of my question is "how does one design a kernel function for a learning problem?"
As a quick background, I'm reading books on support vector machines and kernel machines, and everywhere I look authors give examples of kernels (homogeneous and nonhomogeneous polynomial kernels, Gaussian kernels, and allusions to text-based kernels, to name a few), but they all either provide pictures of the results without specifying the kernel, or vaguely claim that "an efficient kernel can be constructed". I'm interested in the process that goes on when one designs a kernel for a new problem.
Probably the easiest example is learning XOR, the smallest (4-point) non-linearly-separable data set embedded in the real plane. How would one come up with a natural (and non-trivial) kernel to linearly separate this data?
As a more complex example (see Cristianini, Introduction to SVMs, figure 6.2), how would one design a kernel to learn a checkerboard pattern? Cristianini states the picture was derived "using Gaussian kernels", but it seems that he uses multiple kernels, combined and modified in an unspecified way.
If this question is too broad to answer here, I'd appreciate a reference to the construction of one such kernel function, though I'd prefer the example be somewhat simple.
Trying the usual suspects (linear, polynomial, RBF) and using whichever works the best really is sound advice for someone trying to get the most accurate predictive model they can. For what it's worth it's a common criticism of SVMs that they seem to have a lot of parameters that you need to tune empirically. So at least you're not alone.
If you really want to design a kernel for a specific problem then you are right, it is a machine learning problem all in itself. It's called the 'model selection problem'. I'm not exactly an expert myself here, but the best source of insight into kernel methods for me was the book 'Gaussian Processes for Machine Learning' by Rasmussen and Williams (it's freely available online), particularly chapters 4 and 5. I'm sorry that I can't say much more than 'read this huge book full of maths' but it's a complicated problem and they do a really good job of explaining it.
(For anyone not familiar with the use of kernel functions in Machine Learning, kernels just map the input vectors (the data points that comprise the data set) into a higher-dimensional space, aka the "Feature Space". The SVM then finds a separating hyperplane with the maximal margin (distance between the hyperplane and the support vectors) in this transformed space.)
Well, start with kernels that are known to work with SVM classifiers to solve the problem of interest. In this case, we know that the RBF (radial basis function) kernel with a trained SVM cleanly separates XOR. You can write an RBF function in Python this way:
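(The code block from the original answer isn't reproduced above; the following is a minimal sketch, assuming the formula quoted later in this thread, np.exp(-gamma * np.abs(x - y)**2).)

import numpy as np

def rbf_kernel(x, y, gamma):
    # RBF (Gaussian) kernel: k(x, y) = exp(-gamma * ||x - y||^2)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.sum(diff ** 2))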
where gamma is 1/(number of features) (i.e., columns in the data set), and x, y are the pair of points being compared.
(A radial basis function module is also in scipy.interpolate.Rbf)
Second, if what you are after is not just using available kernel functions to solve classification/regression problems, but instead you want to build your own, I would suggest first studying how the choice of kernel function and the parameters inside those functions affect classifier performance. The small group of kernel functions in common use with SVM/SVC is the best place to start (a minimal sketch of each appears after the list below). This group consists of (aside from RBF):
linear kernel
polynomial
sigmoid
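For reference, here is a minimal sketch of what these kernels compute (my own formulation; gamma, degree and coef0 are the usual hyperparameters, not values taken from this answer):

import numpy as np

def linear_kernel(x, y):
    # k(x, y) = x . y
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    # k(x, y) = (gamma * x . y + coef0) ** degree
    return (gamma * np.dot(x, y) + coef0) ** degree

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    # k(x, y) = tanh(gamma * x . y + coef0)
    return np.tanh(gamma * np.dot(x, y) + coef0)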
(支持他的答案的另一个原因);事实上,他的 rbf 是一个更通用的解决方案,适用于所有 XOR 情况。My approach would be to study the data: how would I separate the points in the XOR problem? When I started studying about M.L. in general, and SVM in particular that is what I did, took toy problem, draw it by hand, and try to separate the classes.
When I looked at the XOR problem the first time, it occurred to me that both purple points (below left) have X and Y of the same sign, negative in one case and positive in the other, whereas both green points have X and Y of opposite signs. Therefore, the absolute value of the sum of X and Y would be 0 (or very small with a bit of noise in the initial problem) for the green points, and 2 (or nearly 2) for the purple ones. Hence, adding a third coordinate
Z = np.sqrt(np.square(X + Y))
will nicely separate the two sets. On a side note, Z is not too dissimilar a formulation from doug's rbf if you consider that np.sqrt(np.square(X + Y)) is essentially the same as np.abs(X + Y) in this case.
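As a quick numerical check (my own sketch, not part of the original answer), applying that transform to the four bipolar XOR points puts one class at Z = 2 and the other at Z = 0, so any threshold in between separates them linearly:

import numpy as np

# Bipolar XOR: purple class at (-1, -1) and (1, 1), green class at (-1, 1) and (1, -1)
X = np.array([-1.0, 1.0, -1.0, 1.0])
Y = np.array([-1.0, 1.0, 1.0, -1.0])

Z = np.sqrt(np.square(X + Y))   # equivalently np.abs(X + Y)
print(Z)                        # [2. 2. 0. 0.]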
I do not have access to Cristianini's book, but I'd approach that problem too in a similar way, starting with a toy version (by the way, the checkerboard toy code and plot are thanks to none other than doug).
A possible intuition here is that the sum of row and column indices for the black squares would always be even, whereas for the white squares it would always be odd, so adding something like (row_index + col_index) % 2 as a third dimension would do the trick in this simple version. In a larger, more complex checkerboard dataset, like one I found on the web, things aren't so simple, but perhaps one could cascade clustering to find the 16 clusters' mean X and Y locations (perhaps using medoids clustering), then apply a version of the "modulo kernel trick"?
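Doug's checkerboard snippet isn't reproduced here, but a minimal stand-in for this toy version (my own sketch; the grid size is arbitrary) might be:

import numpy as np

# Toy checkerboard: an 8x8 grid of squares, coloured by the parity of row + column
rows, cols = np.indices((8, 8))
color = (rows + cols) % 2   # 0 for "black" squares, 1 for "white" squares

# Use the parity as a third coordinate on top of (row, col)
features = np.stack([rows.ravel(), cols.ravel(), color.ravel()], axis=1)
# In (row, col, parity) space the two colours lie on the planes z = 0 and z = 1,
# so they are trivially linearly separable.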
With the disclaimer that I have not worked with a ton of classification problems, so far I have found that in making a toy version of a complex one, I've usually gained a 'numerical' intuition as to the kind of solution that might work.
Finally, as posted in a comment to doug's answer, I don't find anything wrong with an empirical approach like his: studying the performance of all possible kernels by passing them to a grid search in nested cross-validation with the same algorithm (SVC) and changing only the kernel. You could add to that approach by plotting the respective margins in the transformed feature spaces: for example, for rbf, using the equation suggested by doug (and Sebastian Raschka's routine for plotting decision regions - cell 13 here).
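A minimal sketch of that empirical approach, assuming scikit-learn (the dataset and parameter grid below are placeholders, not the ones used in doug's answer):

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Inner loop: grid search over kernels and their parameters; outer loop: cross-validation
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1, 10], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
nested_scores = cross_val_score(search, X, y, cv=5)
print(nested_scores.mean())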
UPDATE October 27, 2017
In a conversation on my Slack channel, another geophysicist asked me about the case in which the XOR gate is defined with 0s and 1s as opposed to -1s and 1s (the latter being similar to a classic problem in exploration geophysics, hence my initial toy example).
If I were to tackle the XOR gate with 0s and 1s, and did not have the knowledge about the rbf kernel at my disposal, in this case too I'd sit and look at the problem in terms of the coordinates of those points and see if I could come up with a transformation.
My first observation here was that the Os sit on the line x = y, and the Xs on the opposite diagonal (the line x + y = 1), so the difference x - y would be 0 (or small with a bit of noise) in one case and +/-1 in the other. The absolute value takes care of the sign, hence Z = np.abs(X - Y) would work. This is, by the way, very similar to doug's rbf = np.exp(-gamma * np.abs(x - y)**2) (another reason to upvote his answer); and in fact, his rbf is a more general solution, working in all XOR cases.
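Again just as a check (my own sketch, not from the original post), the 0/1 version in code:

import numpy as np

# Unipolar XOR gate: Os at (0, 0) and (1, 1), Xs at (0, 1) and (1, 0)
X = np.array([0.0, 1.0, 0.0, 1.0])
Y = np.array([0.0, 1.0, 1.0, 0.0])

Z = np.abs(X - Y)
print(Z)   # [0. 0. 1. 1.] -> a threshold at 0.5 separates Os from Xs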
I was looking for some worked examples of polynomial kernels and stumbled on this post. A couple of things might help if you are still looking. One is this toolkit (http://www2.fml.tuebingen.mpg.de/raetsch/projects/shogun), which uses multiple kernel learning: you can choose from a wide selection of kernel methods and the learning will pick the best one for the problem, so you don't have to.
An easier, more traditional method for choosing a kernel is to use cross-validation with different kernel methods to find the best one.
Hope this helps you or anyone else reading around kernel methods.
The XOR problem is not linearly separable, so when using e.g. an SVM, use a polynomial kernel:
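(The snippet from the original answer isn't shown above; below is a minimal stand-in using scikit-learn's SVC, with arbitrary hyperparameters.)

import numpy as np
from sklearn.svm import SVC

# XOR gate data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# clf = SVC(kernel="linear")   # commented out: a linear kernel cannot separate XOR
clf = SVC(kernel="poly", degree=2, coef0=1, C=10)
clf.fit(X, y)
print(clf.predict(X))          # should recover the XOR labels: [0 1 1 0]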
The linear kernel is commented out here - you can uncomment it to see the results of using this wrong kernel...
Or you can use an RBF kernel (e.g. here) to apply the kernel trick (described in general at the link as a Lagrange dual problem)... And here are several limitations of SVMs (in addition to the fact that an SVM cannot acquire new knowledge from new points in a real-time data stream - unlike a perceptron, an SVM needs to be retrained).
P.S. sklearn still seems a more convenient library for multi-class and multi-label classification.
A simple kernel function that solves the XOR problem is given by
$(x, y) \mapsto (x, y, xy)$
This is related to the solution given by @MyCarta, since
$(x+y)^2 = x^2 + 2xy + y^2$
However, the squared distance to the origin, $x^2 + y^2$, is irrelevant for distinguishing the four quadrants, so we can safely omit it and get a simpler expression.
Here is an illustration of applying such a kernel.
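The original figure isn't reproduced here; as a stand-in, a small sketch (my own, using the bipolar ±1 encoding of XOR) shows the third coordinate alone splitting the two classes:

import numpy as np

# Bipolar XOR points, one class per pair of rows
points = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)

# Feature map (x, y) -> (x, y, xy)
mapped = np.column_stack([points[:, 0], points[:, 1], points[:, 0] * points[:, 1]])
print(mapped[:, 2])   # [ 1.  1. -1. -1.] -> the plane z = 0 separates the classes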
Clearly, the boundary hyperplane is $z=0$, equivalently $xy=0$, which corresponds to the horizontal and vertical axes, $y=0$ and $x=0$, back in two dimensions.
For the unipolar case, shifting the boundary half a unit in both directions suffices, yielding the hyperbola $\left(x-\frac{1}{2}\right)\left(y-\frac{1}{2}\right) = 0$.
The terms needed, $x$, $y$ and $xy$ are included in the quadratic polynomial kernel, but the other terms are not necessary. Therefore, this reduced 3D kernel can be visualized more easily.
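For completeness, here is the standard expansion (not from the original answer) showing why those terms appear. For the inhomogeneous quadratic kernel in two dimensions, $(1 + \mathbf{u}\cdot\mathbf{v})^2 = \phi(\mathbf{u})\cdot\phi(\mathbf{v})$ with $\phi(x, y) = \left(1,\ \sqrt{2}\,x,\ \sqrt{2}\,y,\ x^2,\ y^2,\ \sqrt{2}\,xy\right)$, so the feature space contains $x$, $y$ and $xy$ (up to constant factors), plus the extra coordinates $1$, $x^2$ and $y^2$ that the reduced map $(x, y) \mapsto (x, y, xy)$ discards.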