Is there an alternative to using N yes/no classifiers to classify data into one of N classes?
TL;DR: is there any kind of classifier more sophisticated than a yes-no classifier?
I'll say up front that I don't have a specific project I'm working on, and this is more of a technique question I've been wondering about.
I've worked on a few machine learning applications for one reason or another. All of these projects were intended to classify data into one of N classes, and they all used N yes-no classifiers (if that's what they're called). Each of these classifiers gives a piece of data some score (0 to 1, or -1 to 1) which corresponds to the likelihood that it's the class that the classifier was trained for. It's then up to the program to use those scores to determine the best classification somehow.
I've seen this on both nominal and continuous data, with different implementations of the final classification. For example, I once wrote a small document language identifier in which classifiers were trained on English, French, German, etc, and whichever classifier gave the highest score won. This makes sense to me.
Another project classified data on a continuous scale, mostly from 0 to 1.2, but with some data up to 6. We made 6 or so classifiers and assigned them to bins: 0-0.2, 0.2-0.4, ... and 1.0 and above. Once all the classifiers had returned scores for a piece of data, we fit a quadratic to the scores and took the peak as the result. This makes me uncomfortable, but I don't know why.
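For concreteness, a minimal sketch of that quadratic-fit step, assuming the six classifiers have already returned scores; the bin centers and scores below are made up:

    import numpy as np

    # Made-up bin centers for the six classifiers (the open-ended last bin,
    # "1.0 and above", is represented by an arbitrary center of 1.1) and
    # made-up scores returned for one piece of data.
    bin_centers = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 1.1])
    scores = np.array([0.05, 0.20, 0.70, 0.90, 0.40, 0.10])

    # Fit a quadratic to (bin center, score) and take its peak as the result.
    a, b, c = np.polyfit(bin_centers, scores, deg=2)
    peak = -b / (2 * a)  # vertex of the parabola; meaningful only when a < 0
    print(peak)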
It seems like there should be a better way than just polling a set of yes-no classifiers and trying to decide based on some algorithm. To take a silly example, consider a system to decide whether a picture shows an onion or a mushroom. (This is literally the first thing I thought of.) I would argue that the more an object looks like an onion, the less it looks like a mushroom, and from an ontological perspective I want a classification method that reflects that. If I have two yes-no classifiers that don't take into account that onionity opposes mushroomness, what do I do about a picture that gets high scores from both? Is there some way to get a single, mushroom-or-onion classifier that somehow knows that there is no overlap between these two classes of vegetation? Or can I count on training the yes-no classifiers with real data to reflect this without any special intervention?
There are two broad schools of classification:
1) Discriminative - Here we try to learn a decision boundary from the training examples. Then, based on which part of space the test example lies in, as determined by the decision boundary, we assign it a class. The state-of-the-art algorithm is the SVM, but you need kernels if your data can't be separated by a line (e.g., if it is separable by a circle).
Modifications to SVM for Multi-class (many ways of doing this, here's one):
Let the jth (of k) training example xj be in class i (of N). Then its label yj = i.
a) Feature vector: if xj is a training example belonging to class i (of N), then the feature vector corresponding to xj is phi(xj, yj) = [0 0 ... X ... 0]
Note: X (the D features of xj) sits in the ith "position" (block). phi has a total of D*N components, where each example has D features, e.g. a picture of an onion has D = 640*480 greyscale integers.
Note: for any other class p, i.e. y = p, phi(xj, y) has X in block p of the feature vector, and all other components are zero.
b) Constraints: minimize ||W||^2 (as in the vanilla SVM) such that:
1) For all labels y except y1: W.phi(x1,y1) >= W.phi(x1, y) + 1
and 2) For all labels y except y2: W.phi(x2,y2) >= W.phi(x2, y) + 1
...
and k) For all labels y except yk: W.phi(xk, yk) >= W.phi(xk, y) + 1
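For reference, a minimal sketch of this single-machine multi-class SVM, assuming scikit-learn is acceptable: its LinearSVC exposes a Crammer-Singer formulation, which learns one weight block per class (the D*N-dimensional W above) in a single joint optimization rather than N independent yes/no machines. The toy data is made up.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Toy data: 3 classes in a 2-D feature space (D = 2, N = 3).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 0.5, size=(20, 2))
                   for loc in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 20)

    # 'crammer_singer' solves one joint optimization with a weight block
    # per class, i.e. the D*N-dimensional W described above, instead of
    # training N independent yes/no classifiers.
    clf = LinearSVC(multi_class='crammer_singer').fit(X, y)
    print(clf.predict([[2.8, 0.2], [0.1, 2.9]]))  # expect [1, 2]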
2) Generative - Here we ASSUME (which may turn out to be nonsense) that each example was generated by a probability distribution for that class (like a Gaussian for male faces and one for female faces, which works well in practice), and we try to learn the parameters (mean, covariance) of each distribution by calculating the mean and covariance of the training examples corresponding to that class. Then, for a test example, we see which distribution gives the highest probability and classify accordingly.
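A minimal sketch of this Gaussian generative classifier, assuming numpy/scipy and made-up data: fit a mean and covariance per class, then pick the class whose fitted density is highest.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Toy data: two classes in 2-D.
    rng = np.random.default_rng(1)
    X0 = rng.normal([0, 0], 1.0, size=(50, 2))   # class 0
    X1 = rng.normal([4, 4], 1.0, size=(50, 2))   # class 1

    # Learn the parameters: per-class mean and covariance.
    params = [(X.mean(axis=0), np.cov(X, rowvar=False)) for X in (X0, X1)]

    def classify(x):
        # Pick the class whose fitted Gaussian gives the highest density
        # (uniform class priors assumed here).
        densities = [multivariate_normal.pdf(x, mean=m, cov=c)
                     for m, c in params]
        return int(np.argmax(densities))

    print(classify([0.5, -0.2]), classify([3.8, 4.1]))  # expect 0 1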
Neither uses N yes-no classifiers.
The discriminative method works better in practice for classification, but can't model probabilistic answers. It also needs a large number of training examples for the optimization step (minimize W^2) to converge. There is a technique to combine the two, avoiding kernels, called Maximum Entropy Discrimination.
To answer your other question:
This is more of a problem with the input data, not with the learning algorithm itself, which just works on a matrix of numbers. It could reflect noise/uncertainty in the domain (that is, can humans tell mushrooms apart from onions perfectly?). This may be fixed by a larger/better (training) dataset. Or maybe you picked a bad distribution to model, in the generative case.
Most people would pre-process the raw images, prior to classification, in a stage called Feature Selection. One feature selection technique could be to capture the silhouette of the vegetable, since mushrooms and onions have different shapes; the rest of the image may be "noise". In other domains like natural language processing, you could drop prepositions and retain a count of the different nouns. But sometimes performance may not improve, because the learning algorithm might not look at all the features anyway. It really depends on what you're trying to capture - creativity is involved. Feature selection algorithms also exist.
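For illustration, a minimal sketch of one such feature selection algorithm, assuming scikit-learn's SelectKBest (a univariate filter); the data here is synthetic:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    # Synthetic data: 100 examples, 20 features, only the first 2 informative.
    rng = np.random.default_rng(2)
    y = rng.integers(0, 2, size=100)
    X = rng.normal(size=(100, 20))
    X[:, 0] += 3 * y          # feature 0 correlates with the class
    X[:, 1] -= 2 * y          # feature 1 correlates with the class

    # Keep the k features with the strongest class association (ANOVA F-test).
    selector = SelectKBest(f_classif, k=2).fit(X, y)
    print(selector.get_support(indices=True))  # expect [0, 1]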
Tony Jebara's courses at Columbia University are a good resource for machine learning.
The idea behind your example is that each question gives information about more than one classification. If you can establish some kind of conditional probabilities for these questions and their results, then you can also establish a confidence level for each class.
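To make that concrete, here is a hedged sketch (my own illustration, with invented numbers) of combining per-question results into class confidences via Bayes' rule, assuming the questions are conditionally independent given the class:

    import numpy as np

    # Invented likelihoods P(detector says yes | class), for the classes
    # [onion, mushroom].
    p_onion_detector_fires = np.array([0.9, 0.2])
    p_mushroom_detector_fires = np.array([0.1, 0.8])

    prior = np.array([0.5, 0.5])

    # Suppose both detectors fired; assuming conditional independence,
    # multiply the likelihoods and normalize to get class posteriors.
    posterior = prior * p_onion_detector_fires * p_mushroom_detector_fires
    posterior /= posterior.sum()
    print(posterior)  # confidence level for [onion, mushroom]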
It almost sounds like you are talking specifically about decision trees in your question. Decision trees are one of the most common types of classifiers; they are capable of handling multiple categories, discrete and continuous data, as well as missing values. The base decision tree algorithm is called ID3, and a popular improvement is C4.5. Decision tree results can often be further improved with boosting.
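A minimal sketch of a multi-class decision tree plus boosting, assuming scikit-learn (whose trees are CART rather than ID3/C4.5, but the idea is the same) and its built-in iris data:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)  # 3 classes, handled by a single tree

    # A single multi-class tree...
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

    # ...and the boosted version: an ensemble of shallow trees.
    # ('estimator' is the scikit-learn >= 1.2 keyword; older versions
    # called it 'base_estimator'.)
    boosted = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=50,
    ).fit(X, y)

    print(tree.predict(X[:3]), boosted.predict(X[:3]))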
You can also simply use a feed-forward neural net classifier with c output nodes, one output node for each class.
It is likely that the c-class neural network will need more hidden nodes in the intermediate layer than a set of 2-class neural net classifiers. Subsequently, feature selection indicates which input features give the major discriminative performance for your classification task.
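A minimal sketch of such a c-output network, assuming scikit-learn's MLPClassifier (which puts a softmax over the c output nodes) and synthetic data:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Synthetic 3-class problem in 2-D.
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(loc, 0.4, size=(30, 2))
                   for loc in ([0, 0], [2, 0], [1, 2])])
    y = np.repeat([0, 1, 2], 30)

    # One hidden layer; the output layer has c = 3 softmax nodes,
    # one per class, instead of 3 separate yes/no networks.
    net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000).fit(X, y)
    print(net.predict([[1.9, 0.1]]))        # expect [1]
    print(net.predict_proba([[1.9, 0.1]]))  # scores sum to 1 across classes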
For image processing with neural classifiers, see for example my site:
http://www.egmont-petersen.nl (Click on 'Science', and the review-paper from 2002).