This is in fact two questions in one ;-)
Add "algorithm selection", and you probably have the three most fundamental questions of classifier design.
As a side note, it's a good thing that you do not have any domain expertise which would have allowed you to guide the selection of features and/or to assert the linearity of the feature space. That's the fun of data mining: to infer such info without a priori expertise. (BTW, while domain expertise is good for double-checking the outcome of the classifier, too much a priori insight may make you miss good mining opportunities.) Without any such a priori knowledge, you are forced to establish sound methodologies and apply careful scrutiny to the results.
It's hard to provide specific guidance, in part because many details are left out of the question, and also because I'm somewhat BS-ing my way through this ;-). Nevertheless, I hope the following generic advice will be helpful.
For each algorithm you try (or more precisely, for each set of parameters for a given algorithm), you will need to run many tests. Theory can be very helpful, but there will remain a lot of "trial and error". You'll find Cross-Validation a valuable technique. In a nutshell, [and depending on the size of the available training data,] you randomly split the training data into several parts, train the classifier on one [or several] of these parts, and then evaluate the classifier's performance on another part [or parts]. For each such run you measure various performance indicators, such as the Mis-Classification Error (MCE). Aside from telling you how the classifier performs, these metrics, or rather their variability, will provide hints about the relevance of the selected features and/or problems with their scale or linearity.
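A minimal sketch of that procedure in plain NumPy (the helper name and the `train_fn`/`predict_fn` interface are inventions for illustration; any real toolkit has a built-in equivalent):

```python
import numpy as np

def cross_val_mce(train_fn, predict_fn, X, y, k=5, seed=0):
    """Estimate the Mis-Classification Error (MCE) by k-fold cross-validation.

    train_fn(X, y) -> model; predict_fn(model, X) -> predicted labels.
    The spread of the per-fold MCEs is as informative as their mean."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    mces = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        predictions = predict_fn(model, X[test_idx])
        mces.append(float(np.mean(predictions != y[test_idx])))
    return mces
```

If the per-fold MCEs vary wildly, suspect irrelevant features, scaling issues, or too little data per fold before blaming the algorithm itself.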
Independently of the linearity assumption, it is useful to normalize the values of numeric features. This helps with features that have unusual ranges, etc. Within each dimension, establish the range within, say, 2.5 standard deviations on either side of the median, and convert the feature values to a percentage relative to this range.
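A sketch of that normalization (the function name is this example's choice; the 2.5-standard-deviation window is the one suggested above, with out-of-window values clipped):

```python
import numpy as np

def robust_normalize(col, n_std=2.5):
    """Map a numeric feature onto a 0-100 percentage scale based on the
    window [median - n_std*std, median + n_std*std]; outliers clip to 0/100."""
    med, std = np.median(col), np.std(col)
    lo, hi = med - n_std * std, med + n_std * std
    if hi == lo:  # constant feature: nothing to scale
        return np.full_like(col, 50.0, dtype=float)
    return np.clip(100.0 * (col - lo) / (hi - lo), 0.0, 100.0)
```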
Convert nominal attributes to binary ones, creating as many dimensions as there are distinct values of the nominal attribute. (I think many algorithm optimizers will do this for you.)
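For example, a nominal attribute with values like "red"/"blue" becomes one binary column per distinct value (a hand-rolled sketch; most toolkits call this one-hot encoding and do it for you):

```python
def one_hot(values):
    """Expand a nominal attribute into one binary column per distinct value.
    Returns the encoded rows plus the column order used."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories
```

For instance, `one_hot(["red", "blue", "red"])` produces columns `["blue", "red"]` and rows `[[0, 1], [1, 0], [0, 1]]`.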
Once you have identified one or a few classifiers with relatively decent performance (say 33% MCE), run the same test series with such a classifier, modifying only one parameter at a time. For example, remove some features and see whether the resulting lower-dimensionality classifier improves or degrades.
The loss factor is a very sensitive parameter. Try to stick with one "reasonable" but possibly suboptimal value for the bulk of the tests, and fine-tune the loss at the end.
Learn to exploit the "dump" info provided by the SVM optimizers. These results provide very valuable info as to what the optimizer "thinks".
Remember that what worked very well with a given dataset in a given domain may perform very poorly with data from another domain...
Coffee's good, but not too much. When all else fails, make it Irish ;-)
Wow, so you have some training data, you don't know whether you are looking at features representing words in a document or genes in a cell, and you need to tune a classifier. Well, since you don't have any semantic information, you are going to have to do this solely by looking at statistical properties of the data sets.
First, to formulate the problem: this is about more than just linear vs. non-linear. If you are really looking to classify this data, what you need to do is select a kernel function for the classifier, which may be linear or non-linear (Gaussian, polynomial, hyperbolic, etc.). In addition, each kernel function may take one or more parameters that need to be set. Determining an optimal kernel function and parameter set for a given classification problem is not really a solved problem; there are only useful heuristics, and if you google 'selecting a kernel function' or 'choose kernel function', you will be treated to many research papers proposing and testing various approaches. While there are many approaches, one of the most basic and well travelled is to do a gradient descent on the parameters: you try a kernel method and a parameter set, train on half your data points, and see how you do. Then you try a different set of parameters and see how you do. You move the parameters in the direction of best improvement in accuracy until you get satisfactory results.
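That search can be sketched generically; here `accuracy` stands for "train on half the data with these kernel parameters, measure accuracy on the held-out half" (the helper name, step schedule, and parameter-dict interface are all this example's inventions, not a standard API):

```python
def hill_climb(accuracy, params, step=0.5, rounds=20):
    """Crude coordinate-wise search in the spirit described above: perturb
    each parameter, keep moves that improve hold-out accuracy, and shrink
    the step when no move helps."""
    params = dict(params)
    best = accuracy(params)
    for _ in range(rounds):
        improved = False
        for name in params:
            for delta in (+step, -step):
                trial = dict(params)
                trial[name] += delta
                acc = accuracy(trial)
                if acc > best:
                    best, params, improved = acc, trial, True
        if not improved:
            step /= 2.0
    return params, best
```

In practice you would plug in, e.g., an RBF kernel's bandwidth as the parameter and a cross-validated accuracy as the objective; this sketch only shows the search loop itself.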
If you don't need to go through all this complexity to find a good kernel function and simply want an answer to linear or non-linear, then the question mainly comes down to two things: non-linear classifiers have a higher risk of overfitting (under-generalizing) since they have more degrees of freedom. They can suffer from the classifier merely memorizing sets of good data points rather than coming up with a good generalization. On the other hand, a linear classifier has less freedom to fit, and in the case of data that is not linearly separable, it will fail to find a good decision function and suffer from high error rates.
Unfortunately, I don't know a better mathematical solution to answer the question "is this data linearly separable?" other than to just try the classifier itself and see how it performs. For that you are going to need a smarter answer than mine.
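One cheap probe worth knowing about (a classic heuristic, not something claimed in the answer above): the perceptron algorithm makes an error-free pass over the data exactly when a separating hyperplane exists, so running it with an epoch cap gives a rough yes/inconclusive answer:

```python
import numpy as np

def perceptron_separable(X, y, max_epochs=1000):
    """Heuristic linear-separability probe. y must be +1/-1 labels.
    Returns True if a separating hyperplane was found within max_epochs;
    False is inconclusive (the data may just need more epochs)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:  # misclassified: nudge the hyperplane
                w += yi * xi
                errors += 1
        if errors == 0:
            return True
    return False
```

Note the asymmetry: a `True` is a proof of separability, while a `False` only means no separator was found within the budget.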
Edit: This research paper describes an algorithm which looks like it should be able to determine how close a given data set comes to being linearly separable.
http://www2.ift.ulaval.ca/~mmarchand/publications/wcnn93aa.pdf