Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm? Why or why not?
Basically, I'm trying to figure out why the Wikipedia page for Statistical Classification does not mention LSI. I'm just getting into this stuff and I'm trying to see how all the different approaches for classifying something relate to one another.
4 Answers
No, they're not quite the same. Statistical classification is intended to separate items into categories as cleanly as possible -- to make a clean decision about whether item X is more like the items in group A or group B, for example.
LSI is intended to show the degree to which items are similar or different and, primarily, to find items that show a degree of similarity to a specified item. While this is related, it's not quite the same.
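A minimal sketch of that distinction: LSI-style retrieval ranks items by degree of similarity rather than forcing a hard class decision. The vectors below are made-up toy data for illustration only.

```python
import numpy as np

# Toy document vectors (e.g., term counts); hypothetical data for illustration.
docs = np.array([
    [2.0, 0.0, 1.0],   # doc 0
    [1.0, 3.0, 0.0],   # doc 1
    [0.0, 1.0, 2.0],   # doc 2
])
query = np.array([2.0, 1.0, 1.0])

def cosine_sim(a, b):
    """Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Retrieval view: a graded ranking of similarity, not a crisp A-vs-B decision.
scores = [cosine_sim(query, d) for d in docs]
ranking = np.argsort(scores)[::-1]   # most similar document first
```

A classifier would instead have to commit to a single category for `query`; here we only get an ordering by similarity.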
LSI/LSA is ultimately a dimensionality-reduction technique, and it is usually coupled with a nearest-neighbor algorithm to turn it into a classification system. In itself, it's only a way of "indexing" the data in a lower dimension using SVD.
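That pipeline can be sketched with numpy. The term-document matrix and the labels below are made up for illustration: the truncated SVD does the "indexing", and classification only appears once a nearest-neighbor step (with labels we supply ourselves) is bolted on.

```python
import numpy as np

# Hypothetical term-document matrix: rows = terms, columns = documents.
A = np.array([
    [3., 0., 1., 0.],
    [2., 0., 0., 1.],
    [0., 2., 0., 3.],
    [0., 3., 1., 2.],
])

# LSI step: truncated SVD projects each document into a k-dim "latent" space.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim row per document

# Classification step: 1-nearest-neighbor over the latent space,
# using labels that come from outside LSI itself (hypothetical here).
labels = ["sports", "politics", "sports", "politics"]

def classify(query_doc_index, train_idx):
    q = doc_vectors[query_doc_index]
    dists = [np.linalg.norm(q - doc_vectors[i]) for i in train_idx]
    return labels[train_idx[int(np.argmin(dists))]]
```

Note that `doc_vectors` alone makes no class decision; the SVD is the indexing, and the nearest-neighbor lookup is the classifier.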
Have you read about LSI on Wikipedia? It says it uses matrix factorization (SVD), which in turn is sometimes used in classification.
The primary distinction in machine learning is between "supervised" and "unsupervised" modeling.
Usually the words "statistical classification" refer to supervised models, but not always.
With supervised methods the training set contains a "ground-truth" label that you build a model to predict. At evaluation time, the goal is to predict the best guess at (or probability distribution of) the true label, which you will not have. There is often a performance metric, and it's quite clear what a right vs. wrong answer is.
Unsupervised classification methods attempt to cluster a large number of data points, which may vary in complicated ways, into a smaller number of "similar" categories. Data in each category ought to be similar in some kind of "interesting" or "deep" way. Since there is no "ground truth", you can't evaluate "right or wrong", only "more" vs. "less" interesting or useful.
Similarly, at evaluation time you can place a new example into one of the clusters (crisp classification), or give some kind of weighting quantifying how similar or different it looks from each cluster's "archetype".
So in some ways both supervised and unsupervised models can yield a "prediction", a prediction of a class/cluster label, but they are intrinsically different.
Often the goal of an unsupervised model is to provide more intelligent and powerfully compact inputs for a subsequent supervised model.
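That two-stage pattern can be sketched with synthetic data (all values below are made up): an unsupervised SVD step compresses the inputs without ever seeing the labels, and a small logistic regression is then trained on the compact representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic raw data: two classes in 20 dimensions, offset by +/- 2 per feature.
X = np.vstack([rng.normal(0.0, 1.0, (50, 20)) + 2.0,
               rng.normal(0.0, 1.0, (50, 20)) - 2.0])
y = np.array([1] * 50 + [0] * 50)

# Unsupervised step: SVD compresses 20 features to 2; no labels are used here.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T          # compact 2-D representation of each point

# Supervised step: logistic regression via gradient descent on the compact inputs.
w, b = np.zeros(2), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-np.clip(Z @ w + b, -30, 30)))
    w -= 0.1 * Z.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

accuracy = np.mean(((Z @ w + b) > 0) == (y == 1))
```

Because the class separation dominates the variance, the unsupervised projection preserves the signal and the downstream classifier separates the classes almost perfectly on this toy data.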