Latent Semantic Indexing
It is said that the matrices LSI produces (U, A and V) bring together documents that contain synonyms. For example, if we search for "car", we also get documents that contain "automobile". But LSI is nothing more than matrix manipulation: it only takes term frequency into account, not semantics. So what is behind this magic that I am missing? Please explain.
Comments (2)
LSI basically creates a frequency profile of each document and looks for documents with similar frequency profiles. If the rest of the frequency profile is similar enough, it will classify two documents as fairly similar, even if one of them systematically substitutes some words. Conversely, if the frequency profiles differ, it can classify documents as different even if they share frequent use of a few specific terms (e.g., "file" referring to something on a computer in some documents, and to a tool used to cut and smooth metal in others).
LSI is also typically used with relatively large collections of documents, and the other documents help in finding similarities as well. Even if documents A and B look substantially different, a document C that uses quite a few terms from both A and B can help reveal that A and B are really fairly similar.
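As a rough illustration of the point about document C, here is a minimal sketch using NumPy's SVD on a toy corpus. The five documents, the vocabulary, and the choice of k = 2 are made up for this example and are not from the question; real LSI setups usually also apply a weighting such as tf-idf, which is skipped here.

```python
import numpy as np

# Hypothetical toy corpus: A and B share no words, C uses words from both.
docs = {
    "A": "car engine",
    "B": "automobile wheel",
    "C": "car automobile engine wheel",
    "D": "file disk",
    "E": "file folder",
}
terms = ["car", "automobile", "engine", "wheel", "file", "disk", "folder"]
names = list(docs)

# Term-document count matrix: rows are terms, columns are documents.
X = np.array(
    [[docs[d].split().count(t) for d in names] for t in terms],
    dtype=float,
)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Truncated SVD: keep only the k largest singular values (k is an assumption).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document

a, b, d = names.index("A"), names.index("B"), names.index("D")
raw_ab = cosine(X[:, a], X[:, b])            # similarity on raw counts
lsi_ab = cosine(doc_vecs[a], doc_vecs[b])    # similarity in the latent space
lsi_ad = cosine(doc_vecs[a], doc_vecs[d])

print(f"raw cosine(A, B) = {raw_ab:.3f}")
print(f"LSI cosine(A, B) = {lsi_ab:.3f}")
print(f"LSI cosine(A, D) = {lsi_ad:.3f}")
```

In this toy corpus A and B have no words in common, so their similarity on raw counts is zero, yet in the two-dimensional latent space they point in the same direction because C ties their vocabularies together. D, which is about files, stays unrelated to A.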
According to the Wikipedia article, "LSI is based on the principle that words that are used in the same contexts tend to have similar meanings." That is, if two words seem to be used interchangeably, they might be synonyms.
This heuristic is not infallible.
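The same idea can be looked at from the term side. The sketch below is only an illustration under assumed data: the six documents and k = 2 are invented for the example, and plain counts are used without any weighting. It compares terms, rather than documents, before and after the rank reduction.

```python
import numpy as np

# Hypothetical corpus: "car" and "automobile" never occur together,
# but both occur alongside "engine" and "wheel".
docs = [
    "car engine wheel",
    "automobile engine wheel",
    "car engine",
    "automobile wheel",
    "file disk",
    "file folder",
]
terms = ["car", "automobile", "engine", "wheel", "file", "disk", "folder"]

# Term-document count matrix: rows are terms, columns are documents.
X = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # number of latent dimensions kept (an assumption for this toy case)
term_vecs = U[:, :k] * s[:k]  # one k-dimensional row per term

car, auto = terms.index("car"), terms.index("automobile")
wheel, file_ = terms.index("wheel"), terms.index("file")

raw_car_auto = cosine(X[car], X[auto])
lsi_car_auto = cosine(term_vecs[car], term_vecs[auto])
lsi_car_wheel = cosine(term_vecs[car], term_vecs[wheel])
lsi_car_file = cosine(term_vecs[car], term_vecs[file_])

print(f"raw cosine(car, automobile) = {raw_car_auto:.3f}")
print(f"LSI cosine(car, automobile) = {lsi_car_auto:.3f}")
print(f"LSI cosine(car, wheel)      = {lsi_car_wheel:.3f}")
print(f"LSI cosine(car, file)       = {lsi_car_file:.3f}")
```

Here "car" and "automobile" never appear in the same document, yet they end up on the same direction in the latent space because they share the context words "engine" and "wheel". The sketch also places "car" right next to "wheel", which is related but not a synonym, so what the reduced space captures is shared context rather than strict synonymy.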