Latent Semantic Analysis Concepts
I've read about using Singular Value Decomposition (SVD) to do Latent Semantic Analysis (LSA) on a corpus of texts. I understand how to do that, and I also understand the mathematical concepts behind SVD.
But I don't understand why it works when applied to corpora of texts (I believe there must be a linguistic explanation). Could anybody explain this from a linguistic point of view?
Thanks
There is no linguistic interpretation: no syntax is involved, no handling of equivalence classes, synonyms, homonyms, stemming, etc. Nor are any semantics involved; it is just words occurring together.
Consider a "document" as a shopping cart: it contains a combination of words (purchases). And words tend to occur together with "related" words.
For instance, the word "drug" can occur together with any of {love, doctor, medicine, sports, crime}; each will point you in a different direction. But combined with many other words in the document, your query will probably find documents from a similar field.
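The "words occurring together" idea can be shown directly: represent each term by the set of documents it appears in, and compare those vectors. The occurrence vectors below are hypothetical, but they illustrate how co-occurrence alone, with no grammar or semantics, makes "drug" look closer to "doctor" than to "crime":

```python
from math import sqrt

# Hypothetical term-document occurrence vectors: entry j is 1 if the
# term appears in document j of a four-document corpus.
vectors = {
    "drug":     [1, 1, 1, 0],
    "doctor":   [1, 1, 0, 0],
    "medicine": [1, 0, 1, 0],
    "crime":    [0, 0, 1, 1],
}

def cosine(u, v):
    """Cosine similarity between two occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Terms sharing more documents get higher similarity.
print(cosine(vectors["drug"], vectors["doctor"]))  # shares 2 documents
print(cosine(vectors["drug"], vectors["crime"]))   # shares 1 document
```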
Words occurring together (i.e. nearby or in the same document in a corpus) contribute to context. Latent Semantic Analysis basically groups similar documents in a corpus based on how similar they are to each other in terms of context.
I think the example and the word-document plot on this page will help in understanding.
Suppose we have the following set of five documents
and a search query: dies, dagger.
Clearly, d3 should be ranked top of the list since it contains both dies and dagger. Then, d2 and d4
should follow, each containing a word of the query. However, what about d1 and d5? Should they be
returned as possibly interesting results to this query? As humans we know that d1 is quite related
to the query. On the other hand, d5 is not so much related to the query. Thus, we would like d1 but
not d5, or differently said, we want d1 to be ranked higher than d5.
The question is: Can the machine deduce this? The answer is yes, LSI does exactly that. In this
example, LSI will be able to see that the term dagger is related to d1 because it occurs together with
d1's terms Romeo and Juliet, in d2 and d3 respectively. Also, the term dies is related to d1 and d5
because it occurs together with d1's term Romeo and d5's term New-Hampshire, in d3 and d4
respectively. LSI will also weigh the discovered connections properly: d1 is more related to the query
than d5, since d1 is "doubly" connected to dagger through Romeo and Juliet, and also connected to
dies through Romeo, whereas d5 has only a single connection to the query, through New-Hampshire.
Reference: Latent Semantic Analysis (Alex Thomo)
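The LSI mechanics described above can be sketched in a few lines: factor the term-document matrix, fold the query into the same low-rank concept space, and rank documents by cosine similarity there. The incidence matrix below is a hypothetical reconstruction loosely following the discussion (which terms occur in which of d1..d5), not the paper's exact matrix.

```python
import numpy as np

# Hypothetical term-document incidence matrix (terms x documents),
# assumed from the discussion above; not Thomo's exact example.
# Rows: romeo, juliet, dagger, dies, new-hampshire; columns: d1..d5.
A = np.array([
    [1, 0, 1, 0, 0],   # romeo
    [1, 1, 0, 0, 0],   # juliet
    [0, 1, 1, 0, 0],   # dagger
    [0, 0, 1, 1, 0],   # dies
    [0, 0, 0, 1, 1],   # new-hampshire
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # rows of Vk = doc coordinates

# Fold the query "dies, dagger" into concept space: q_k = Sigma_k^-1 U_k^T q
q = np.array([0, 0, 1, 1, 0], dtype=float)
qk = (Uk.T @ q) / sk

# Rank documents by cosine similarity to the query in concept space.
sims = Vk @ qk / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(qk))
for j, sim in enumerate(sims, start=1):
    print(f"d{j}: {sim:.3f}")
```

Even in this toy setup, d3 (which contains both query terms) comes out on top; how the remaining documents rank depends on the exact matrix, which is why the answer stresses that LSI weighs the strength of each discovered connection.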