Latent Semantic Analysis concepts

Posted on 2024-11-29 23:20:49 | 157 words | 6 views | 0 comments

I've read about using Singular Value Decomposition (SVD) to perform Latent Semantic Analysis (LSA) on a corpus of texts. I understand how to do it, and I also understand the mathematics behind SVD.

But I don't understand why it works when applied to text corpora (I believe there must be a linguistic explanation). Could anybody explain this from a linguistic point of view?

Thanks

Comments (3)

江挽川 2024-12-06 23:20:49

There is no linguistic interpretation: no syntax is involved, and there is no handling of equivalence classes, synonyms, homonyms, stemming, etc. No semantics are involved either; it is just words occurring together.
Consider a "document" as a shopping cart: it contains a combination of words (purchases), and words tend to occur together with "related" words.

For instance, the word "drug" can occur together with any of {love, doctor, medicine, sports, crime}; each will point you in a different direction. But combined with the many other words in the document, your query will probably find documents from a similar field.
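The shopping-cart picture can be made concrete with a few lines of counting. The sketch below is only an illustration: the four tiny "documents" and the helper name `neighbours` are made up for this example, and plain co-occurrence counting is the raw signal LSA starts from, not LSA itself.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy "shopping carts": each document is just a set of words.
carts = [
    {"drug", "doctor", "medicine"},
    {"drug", "crime", "police"},
    {"drug", "doctor", "patient"},
    {"drug", "sports", "doping"},
]

# Count how often each unordered pair of words shares a document.
cooc = Counter()
for cart in carts:
    for a, b in combinations(sorted(cart), 2):
        cooc[(a, b)] += 1


def neighbours(word):
    """Words seen in the same document as `word`, most frequent first."""
    counts = Counter()
    for (a, b), n in cooc.items():
        if a == word:
            counts[b] += n
        elif b == word:
            counts[a] += n
    return counts.most_common()


print(neighbours("drug"))
```

On its own, "drug" is linked to every field here; it is the other words in a cart that decide which neighbourhood a document falls into.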

烟凡古楼 2024-12-06 23:20:49

Words occurring together (i.e. nearby or in the same document in a corpus) contribute to context. Latent Semantic Analysis basically groups similar documents in a corpus based on how similar their contexts are.

I think the example and the word-document plot on this page will help in understanding.

清欢 2024-12-06 23:20:49

Suppose we have the following set of five documents:

  • d1 : Romeo and Juliet.
  • d2 : Juliet: O happy dagger!
  • d3 : Romeo died by dagger.
  • d4 : “Live free or die”, that’s the New-Hampshire’s motto.
  • d5 : Did you know, New-Hampshire is in New-England.

and a search query: dies, dagger.

Clearly, d3 should be ranked at the top of the list since it contains both dies and dagger. Then d2 and d4 should follow, each containing one word of the query. However, what about d1 and d5? Should they be returned as possibly interesting results for this query? As humans, we know that d1 is quite related to the query. On the other hand, d5 is not so much related to the query. Thus, we would like d1 but not d5, or, put differently, we want d1 to be ranked higher than d5.

The question is: can the machine deduce this? The answer is yes, LSI does exactly that. In this example, LSI will be able to see that the term dagger is related to d1 because it occurs together with d1’s terms Romeo and Juliet, in d3 and d2, respectively. Also, the term dies is related to d1 and d5 because it occurs together with d1’s term Romeo and d5’s term New-Hampshire in d3 and d4, respectively. LSI will also weigh the discovered connections properly: d1 is more related to the query than d5, since d1 is “doubly” connected to dagger through Romeo and Juliet, and also connected to die through Romeo, whereas d5 has only a single connection to the query through New-Hampshire.

Reference: Latent Semantic Analysis (Alex Thomo)
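The five-document example can be tried out with NumPy. This is a minimal sketch under several assumptions that are not stated in the quoted text: stop words are dropped, "died"/"dies"/"die" are all mapped to the single term `die`, the matrix holds raw 0/1 counts, the rank is fixed at k = 2, and both the query and the documents are projected onto the top-k left singular vectors before comparing them by cosine similarity. Other LSI conventions scale the axes differently and can give somewhat different scores.

```python
import numpy as np

# Assumed vocabulary after stop-word removal and stemming (die/died/dies -> die).
terms = ["romeo", "juliet", "happy", "dagger",
         "live", "die", "free", "new-hampshire"]
docs = {
    "d1": ["romeo", "juliet"],
    "d2": ["juliet", "happy", "dagger"],
    "d3": ["romeo", "die", "dagger"],
    "d4": ["live", "free", "die", "new-hampshire"],
    "d5": ["new-hampshire"],
}
query = ["die", "dagger"]

# Term-document matrix: rows are terms, columns are documents.
A = np.array([[1.0 if t in words else 0.0 for words in docs.values()]
              for t in terms])
q = np.array([1.0 if t in query else 0.0 for t in terms])

# Plain term overlap: documents sharing no word with the query score 0.
raw_overlap = dict(zip(docs, A.T @ q))

# Truncated SVD: keep only the k strongest latent directions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]

# Project documents and query into the same k-dimensional latent space.
doc_vecs = (Uk.T @ A).T      # shape (5, k)
q_vec = Uk.T @ q             # shape (k,)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


scores = {name: cosine(q_vec, v) for name, v in zip(docs, doc_vecs)}
ranking = sorted(scores, key=scores.get, reverse=True)

print("raw overlap:", raw_overlap)
print("LSI ranking:", ranking)
```

Under these assumptions, d1 and d5 both have zero raw overlap with the query, yet in the reduced space d3 comes out on top and d1 scores clearly above d5, which is the behaviour the quoted passage describes.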
