Probabilistic Latent Semantic Analysis/Indexing - Introduction
Recently I found this link quite helpful for understanding the principles of LSA without too much math: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html It forms a good basis on which I can build further.
Currently, I'm looking for a similar introduction to Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI): less math and more examples explaining the principles behind it. If you know of such an introduction, please let me know.
Can it be used to measure the similarity between sentences? Does it handle polysemy?
Is there a Python implementation of it?
Thank you.
Comments (1)
There is a good talk by Thomas Hofmann that explains both LSA and its connections to Probabilistic Latent Semantic Analysis (PLSA). The talk has some math, but is much easier to follow than the PLSA paper (or even its Wikipedia page).
PLSA can be used to get some similarity measure between sentences, as two sentences can be viewed as short documents drawn from a probability distribution over latent classes. Your similarity will depend heavily on your training set, though. The documents you use to train the latent class model should reflect the types of documents you want to compare. Generating a PLSA model from just two sentences won't create meaningful latent classes. Similarly, training on a corpus of very similar contexts may create latent classes that are overly sensitive to slight changes in the documents. Moreover, because sentences contain relatively few tokens (compared to documents), I don't believe you'll get high-quality similarity results from PLSA at the sentence level.
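To make the "sentences as short documents" idea concrete, here is a minimal pure-Python sketch, not a production implementation. It assumes the usual PLSA factorization P(w|d) = sum over z of P(z|d) * P(w|z), fits it with EM on a toy corpus, then "folds in" each sentence by re-estimating only its mixture over latent classes while the word distributions stay fixed. The function names (`train_plsa`, `fold_in`, `cosine`), the toy corpus, and the choice of cosine similarity between mixtures are all my own assumptions for illustration.

```python
import math
import random
from collections import Counter


def normalize(xs):
    """Scale non-negative numbers to sum to 1 (uniform if they are all zero)."""
    total = sum(xs)
    if total == 0:
        return [1.0 / len(xs)] * len(xs)
    return [x / total for x in xs]


def train_plsa(docs, n_topics, n_iter=50, seed=0):
    """Fit P(w|z) and P(z|d) by EM; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    # Random starting point; EM only finds a local optimum.
    p_w_z = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in docs]
    history = []  # log-likelihood before each update
    for _ in range(n_iter):
        acc_w_z = [dict.fromkeys(vocab, 0.0) for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in docs]
        loglik = 0.0
        for d, cnt in enumerate(counts):
            for w, n in cnt.items():
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                if total <= 0.0:
                    continue  # guard against floating-point underflow
                loglik += n * math.log(total)
                for z in range(n_topics):
                    resp = n * joint[z] / total
                    acc_w_z[z][w] += resp
                    acc_z_d[d][z] += resp
        # M-step: renormalize the expected counts.
        p_w_z = [dict(zip(vocab, normalize([acc[w] for w in vocab])))
                 for acc in acc_w_z]
        p_z_d = [normalize(row) for row in acc_z_d]
        history.append(loglik)
    return p_w_z, p_z_d, history


def fold_in(tokens, p_w_z, n_iter=30):
    """Estimate P(z|sentence) with P(w|z) held fixed; unknown words are ignored."""
    k = len(p_w_z)
    known = Counter(w for w in tokens if w in p_w_z[0])
    mix = [1.0 / k] * k
    for _ in range(n_iter):
        acc = [0.0] * k
        for w, n in known.items():
            joint = [mix[z] * p_w_z[z][w] for z in range(k)]
            total = sum(joint)
            if total <= 0.0:
                continue
            for z in range(k):
                acc[z] += n * joint[z] / total
        if sum(acc) == 0:
            break  # no usable words: keep the current mixture
        mix = normalize(acc)
    return mix


def cosine(a, b):
    """Cosine similarity between two latent-class mixtures."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy corpus, far too small for meaningful latent classes (see the caveats above).
docs = [
    "the bank approved the loan".split(),
    "the bank raised interest rates on the loan".split(),
    "the river bank flooded after the rain".split(),
    "rain filled the river and the lake".split(),
]
p_w_z, p_z_d, history = train_plsa(docs, n_topics=2)
s1 = fold_in("the loan interest".split(), p_w_z)
s2 = fold_in("rain on the river".split(), p_w_z)
print("similarity:", cosine(s1, s2))
```

With only a handful of tokens per sentence, the folded-in mixtures are noisy and the score shifts with the random seed and the training corpus, which is exactly the weakness described above.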
PLSA does not handle polysemy. However, if you are concerned with polysemy, you might try running a Word Sense Disambiguation tool over your input text to tag each word with its correct sense. Running PLSA (or LDA) over this tagged corpus will remove the effects of polysemy in the resulting document representations.
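As a rough illustration of the tagging step, the sketch below rewrites each ambiguous token as `word#sense` before the corpus goes to the topic model, so that the two senses become distinct vocabulary entries. The cue-word table `SENSE_CUES` and the function `tag_senses` are hypothetical stand-ins for a real Word Sense Disambiguation tool (for example, the Lesk implementation shipped with NLTK); only the idea of tagging before training comes from the answer.

```python
# Hypothetical toy sense inventory: cue words per sense, for illustration only.
SENSE_CUES = {
    "bank": {
        "finance": {"loan", "interest", "money"},
        "river": {"river", "water", "rain"},
    },
}


def tag_senses(tokens, sense_cues=SENSE_CUES):
    """Replace ambiguous tokens with 'word#sense' based on cue-word overlap."""
    context = set(tokens)
    tagged = []
    for tok in tokens:
        senses = sense_cues.get(tok)
        if not senses:
            tagged.append(tok)  # not in the inventory: leave as-is
            continue
        # Pick the sense whose cue words overlap most with the sentence.
        best, overlap = max(
            ((sense, len(cues & context)) for sense, cues in senses.items()),
            key=lambda pair: pair[1],
        )
        # Without any supporting cue word, keep the token untagged.
        tagged.append(f"{tok}#{best}" if overlap > 0 else tok)
    return tagged


print(tag_senses("the bank approved the loan".split()))
# ['the', 'bank#finance', 'approved', 'the', 'loan']
print(tag_senses("the river bank flooded".split()))
# ['the', 'river', 'bank#river', 'flooded']
```

The tagged token lists can then be passed to whatever PLSA or LDA trainer you use in place of the raw tokens.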
As Sharmila noted, Latent Dirichlet allocation (LDA) is considered the state of the art for document comparison, and is superior to PLSA, which tends to overfit the training data. In addition, there are many more tools to support LDA and analyze whether the results you get with LDA are meaningful. (If you're feeling adventurous, you can read David Mimno's two papers from EMNLP 2011 on how to assess the quality of the latent topics you get from LDA.)