Finding the similarity between two web pages using Python NLTK?
I want to find out whether two web pages are similar. Can someone suggest whether Python NLTK with its WordNet similarity functions would be helpful, and how? What is the best similarity function to use in this case?
Comments (2)
The SpotSigs paper mentioned by joyceschan addresses duplicate content detection, and it contains plenty of food for thought.
If you are looking for a quick comparison of key terms, the standard nltk functions might suffice.
With nltk you can pull synonyms of your terms by looking up the synsets contained in WordNet. It understands plurals, and it also tells you which part of speech each synonym corresponds to.
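As a rough illustration of the lookup described above, here is a minimal sketch. It assumes NLTK 3 with the WordNet corpus already downloaded (for example via nltk.download('wordnet')); the helper name describe_term is my own and not part of NLTK.

```python
def describe_term(term):
    """Return (synset name, part of speech, lemma names) for each sense of `term`.

    Hypothetical helper for illustration. The import is done lazily because it
    needs both the nltk package and the 'wordnet' corpus to be installed.
    """
    from nltk.corpus import wordnet

    # wordnet.synsets() normalizes inflected forms, so a plural such as
    # "gifts" is looked up under its base form "gift".
    return [(s.name(), s.pos(), s.lemma_names()) for s in wordnet.synsets(term)]
```

Calling describe_term("gifts") should list the senses of "gift", each with a part-of-speech tag such as "n" or "v" and the synonyms (lemma names) that share that sense.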
Synsets are stored in a tree with more specific terms at the leaves and more general ones toward the root. The more general terms are called hypernyms.
You can measure similarity by how close the terms are to their common hypernym.
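To make the idea concrete, here is a small pure-Python sketch of the textbook Wu-Palmer formula, 2 * depth(common hypernym) / (depth(a) + depth(b)), on a hand-made hypernym tree. The tree and all function names are invented for illustration; this is not WordNet data, and NLTK's own implementation may differ in details such as how depths are computed.

```python
# Toy hypernym tree: term -> its hypernym (parent). Hypothetical data, not WordNet.
HYPERNYM = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "poodle": "dog",
    "artifact": "entity",
    "car": "artifact",
}


def path_to_root(term):
    """Return [term, hypernym, ..., root]."""
    path = []
    while term is not None:
        path.append(term)
        term = HYPERNYM[term]
    return path


def depth(term):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(term))


def lowest_common_hypernym(a, b):
    """Deepest node that lies on both paths to the root."""
    ancestors_of_a = set(path_to_root(a))
    # Walking upward from b, the first shared node is the deepest one.
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("terms do not share a root")


def wu_palmer(a, b):
    """Textbook Wu-Palmer score: 1.0 for identical terms, lower for distant ones."""
    common = lowest_common_hypernym(a, b)
    return 2 * depth(common) / (depth(a) + depth(b))
```

In this toy tree, "dog" and "cat" meet at "animal" and score higher than "dog" and "car", which only meet at the root.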
Watch out for different parts of speech: according to the NLTK cookbook, they don't have overlapping paths, so you shouldn't try to measure similarity between them.
Say you have two terms, donation and gift. You can look them up with synsets, but in this example I initialized them directly. The cookbook recommends the Wu-Palmer Similarity method for comparing them.
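A minimal sketch of that comparison, assuming NLTK 3 with the WordNet corpus installed. The synset names donation.n.01 and gift.n.01 and the helper wup_between are my own choices for illustration; the part-of-speech guard reflects the warning above.

```python
def wup_between(name_a, name_b):
    """Wu-Palmer similarity between two synsets given by name, e.g. 'gift.n.01'.

    Hypothetical helper. Returns None when the two synsets have different
    parts of speech, since their hypernym paths are not comparable.
    """
    from nltk.corpus import wordnet  # lazy import: needs nltk and the 'wordnet' corpus

    a = wordnet.synset(name_a)  # initialize the synset directly by name
    b = wordnet.synset(name_b)
    if a.pos() != b.pos():
        return None
    return a.wup_similarity(b)
```

For example, wup_between("donation.n.01", "gift.n.01") gives a score between 0 and 1, where values closer to 1 mean the two concepts sit close together under a shared hypernym.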
This approach gives you a quick way to determine if the terms used correspond to related concepts. Take a look at Natural Language Processing with Python to see what else you can do to help your analysis of text.
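One possible way to turn term-level scores into a rough page-level signal is sketched below. This is only an assumption about how you might combine them, not something from the cookbook: the functions term_set_similarity and wordnet_noun_sim are hypothetical, only the first noun sense of each term is used, and extracting key terms from the pages is left to you.

```python
def term_set_similarity(terms_a, terms_b, sim):
    """Average, over terms_a, of the best sim() score against any term in terms_b.

    `sim` is any function taking two terms and returning a number in [0, 1].
    Note that this score is not symmetric in terms_a and terms_b.
    """
    terms_a, terms_b = list(terms_a), list(terms_b)
    if not terms_a or not terms_b:
        return 0.0
    total = 0.0
    for a in terms_a:
        total += max(sim(a, b) for b in terms_b)
    return total / len(terms_a)


def wordnet_noun_sim(a, b):
    """Wu-Palmer score between the first noun senses of two words (0.0 if unknown).

    Assumes NLTK 3 with the 'wordnet' corpus installed; imported lazily.
    """
    from nltk.corpus import wordnet

    syn_a = wordnet.synsets(a, pos="n")
    syn_b = wordnet.synsets(b, pos="n")
    if not syn_a or not syn_b:
        return 0.0
    # wup_similarity may return None when no common hypernym is found.
    return syn_a[0].wup_similarity(syn_b[0]) or 0.0
```

With key terms already extracted from each page, term_set_similarity(terms_page_1, terms_page_2, wordnet_noun_sim) gives a number you can threshold. Taking only the first sense is a crude shortcut, so treat the result as a hint and not a verdict.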
Consider implementing SpotSigs.