Sunspot / Solr / Lucene: Find similar articles
Let's say we have a list of articles that are indexed by sunspot/solr/lucene (or any other search engine).
How can the index be used to find articles similar to a given article?
Should this be done with a summarization tool, like:
http://www.wordsfinder.com/api_Keyword_Extractor.php, or termextract from http://developer.yahoo.com/yql/console, or http://www.alchemyapi.com/api/demo.html ?
Comments (2)
It seems you're looking for the MoreLikeThis feature.
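As a rough illustration of what a MoreLikeThis request to Solr can look like, here is a minimal Python sketch that only builds the request URL (it does not contact a server). The core name `articles`, the unique key field `id`, and the field names `title` and `body` are assumptions made for the example; the `mlt`, `mlt.fl`, `mlt.mintf`, `mlt.mindf` and `mlt.count` parameters are the ones Solr's MoreLikeThis component reads, but the values shown are arbitrary and would need tuning for a real index.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, fields, count=5, min_tf=1, min_df=1):
    """Build a Solr select URL that asks for documents similar to doc_id.

    base_url is the core URL (hypothetical here), fields are the indexed
    fields whose terms should drive the similarity comparison.
    """
    params = {
        "q": f"id:{doc_id}",          # assumes the unique key field is "id"
        "mlt": "true",                # enable the MoreLikeThis component
        "mlt.fl": ",".join(fields),   # fields used to pick "interesting" terms
        "mlt.mintf": min_tf,          # minimum term frequency in the source doc
        "mlt.mindf": min_df,          # minimum document frequency in the index
        "mlt.count": count,           # number of similar documents to return
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_mlt_url("http://localhost:8983/solr/articles", 42, ["title", "body"])
print(url)
```

If you are using Sunspot rather than talking to Solr directly, check its documentation for the equivalent more-like-this helper; the underlying Solr parameters are the same.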
What you are trying to do is very similar to the task I outlined in this answer.
In brief, you need to generate a summary for each document that you can use as a query to compare it with every other document. A document summary could be as simple as the top N terms in that document (excluding stop words). You can generate the top N terms from a Lucene document pretty easily without using any 3rd party tools; there are plenty of examples on SO and the web showing how to do this.
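The "top N terms as a summary" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Lucene approach itself: it counts raw term frequency in the text instead of reading term vectors from the index, the stop-word list is a tiny hand-picked sample, and the field name `body` and the helper names `top_terms` and `summary_query` are hypothetical.

```python
import re
from collections import Counter

# A tiny sample stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "it",
              "of", "on", "or", "that", "the", "to", "with"}


def top_terms(text, n=10):
    """Return the n most frequent non-stop-word terms in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 1)
    return [term for term, _ in counts.most_common(n)]


def summary_query(text, field="body", n=10):
    """Turn the top terms into an OR query string against one field."""
    return " OR ".join(f"{field}:{term}" for term in top_terms(text, n))


article = "Solr indexes articles. Solr and Lucene search articles with Solr."
print(top_terms(article, 2))
print(summary_query(article, "body", 2))
```

The resulting query string can then be run against the index, and the highest-scoring hits (other than the source article itself) are the candidate similar articles.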