Getting Lucene to return only unique threads (indexing both threads and posts)
I have a StackOverflow-like system where content is organised into threads, each thread having content of its own (the question body / text), and posts / replies.
I'm adding the ability to search this content via Lucene, and if possible I would like to index individual posts rather than entire threads (it makes the index easier to update, and gives me more control and ability to tweak the results). The problem, however, is that I want the search to display a list of threads rather than a list of posts.
How can I get Lucene to return only unique threads as results, while also searching the content of the posts?
Comments (3)
Each document can have a "threadId" field. After running a search, you can loop through your result set and return all the unique threadIds.
The tricky part is specifying how many results to request. If you want to show, say, 10 results on your results page, you'll probably need Lucene to return 10 + m results, since a certain percentage of the returned set will be de-duplicated away because they are posts belonging to the same thread. You'll need some extra logic that runs another Lucene search if the de-duplicated set has fewer than 10 entries.
This is what the Nutch project does when collapsing multiple search results that belong to the same domain.
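The over-fetch-and-retry logic described above could be sketched roughly as follows. This is plain Java with no Lucene dependency: `Hit`, `uniqueThreads`, and the `topN` callback are hypothetical names invented for illustration, with `topN` standing in for whatever call asks Lucene for the top n post hits, best first. The doubling of the fetch size is just one possible choice for m.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

public class ThreadDedup {

    /** Hypothetical stand-in for one search hit: a post and the thread it belongs to. */
    public static final class Hit {
        public final String postId;
        public final String threadId;

        public Hit(String postId, String threadId) {
            this.postId = postId;
            this.threadId = threadId;
        }
    }

    /**
     * Returns up to pageSize unique thread ids, in the rank order of their best post.
     *
     * topN is an assumed callback: given n, it returns at most n post hits,
     * best first (in a real system it would wrap a Lucene search).
     */
    public static List<String> uniqueThreads(IntFunction<List<Hit>> topN, int pageSize) {
        int fetch = pageSize * 2; // initial over-fetch, i.e. "10 + m"
        while (true) {
            List<Hit> hits = topN.apply(fetch);

            // LinkedHashSet drops duplicate thread ids but keeps first-seen order.
            Set<String> threads = new LinkedHashSet<>();
            for (Hit hit : hits) {
                threads.add(hit.threadId);
                if (threads.size() == pageSize) {
                    break;
                }
            }

            // Stop when the page is full, or when the search returned fewer
            // hits than requested (there is nothing more to fetch).
            if (threads.size() == pageSize || hits.size() < fetch) {
                return new ArrayList<>(threads);
            }
            fetch *= 2; // not enough unique threads yet: search again, deeper
        }
    }
}
```

Each retry re-runs the whole search with a larger limit, which is simple but wasteful when one thread dominates the top of the ranking.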
When you index the threads, you should break each thread into posts and make each post a Document with a field containing a unique ID identifying the thread to which it belongs. When you implement the search, I would recommend using Lucene 2.9 or later, which enables you to use a Collector. Collectors let you preprocess the retrieved documents, so you can group together posts that originate from the same thread ID.
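To make the grouping idea concrete, here is a minimal sketch of the bookkeeping such a collector might do. It is plain Java and deliberately does not implement Lucene's actual Collector class: `ThreadGroupingCollector`, `Group`, `collect`, and `topThreads` are hypothetical names, and the sketch assumes the caller already knows each hit's document id, score, and thread id (in a real collector the thread id would have to be looked up from the indexed field).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ThreadGroupingCollector {

    /** All matching posts of one thread, plus the best score among them. */
    public static final class Group {
        public final String threadId;
        public float bestScore = Float.NEGATIVE_INFINITY;
        public final List<Integer> postDocIds = new ArrayList<>();

        Group(String threadId) {
            this.threadId = threadId;
        }
    }

    private final Map<String, Group> groups = new HashMap<>();

    /** Called once per matching post; files the post under its thread. */
    public void collect(int docId, float score, String threadId) {
        Group group = groups.computeIfAbsent(threadId, Group::new);
        group.postDocIds.add(docId);
        group.bestScore = Math.max(group.bestScore, score);
    }

    /** Returns up to n threads, ranked by the score of their best post. */
    public List<Group> topThreads(int n) {
        List<Group> sorted = new ArrayList<>(groups.values());
        sorted.sort(Comparator.comparingDouble((Group g) -> g.bestScore).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```

Ranking a thread by its best post is only one option; summing or averaging the post scores would favour threads with many matching replies instead.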
Just for completeness, recent Lucene versions (from 3.2 onwards) support a grouping API that is very useful for this kind of use case:
http://lucene.apache.org/java/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html