Can Lucene 3.0 in Java return real-time search results sorted by a frequently updated field?
Consider the following assumptions:
- I have a Java 5.0 web application for which I'm considering using Lucene 3.0 for full-text search
- There will be more than 1000K (one million) Lucene documents, each with about 100 words on average
- New documents must be searchable immediately after they are created (real-time search)
- Each Lucene document has a frequently updated integer field named quality
Where can I find code examples (simple but as complete as possible) of near-real-time search with Lucene 3.0?
Is it possible to obtain query results sorted by a document field (quality) that may be updated frequently (for already-indexed documents)? Will such a field update trigger a rebuild of the Lucene index? How does such rebuilding perform, and how can it be done efficiently? I need some examples or documentation of a complete solution.
If, however, rebuilding the index is not strictly necessary in this case, how can search results be sorted efficiently? Some queries may return a large number of documents (>50K), so I consider it inefficient to fetch them unsorted from Lucene, sort them by the quality field, and finally split the sorted list into pages for pagination.
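For reference, a minimal sketch of both operations in the Lucene 3.0 API — replacing a document to update its quality field, and letting Lucene sort by that field so that only one page of hits is materialized. The index path, field names, and values are illustrative; error handling is omitted:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class QualitySortSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/index")); // illustrative path
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // Lucene has no in-place field update: "updating" quality means
        // re-adding the whole document. updateDocument deletes the old
        // version (matched by the id term) and indexes the new one.
        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", "some text", Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field("quality", "17", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.updateDocument(new Term("id", "42"), doc);
        writer.commit();

        // Let Lucene sort: only the top N hits are materialized,
        // even if the query matches 50K+ documents.
        IndexSearcher searcher = new IndexSearcher(writer.getReader());
        Sort byQuality = new Sort(new SortField("quality", SortField.INT, true)); // descending
        TopFieldDocs top = searcher.search(
                new TermQuery(new Term("body", "text")), null, 20, byQuality);
        for (ScoreDoc sd : top.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("quality"));
        }

        searcher.close();
        writer.close();
    }
}
```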
Is Lucene 3.0 my best choice within Java, or should I consider other frameworks/solutions? Maybe the full-text search provided by the SQL server itself (I'm using PostgreSQL 8.3)?
2 Answers
The Lucene API is capable of everything you're asking, but it won't be easy. It's a fairly low-level API, and making it do complicated things is quite an exercise in itself.
I can highly recommend Compass, which is a search/indexing framework built on top of Lucene. Besides a much friendlier API, it provides features such as object/XML/JSON mapping to Lucene indexes, as well as fully transactional behaviour. It should have no trouble with your requirements, such as realtime sorting of transactionally-updated documents.
Compass 2.2.0 is built upon Lucene 2.4.1, but a Lucene 3.0-based version is in the works. It's sufficiently abstracted from the Lucene API that the transition should be seamless, though.
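For a flavour of the API, here is a rough sketch of Compass 2.x usage with annotation-based mapping. The Article class, its fields, and the connection string are illustrative, not taken from the answer:

```java
import org.compass.annotations.Searchable;
import org.compass.annotations.SearchableId;
import org.compass.annotations.SearchableProperty;
import org.compass.core.Compass;
import org.compass.core.CompassHits;
import org.compass.core.CompassSession;
import org.compass.core.CompassTransaction;
import org.compass.core.config.CompassConfiguration;

@Searchable
class Article {
    @SearchableId
    Long id;
    @SearchableProperty
    String body;
    @SearchableProperty
    int quality;
}

public class CompassSketch {
    public static void main(String[] args) {
        Compass compass = new CompassConfiguration()
                .setConnection("file://target/index") // illustrative location
                .addClass(Article.class)
                .buildCompass();

        CompassSession session = compass.openSession();
        CompassTransaction tx = session.beginTransaction();

        Article article = new Article();
        article.id = 42L;
        article.body = "some text";
        article.quality = 17;
        session.save(article); // indexes (or re-indexes) the object transactionally

        CompassHits hits = session.find("body:text");
        System.out.println(hits.length() + " hits");

        tx.commit();
        session.close();
        compass.close();
    }
}
```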
Near-real-time search has been available in Lucene since 2.9. Lucid Imagination has an article about this capability (written before the 2.9 release). The basic idea is that you can now get an IndexReader from the IndexWriter. If you refresh this IndexReader at regular intervals, you get the most up-to-date changes from the IndexWriter.
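A minimal sketch of that getReader/reopen pattern in Lucene 2.9/3.0 (directory choice and refresh interval are illustrative; error handling omitted):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NrtSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // The NRT reader sees documents added to the writer,
        // even before commit() flushes them to the directory.
        IndexReader reader = writer.getReader();

        // ... other threads call writer.addDocument(...) ...

        // Refresh on a schedule (say, once a second), not per query:
        IndexReader newReader = reader.reopen(); // cheap no-op if nothing changed
        if (newReader != reader) {
            reader.close(); // release the stale reader
            reader = newReader;
        }
        IndexSearcher searcher = new IndexSearcher(reader);
        // ... run queries against searcher ...

        searcher.close();
        reader.close();
        writer.close();
    }
}
```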
Update: I haven't seen any published example code, but here is the broad idea.
All the new documents will be written to an IndexWriter, preferably one created over a RAMDirectory, which is not closed frequently. (To persist this in-memory index, you may have to flush it to disk occasionally.) You will also have some indexes on disk, on which individual IndexReaders will be created. A MultiReader and a Searcher can then be created on top of these Readers; one of the Readers will be the one over the in-memory index.
At regular intervals (say, every few seconds), you will remove the current Reader from the MultiReader, get a new Reader from the IndexWriter, and construct a new MultiReader/Searcher with the new set of Readers.
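A rough sketch of that periodic swap, assuming one long-lived on-disk reader and a writer over a RAMDirectory (class and member names are hypothetical; closing of superseded readers is simplified):

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class SearcherHolder {
    private final IndexWriter ramWriter;   // writer over the RAMDirectory
    private final IndexReader diskReader;  // long-lived reader over the on-disk index
    private volatile IndexSearcher searcher;

    public SearcherHolder(IndexWriter ramWriter, File diskIndex) throws Exception {
        this.ramWriter = ramWriter;
        this.diskReader = IndexReader.open(FSDirectory.open(diskIndex), true); // read-only
        refresh();
    }

    // Call from a scheduled task every few seconds.
    public synchronized void refresh() throws Exception {
        IndexReader ramReader = ramWriter.getReader(); // fresh view of in-memory docs
        MultiReader multi = new MultiReader(new IndexReader[] { diskReader, ramReader });
        searcher = new IndexSearcher(multi);
        // In production the previous reader/searcher must be closed once
        // in-flight queries finish; reference counting is omitted here.
    }

    public IndexSearcher getSearcher() {
        return searcher;
    }
}
```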
According to the article from Lucid Imagination (linked above), they tried writing 50 documents per second without a heavy slowdown.