In Lucene, can I search one index but use the IDF from another one?
I'm building a system where I want to show only results indexed in the past few days.
Furthermore, I don't want to maintain a giant index with a million documents if I only want to return results from a couple of days (thousands of documents).
On the other hand, my system relies heavily on the terms in the indexed documents having a realistic distribution (and consequently a realistic IDF).
That said, I would like to use a small index to return results, but compute document scores using an IDF taken from a much larger index (or even an external source).
The Similarity API doesn't seem to allow me to do this: the idf method does not receive the term being used as a parameter.
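To make the limitation concrete, here is a minimal sketch of the classic Lucene-style IDF formula, written as a plain function of a document frequency and a document count. It assumes the `1 + ln(numDocs / (docFreq + 1))` form used by Lucene's default similarity. `IdfSketch` and the numbers in `main` are hypothetical and exist only for illustration. Because the function sees two counts and never the term itself, any substitution of statistics has to happen wherever those counts are produced.

```java
// Sketch only: a stand-alone illustration, not a real Lucene class.
final class IdfSketch {

    /** Classic Lucene-style idf, assumed here to be 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Hypothetical counts: the same term looks rare in a small recent index
        // but fairly common in a large reference index.
        double smallIdf = idf(9, 1_000);           // about 5.61
        double largeIdf = idf(99_999, 1_000_000);  // about 3.30
        System.out.printf("small index idf = %.2f%n", smallIdf);
        System.out.printf("large index idf = %.2f%n", largeIdf);
    }
}
```

With these made-up counts the small index overstates how rare the term is, which is the distortion described above.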
Another possibility is to use a TrieRangeQuery to make sure the documents shown are from the last couple of days. Again, I would rather not maintain a larger index. Also, this kind of query is not cheap.
Comments (1)
You should be able to extend IndexReader and override the docFreq() methods to provide whatever values you'd like. One thing this implementation can do is open two IndexReader instances -- one for the small index and one for the large index. All the methods are delegated to the small IndexReader, except for docFreq(), which is delegated to the large index. You'll need to scale the value returned, i.e.
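The answer breaks off at this point, so the exact scaling it had in mind is not known. As one plausible reading, the sketch below scales the large index's document frequency by the ratio of the two document counts, so that it stays consistent with the small index's `numDocs()`. To keep it self-contained it does not use Lucene: `TermStats`, `BorrowedDocFreqStats`, and `MapStats` are hypothetical stand-ins for the two `IndexReader` methods involved, and the rounding and clamping choices are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the two IndexReader methods this sketch needs. */
interface TermStats {
    int docFreq(String term);
    int numDocs();
}

/** Reports numDocs from the small index but docFreq borrowed from the large one, scaled down. */
final class BorrowedDocFreqStats implements TermStats {
    private final TermStats small;
    private final TermStats large;

    BorrowedDocFreqStats(TermStats small, TermStats large) {
        this.small = small;
        this.large = large;
    }

    @Override
    public int numDocs() {
        return small.numDocs(); // everything except docFreq comes from the small index
    }

    @Override
    public int docFreq(String term) {
        int largeDocs = large.numDocs();
        if (largeDocs == 0) {
            return small.docFreq(term); // nothing to borrow from
        }
        // Assumed scaling: largeDocFreq * smallNumDocs / largeNumDocs, in floating point
        // to avoid int overflow, rounded, and clamped so it never exceeds numDocs().
        long scaled = Math.round((double) large.docFreq(term) * small.numDocs() / largeDocs);
        return (int) Math.min(scaled, (long) small.numDocs());
    }
}

/** Tiny in-memory TermStats used only to exercise the sketch. */
final class MapStats implements TermStats {
    private final Map<String, Integer> freqs = new HashMap<>();
    private final int numDocs;

    MapStats(int numDocs) {
        this.numDocs = numDocs;
    }

    MapStats with(String term, int docFreq) {
        freqs.put(term, docFreq);
        return this;
    }

    @Override
    public int docFreq(String term) {
        return freqs.getOrDefault(term, 0);
    }

    @Override
    public int numDocs() {
        return numDocs;
    }
}
```

For example, with a hypothetical large index of 1,000,000 documents in which a term occurs in 50,000, and a small index of 2,000 documents, the wrapper reports a document frequency of 100 for that term, whatever its raw count in the small index. In Lucene itself the same idea would live in an `IndexReader` subclass that delegates to the two underlying readers, as the answer describes.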