How to detect whether a similar document is already stored in a Lucene index
I need to exclude duplicates from my database. The problem is that the duplicates are not exact matches but similar documents. For this purpose I decided to use FuzzyQuery, as follows:
    var fuzzyQuery = new global::Lucene.Net.Search.FuzzyQuery(
        new Term("text", queryText),
        0.8f,
        0);
    hits = _searcher.Search(fuzzyQuery);
The idea was to set the minimum similarity to 0.8 (which I think is high enough) so that only sufficiently similar documents are found and the rest are excluded.
To test this code, I checked whether it finds a document that already exists: I assigned to the variable queryText a value that is stored in the index. The code above found nothing; in other words, it does not detect even an exact match.
The index was built with this code:
    doc.Add(new global::Lucene.Net.Documents.Field(
        "text",
        text,
        global::Lucene.Net.Documents.Field.Store.YES,
        global::Lucene.Net.Documents.Field.Index.TOKENIZED,
        global::Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
I followed the recommendations below, and the results are:
TermQuery doesn't return any results.
A query constructed with
    var _analyzer = new RussianAnalyzer();
    var parser = new global::Lucene.Net.QueryParsers
        .QueryParser("text", _analyzer);
    var query = parser.Parse(queryText);
    var _searcher = new IndexSearcher
        (Settings.General.Default.LuceneIndexDirectoryPath);
    var hits = _searcher.Search(query);
returns several results: the document that is an exact match gets the maximum score, followed by several other documents with similar content.
Comments (3)
It might help to look inside the index: that will show clearly what data you are querying against and how Lucene 'sees' your data. You can use Luke for this. It has some known compatibility issues with Lucene.NET, but it is much better than nothing.
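If Luke is not an option, roughly the same information can be dumped from code. What follows is only a minimal sketch, assuming the Lucene.Net 2.x API that the question's snippets appear to use (method-style accessors such as Term() and Length()); the class name, the helper names, the sample text and the use of StandardAnalyzer with an in-memory index are all made up for illustration. It lists the terms stored for the "text" field and then compares a TermQuery built from the whole text with a query produced by QueryParser.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper class, not part of Lucene.Net.
public static class IndexInspectionSketch
{
    // Builds a throwaway in-memory index holding one tokenized document.
    public static Directory BuildIndex(string text)
    {
        var directory = new RAMDirectory();
        var writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        var doc = new Document();
        doc.Add(new Field("text", text, Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();
        return directory;
    }

    // Collects every term indexed for one field (similar to Luke's term list).
    public static List<string> ListTerms(IndexReader reader, string field)
    {
        var terms = new List<string>();
        // Terms(Term) positions the enumeration at the first term >= the given one.
        TermEnum termEnum = reader.Terms(new Term(field, ""));
        try
        {
            do
            {
                Term term = termEnum.Term();
                if (term == null || term.Field() != field)
                    break; // ran past the last term of this field
                terms.Add(term.Text());
            } while (termEnum.Next());
        }
        finally
        {
            termEnum.Close();
        }
        return terms;
    }

    public static void Main()
    {
        const string text = "the quick brown fox"; // made-up sample text
        Directory directory = BuildIndex(text);

        IndexReader reader = IndexReader.Open(directory);
        foreach (string term in ListTerms(reader, "text"))
            Console.WriteLine("indexed term: " + term);
        reader.Close();

        var searcher = new IndexSearcher(directory);

        // The whole text as ONE term: the index holds only single tokens,
        // so no indexed term equals it and nothing should match.
        Hits termHits = searcher.Search(new TermQuery(new Term("text", text)));

        // QueryParser runs the analyzer first, so the query is built from tokens.
        var parser = new QueryParser("text", new StandardAnalyzer());
        Hits parsedHits = searcher.Search(parser.Parse(text));

        Console.WriteLine("TermQuery hits:   " + termHits.Length());
        Console.WriteLine("QueryParser hits: " + parsedHits.Length());
        searcher.Close();
    }
}
```

If the term list of the real index likewise shows individual (analyzed) words rather than whole texts, that would be consistent with what the question reports: TermQuery and FuzzyQuery compare against single indexed terms, so a Term holding a complete document text has nothing to match, while the QueryParser route does find the document.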
I second the recommendation for Luke. A few other things to try:
Try the MoreLikeThis class in Lucene. It has some great heuristics encoded that would help you identify "similar" documents.
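A rough sketch of how that could look, with several assumptions stated up front: it assumes a Lucene.Net 2.x build whose contrib package exposes MoreLikeThis in the Lucene.Net.Search.Similar namespace with Java-style setters (SetFieldNames, SetMinTermFreq, SetMinDocFreq); other builds may use a different namespace or properties instead of setters, so check the version you have. The class name, helper names and sample texts are hypothetical.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Similar; // assumption: contrib namespace, may differ per build
using Lucene.Net.Store;

// Hypothetical helper class, not part of Lucene.Net.
public static class MoreLikeThisSketch
{
    // Builds a throwaway in-memory index, using the same field settings as the question.
    public static Directory BuildIndex(params string[] texts)
    {
        var directory = new RAMDirectory();
        var writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        foreach (string text in texts)
        {
            var doc = new Document();
            doc.Add(new Field("text", text, Field.Store.YES, Field.Index.TOKENIZED,
                              Field.TermVector.WITH_POSITIONS_OFFSETS));
            writer.AddDocument(doc);
        }
        writer.Close();
        return directory;
    }

    // Returns the documents that MoreLikeThis considers similar to document docNum.
    // The source document itself is normally part of the result.
    public static Hits FindSimilar(Directory directory, int docNum)
    {
        IndexReader reader = IndexReader.Open(directory);
        var mlt = new MoreLikeThis(reader);
        mlt.SetFieldNames(new[] { "text" });
        // Lowered thresholds so that a tiny demo index still yields query terms;
        // on a real index the defaults may be the better starting point.
        mlt.SetMinTermFreq(1);
        mlt.SetMinDocFreq(1);
        Query query = mlt.Like(docNum);
        // The reader stays open on purpose: Hits loads documents lazily.
        return new IndexSearcher(reader).Search(query);
    }

    public static void Main()
    {
        Directory directory = BuildIndex(
            "lucene detects similar documents in an index",
            "lucene detects similar documents in the index quickly",
            "a completely unrelated cooking recipe");

        Hits hits = FindSimilar(directory, 0);
        for (int i = 0; i < hits.Length(); i++)
            Console.WriteLine(hits.Score(i) + "  " + hits.Doc(i).Get("text"));
    }
}
```

The scores only rank the candidates. Deciding which of them count as duplicates still needs a cut-off that has to be tuned on real data, so treat this as a starting point rather than a finished duplicate detector.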