How to detect whether a similar document is already stored in a Lucene index

Posted on 2024-08-20 10:32:24

I need to exclude duplicates from my database. The problem is that the duplicates are not exact matches but merely similar documents. For this purpose I decided to use FuzzyQuery as follows:

var fuzzyQuery = new global::Lucene.Net.Search.FuzzyQuery(
                     new Term("text", queryText),
                     0.8f,   // minimum similarity
                     0);     // prefix length
 hits = _searcher.Search(fuzzyQuery);

The idea was to set the minimum similarity to 0.8 (which I think is high enough) so that only sufficiently similar documents are found and everything else is excluded.
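To make the 0.8 threshold concrete, here is a minimal, self-contained sketch that does not use Lucene.Net at all. It assumes the classic fuzzy similarity formula, 1 - editDistance / min(length), applied with a prefix length of 0. That formula is my understanding of how older Lucene versions score fuzzy matches, not something confirmed in this post. The names FuzzySketch, EditDistance and Similarity are hypothetical. Note that the comparison is between two single strings (terms), not between whole documents.

```csharp
using System;

static class FuzzySketch
{
    // Levenshtein edit distance, computed with two rolling rows.
    public static int EditDistance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // ASSUMPTION: similarity = 1 - distance / length of the shorter string
    // (prefix length 0). Illustrative only, not the Lucene.Net API.
    public static float Similarity(string a, string b)
    {
        int shorter = Math.Min(a.Length, b.Length);
        if (shorter == 0) return a.Length == b.Length ? 1f : 0f;
        return 1f - (float)EditDistance(a, b) / shorter;
    }

    static void Main()
    {
        // One edit over a shorter length of 8: 1 - 1/8 = 0.875, passes 0.8.
        Console.WriteLine(Similarity("document", "documents") >= 0.8f);
        // Two edits over a shorter length of 8: 1 - 2/8 = 0.75, fails 0.8.
        Console.WriteLine(Similarity("document", "documented") >= 0.8f);
    }
}
```

Under this assumption, a threshold of 0.8 tolerates roughly one edit per five characters of the shorter string.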

To test this code, I checked whether it finds a document that already exists: I assigned to the variable queryText a value that is stored in the index. The code above found nothing; in other words, it does not detect even an exact match.

The index was built with this code:

 doc.Add(new global::Lucene.Net.Documents.Field(
            "text",
            text,
            global::Lucene.Net.Documents.Field.Store.YES,
            global::Lucene.Net.Documents.Field.Index.TOKENIZED,
            global::Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));

I followed the recommendations below, and the results are:
TermQuery doesn't return any results.
A query constructed with

 var _analyzer = new RussianAnalyzer();
 var parser = new global::Lucene.Net.QueryParsers
                .QueryParser("text", _analyzer);
 var query = parser.Parse(queryText);
 var _searcher = new IndexSearcher
       (Settings.General.Default.LuceneIndexDirectoryPath);
 var hits = _searcher.Search(query);

returns several results: the document that matches exactly has the highest score, followed by several other documents with similar content.
