How to detect whether a similar document is already stored in a Lucene index

Posted 2024-08-20 10:32:24

I need to exclude duplicates from my database. The problem is that the duplicates are not exact matches but rather similar documents. For this purpose I decided to use FuzzyQuery, as follows:

 var fuzzyQuery = new global::Lucene.Net.Search.FuzzyQuery(
                      new Term("text", queryText),
                      0.8f,   // minimum similarity
                      0);     // prefix length
 hits = _searcher.Search(fuzzyQuery);

The idea was to set the minimum similarity to 0.8 (which I think is high enough), so that only similar documents are found and those that are not sufficiently similar are excluded.

To test this code, I decided to see whether it finds an already existing document. I assigned to the variable queryText a value that is stored in the index. The code above found nothing; in other words, it doesn't detect even an exact match.

The index was built with this code:

 doc.Add(new global::Lucene.Net.Documents.Field(
            "text",
            text,
            global::Lucene.Net.Documents.Field.Store.YES,
            global::Lucene.Net.Documents.Field.Index.TOKENIZED,
            global::Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
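
One likely reason the fuzzy search finds nothing, sketched under stated assumptions: FuzzyQuery compares its single query term against each indexed term by edit distance, and a TOKENIZED field stores individual analyzed words rather than the whole text. The plain C# below does not call Lucene at all. Its `Similarity` helper is a hypothetical stand-in that approximates the classic edit-distance score with a prefix length of 0, and the sample strings are made up for illustration.

```csharp
using System;

static class FuzzySimilaritySketch
{
    // Plain Levenshtein edit distance between two strings.
    static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Assumed approximation of the classic fuzzy score (prefix length 0):
    // 1 - editDistance / length of the shorter string.
    public static float Similarity(string queryTerm, string indexedTerm)
    {
        int shorter = Math.Min(queryTerm.Length, indexedTerm.Length);
        if (shorter == 0) return 0f;
        return 1f - (float)EditDistance(queryTerm, indexedTerm) / shorter;
    }

    static void Main()
    {
        // A single misspelled word against a single indexed word: high score.
        Console.WriteLine(Similarity("lucene", "lucent"));

        // A whole text used as one term against a single indexed word:
        // the score is negative, far below a 0.8 threshold.
        Console.WriteLine(Similarity("the quick brown fox", "quick"));
    }
}
```

If this reading is right, passing an entire document as the term of a FuzzyQuery cannot match a tokenized field, regardless of the threshold. An analyzer that stems or lowercases words would widen the gap further.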

I followed the recommendations from below, and the results are:
TermQuery doesn't return any results.
The query constructed with

 var _analyzer = new RussianAnalyzer();
 var parser = new global::Lucene.Net.QueryParsers
                .QueryParser("text", _analyzer);
 var query = parser.Parse(queryText);
 var _searcher = new IndexSearcher
       (Settings.General.Default.LuceneIndexDirectoryPath);
 var hits = _searcher.Search(query);

returns several results: the document with the exact match has the highest score, followed by several other documents with similar content.

Comments (3)

ㄖ落Θ余辉 2024-08-27 10:32:24

It might help to look inside the index: this will clearly show what data you're querying against and how Lucene 'sees' your data. You can use Luke for this. It has some known compatibility issues with Lucene.NET, but it is much better than nothing.
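
If Luke is not usable, a rough substitute is to list the indexed terms directly. The sketch below assumes the Lucene.Net 2.x API that the question's code appears to use (`IndexReader.Open(string)`, method-style `Term()` and `Field()`); `indexPath` and `limit` are hypothetical parameters, and member names differ in later releases.

```csharp
using System;
using Lucene.Net.Index;

static class TermDumpSketch
{
    // Prints up to `limit` terms stored for `field`, with their document frequency.
    // indexPath is a hypothetical placeholder for the index directory.
    public static void DumpTerms(string indexPath, string field, int limit)
    {
        IndexReader reader = IndexReader.Open(indexPath);
        try
        {
            // Positions the enumeration at the first term of the given field.
            TermEnum terms = reader.Terms(new Term(field, ""));
            try
            {
                int shown = 0;
                do
                {
                    Term term = terms.Term();
                    // Stop at the end of the index or when another field starts.
                    if (term == null || term.Field() != field) break;
                    Console.WriteLine("{0} (docFreq={1})", term.Text(), terms.DocFreq());
                    shown++;
                } while (shown < limit && terms.Next());
            }
            finally { terms.Close(); }
        }
        finally { reader.Close(); }
    }
}
```

Calling `TermDumpSketch.DumpTerms(path, "text", 50)` should show whether the field holds single analyzed words or whole strings, which is the information Luke would surface.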

熊抱啵儿 2024-08-27 10:32:24

I second the recommendation for Luke. A few other things to try:

  1. Try an exact query first, say a TermQuery on the "text" field. If this doesn't work, no fuzzy query will.
  2. Use Explain() to see how the scoring went (that is provided you get other hits).
  3. Follow the suggestions from Debugging Relevance Issues in Search.
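
A minimal sketch of steps 1 and 2, assuming the Lucene.Net 2.x `Hits` API that the question's code appears to use. `indexPath` and `singleIndexedTerm` are hypothetical inputs; the term should be one word exactly as the analyzer stored it, not the whole document text.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

static class ExactMatchCheckSketch
{
    // Runs an exact TermQuery on the "text" field and explains the top hits.
    public static void Check(string indexPath, string singleIndexedTerm)
    {
        var searcher = new IndexSearcher(indexPath);
        try
        {
            // Step 1: exact lookup of one indexed term.
            Query query = new TermQuery(new Term("text", singleIndexedTerm));
            Hits hits = searcher.Search(query);
            Console.WriteLine("hits: " + hits.Length());

            // Step 2: show how the score of each of the first few hits was computed.
            int top = Math.Min(hits.Length(), 5);
            for (int i = 0; i < top; i++)
            {
                Explanation explanation = searcher.Explain(query, hits.Id(i));
                Console.WriteLine(explanation.ToString());
            }
        }
        finally { searcher.Close(); }
    }
}
```

If this returns zero hits for a word known to be in a document, the stored terms probably differ from the raw word (for example through stemming or lowercasing), and a fuzzy query on the raw text has little chance of matching.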

没有伤那来痛 2024-08-27 10:32:24

Try the MoreLikeThis class in Lucene. It has some great heuristics encoded that would help you identify "similar" documents.
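
A hedged sketch of how this might look for checking a candidate text before inserting it. It assumes the contrib MoreLikeThis port for Lucene.Net 2.x; the namespace (`Lucene.Net.Search.Similar` here) and the setter-style members are assumptions that vary between releases, and `indexPath` and `candidateText` are hypothetical inputs.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Ru;      // RussianAnalyzer, as used in the question
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Similar;   // MoreLikeThis (contrib); namespace is an assumption

static class MoreLikeThisSketch
{
    // Lists indexed documents whose "text" field resembles candidateText.
    public static void FindSimilar(string indexPath, string candidateText)
    {
        IndexReader reader = IndexReader.Open(indexPath);
        var searcher = new IndexSearcher(reader);
        try
        {
            var mlt = new MoreLikeThis(reader);
            mlt.SetFieldNames(new[] { "text" });
            mlt.SetAnalyzer(new RussianAnalyzer()); // same analyzer as at index time
            mlt.SetMinTermFreq(1);  // defaults are tuned for long texts
            mlt.SetMinDocFreq(1);

            // Builds a query from the most characteristic terms of the candidate.
            Query query = mlt.Like(new StringReader(candidateText));
            Hits hits = searcher.Search(query);
            for (int i = 0; i < Math.Min(hits.Length(), 10); i++)
            {
                Console.WriteLine("{0:F3}  doc {1}", hits.Score(i), hits.Id(i));
            }
        }
        finally
        {
            searcher.Close();
            reader.Close();
        }
    }
}
```

The scores are only comparable within one result list, so any cutoff used to decide that a hit is a "duplicate" would have to be chosen empirically for the data at hand.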
