Exact phrase with special characters in Lucene.net



I have a problem doing a full-text search in Lucene.net where the search result contains special Lucene characters.

I have a field named "content" in my Lucene documents. This field is created as follows and contains the content of the indexed documents:

document.Add(new Field("content", fulltext, Field.Store.YES, Field.Index.ANALYZED));

For creating the index I'm using the StandardAnalyzer.

For querying the index I'm using the following code:

var queryParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "content", analyzer);
queryParser.SetAllowLeadingWildcard(true);
queryParser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
Query fullTextQuery = queryParser.Parse(queryString);

The query is then added to a BooleanQuery which is used to get the results from an IndexSearcher. I think the rest of the code is not that important, because it works as it should for 99% of the queries. I'm also using the StandardAnalyzer for querying the index.
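Roughly, the remaining wiring looks like this (a minimal sketch assuming the Lucene.Net 2.9 API; the directory variable and the MUST clause are illustrative, not taken from my actual code):

var booleanQuery = new BooleanQuery();
booleanQuery.Add(fullTextQuery, BooleanClause.Occur.MUST); // require the full-text match

var searcher = new IndexSearcher(directory, true); // read-only searcher
TopDocs results = searcher.Search(booleanQuery, 10); // top 10 hits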

Now here is the problem. Sometimes the "content" field of a document contains words that are joined with a "-", for example:

some text some text selector-lever some text some text

Now when I do a full-text search (exact phrase) for "selector lever", the query looks like this:

content:"selector lever"

The problem here is that the document containing the above text is also found, although it shouldn't be, because the two words are separated by a "-" and not by a blank.

I think it has something to do with the analyzer and the fact that "-" is a special character in Lucene.
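One way to check this is to print the tokens each analyzer emits for the problematic text. Here is a minimal sketch, assuming the attribute-based TokenStream API of Lucene.Net 2.9 (WhitespaceAnalyzer is included only for comparison):

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

class TokenDemo
{
    static void PrintTokens(Analyzer analyzer, string text)
    {
        // Walk the token stream and print every term the analyzer produces.
        TokenStream stream = analyzer.TokenStream("content", new StringReader(text));
        TermAttribute termAttr = (TermAttribute)stream.GetAttribute(typeof(TermAttribute));
        while (stream.IncrementToken())
            Console.Write("[" + termAttr.Term() + "] ");
        Console.WriteLine();
    }

    static void Main()
    {
        string text = "some text selector-lever some text";
        // StandardAnalyzer splits at the hyphen:
        // [some] [text] [selector] [lever] [some] [text]
        PrintTokens(new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), text);
        // WhitespaceAnalyzer keeps the hyphenated token whole:
        // [some] [text] [selector-lever] [some] [text]
        PrintTokens(new WhitespaceAnalyzer(), text);
    }
}

Because "selector" and "lever" end up as adjacent terms in the index, the phrase query "selector lever" matches the hyphenated text.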

Maybe someone can help me solve this problem.

Thanks in advance,
Martin


Comments (1)

祁梦 2024-12-06 00:27:38


You are right in thinking that the problem is the analyzer that you are using at index time.

From the Lucene javadocs for StandardTokenizer:

A grammar-based tokenizer constructed with JFlex

This should be a good tokenizer for most European-language documents:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

Therefore, in your case you would need to index your documents with a stricter Analyzer, such as the WhitespaceAnalyzer, which only splits on whitespace.
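A minimal end-to-end sketch (assuming the Lucene.Net 2.9 API; the RAMDirectory and the sample text are illustrative only), indexing and querying with the same WhitespaceAnalyzer:

using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

class WhitespaceDemo
{
    static void Main()
    {
        Analyzer analyzer = new WhitespaceAnalyzer();
        Directory directory = new RAMDirectory();

        // Index a document whose content contains the hyphenated term.
        var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        var document = new Document();
        document.Add(new Field("content", "some text selector-lever some text",
                               Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(document);
        writer.Close();

        // Query with the same analyzer; "selector-lever" stays a single token,
        // so the exact phrase no longer matches.
        var queryParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "content", analyzer);
        Query query = queryParser.Parse("\"selector lever\"");
        var searcher = new IndexSearcher(directory, true);
        TopDocs hits = searcher.Search(query, 10);
        // hits.TotalHits should now be 0 for this document.
    }
}

Keep in mind that WhitespaceAnalyzer performs no lowercasing and no stop-word removal, so searches become case-sensitive, and the existing index has to be rebuilt with the new analyzer for the change to take effect.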
