How to perform a "contains" search rather than a "starts with" search using Lucene.Net

Posted on 2024-10-29 05:22:43

We use Lucene.NET to implement a full-text search on a client's website. The search itself already works, but we now want to modify it.

Currently, a * is appended to every term, which leads Lucene to perform what I would classify as a StartsWith search.

In the future we would like to have a search that performs something like a Contains rather than a StartsWith.

We use

  • Lucene.Net 2.9.2.2
  • StandardAnalyzer
  • default QueryParser

Samples:

(Title:Orch*) matches: Orchestra

but:

(Title:rch*) does not match: Orchestra

We want the first and the second one to both match Orchestra.

Basically, I want the exact opposite of what was asked in this question; I'm not sure why Lucene performed a Contains rather than a StartsWith search by default for that person:
Why is this Lucene query a "contains" instead of a "startsWith"?

How can we make this happen?
I have the feeling it has something to do with the Analyzer but I'm not sure.

Comments (2)

枫林﹌晚霞¤ 2024-11-05 05:22:43

First off, I assume you're using StandardAnalyzer or something similar. The asker of the question you linked did not realize that Lucene searches for terms: in his case, a* matches "Fleet Africa" because the text is tokenized into "fleet" and "africa".
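
To make the point about terms concrete, here is a minimal Python sketch. It is a toy model, not Lucene code: `tokenize` and `prefix_match` are hypothetical helpers that only approximate what an analyzer such as StandardAnalyzer and a prefix query do.

```python
import re

def tokenize(text):
    """Rough stand-in for an analyzer: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def prefix_match(prefix, text):
    """Return True if any single term of the text starts with the prefix."""
    prefix = prefix.lower()
    return any(term.startswith(prefix) for term in tokenize(text))

print(tokenize("Fleet Africa"))           # ['fleet', 'africa']
print(prefix_match("a", "Fleet Africa"))  # True: the term "africa" starts with "a"
print(prefix_match("Orch", "Orchestra"))  # True
print(prefix_match("rch", "Orchestra"))   # False: no term starts with "rch"
```

The prefix is compared against each term separately, which is why a* looks like a "contains" match on a multi-word field while rch* finds nothing in "Orchestra".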

You need to call QueryParser.SetAllowLeadingWildcard(true) to be able to write queries like field:*value*. Are you actually changing the string that's passed to QueryParser?
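
As a rough illustration of what a pattern like *value* means at the term level, the following Python sketch matches wildcard patterns against a small, made-up list of terms. It uses `fnmatchcase` from the standard library as a stand-in for Lucene's wildcard matching (note that `fnmatch` also gives `[...]` a special meaning, which Lucene wildcards do not), and it lowercases the pattern only because the sample terms are lowercase.

```python
from fnmatch import fnmatchcase

def wildcard_match(pattern, terms):
    """Return the terms matching a pattern in which * matches any run of characters."""
    pattern = pattern.lower()
    return [t for t in terms if fnmatchcase(t, pattern)]

# Hypothetical term dictionary, as an analyzer might have produced it.
terms = ["orchestra", "orchid", "chest", "march"]

print(wildcard_match("Orch*", terms))  # ['orchestra', 'orchid']
print(wildcard_match("*rch*", terms))  # ['orchestra', 'orchid', 'march']
print(wildcard_match("rch*", terms))   # []
```

Only the pattern with a leading wildcard finds "orchestra" from the fragment "rch", which is why the parser has to be told to accept leading wildcards.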

You could parse the query as usual and then implement a QueryVisitor that rewrites every TermQuery into a WildcardQuery. That way you still support phrase searches.
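
The rewriting idea can be sketched in Python as follows. The classes below are simplified, hypothetical stand-ins that merely borrow the names of Lucene's query types; they are not the Lucene.Net API, and a real implementation would walk Lucene's own query objects instead.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TermQuery:
    field: str
    text: str

@dataclass
class WildcardQuery:
    field: str
    pattern: str

@dataclass
class PhraseQuery:
    field: str
    words: List[str]

@dataclass
class BooleanQuery:
    clauses: list

def rewrite(query):
    """Return a copy of the query tree with each TermQuery turned into *text*."""
    if isinstance(query, TermQuery):
        return WildcardQuery(query.field, f"*{query.text}*")
    if isinstance(query, BooleanQuery):
        # Recurse into nested clauses so terms at any depth are rewritten.
        return BooleanQuery([rewrite(c) for c in query.clauses])
    # Phrase queries (and anything else) are left untouched.
    return query

parsed = BooleanQuery([
    TermQuery("Title", "rch"),
    PhraseQuery("Title", ["fleet", "africa"]),
])
print(rewrite(parsed))
```

Because only single terms are rewritten, a quoted phrase in the user's input keeps its normal phrase semantics.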

I see no benefit in rewriting queries into prefix or wildcard queries. An orc or a chest has very little in common with an orchestra, yet a search for either word would match it. Instead, give your customer an analyzer that supports stemming and synonyms, and provide a spelling-correction feature to fix simple search mistakes.

自由如风 2024-11-05 05:22:43

@Simon Svensson probably gave the better answer (i.e. you don't need this), but if you do, you should use a Shingle Filter (or, for the character-level fragments described below, an n-gram filter).

Note that this will make your index much larger, since instead of storing just "orchestra" you will store "orc", "rch", "che", "hes"... But a plain term query with a leading wildcard will be very slow, because it essentially has to look through every single term in your corpus.
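
The trade-off can be seen in a small Python sketch of a trigram index. This is a toy model under stated assumptions, not how Lucene stores n-grams: `trigrams`, `build_index` and `contains_search` are hypothetical helpers, and the term list is made up.

```python
from collections import defaultdict

def trigrams(term):
    """Return the set of 3-character fragments of a term."""
    return {term[i:i + 3] for i in range(len(term) - 2)}

def build_index(terms):
    """Map every trigram to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in trigrams(term):
            index[gram].add(term)
    return index

def contains_search(index, fragment):
    """Find terms containing the fragment (at least 3 characters) via trigram lookups."""
    grams = trigrams(fragment)
    if not grams:
        raise ValueError("fragment must be at least 3 characters long")
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing every trigram does not guarantee a contiguous match, so verify.
    return sorted(t for t in candidates if fragment in t)

terms = ["orchestra", "orchid", "chest", "march"]
index = build_index(terms)

print(sorted(trigrams("orchestra")))  # ['che', 'est', 'hes', 'orc', 'rch', 'str', 'tra']
print(contains_search(index, "rch"))  # ['march', 'orchestra', 'orchid']
print(contains_search(index, "ches")) # ['chest', 'orchestra']
print(len(terms), len(index))         # 4 11
```

Each lookup touches only the entries for the fragment's trigrams instead of scanning every term, at the cost of an index that holds 11 entries for 4 words even in this tiny example.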
