Lucene query: bla~* (match words that begin with something fuzzy), how?
In the Lucene query syntax I'd like to combine * and ~ in a valid query similar to:
bla~* //invalid query
Meaning: Please match words that begin with "bla" or something similar to "bla".
Update:
What I do now, which works for small input, is to use the following (snippet of SOLR schema):
<fieldtype name="text_ngrams" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
In case you don't use SOLR, this does the following.
Index time: index data by creating a field containing all prefixes of my (short) input.
Search time: only use the ~ operator, as prefixes are explicitly present in the index.
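For reference outside SOLR, here is a minimal sketch of an equivalent index-time analyzer, assuming the Lucene 8.x/9.x analysis API (the WordDelimiter step is omitted for brevity, and constructor signatures differ between Lucene versions); the query-time analyzer would be the same chain without the edge n-gram filter:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;

// Index-time analyzer: whitespace tokenization, lowercasing, then front edge
// n-grams of 2..15 characters, so "blah" is indexed as "bl", "bla", "blah".
public class PrefixIndexAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        stream = new EdgeNGramTokenFilter(stream, 2, 15, false); // minGram, maxGram, preserveOriginal
        return new TokenStreamComponents(source, stream);
    }
}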
4 Answers
I do not believe Lucene supports anything like this, nor do I believe it has a trivial solution.
"Fuzzy" searches do not operate on a fixed number of characters.
bla~
may for example matchblah
and so it must consider the entire term.What you could do is implement a query expansion algorithm that took the query
bla~*
and converted it into a series of OR queriesBut that is really only viable if the string is very short or if you can narrow the expansion based on some rules.
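For illustration, a rough sketch of such an expansion against the Lucene 5+ query API (the single-edit generator and the field name are placeholders, and a real implementation would have to stay under BooleanQuery's maximum clause count):

import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class FuzzyPrefixExpansion {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // All strings at edit distance <= 1 from s (deletions, substitutions, insertions).
    static Set<String> oneEditVariants(String s) {
        Set<String> variants = new LinkedHashSet<>();
        variants.add(s);
        for (int i = 0; i < s.length(); i++) {
            variants.add(s.substring(0, i) + s.substring(i + 1));             // deletion
            for (char c : ALPHABET.toCharArray()) {
                variants.add(s.substring(0, i) + c + s.substring(i + 1));     // substitution
            }
        }
        for (int i = 0; i <= s.length(); i++) {
            for (char c : ALPHABET.toCharArray()) {
                variants.add(s.substring(0, i) + c + s.substring(i));         // insertion
            }
        }
        return variants;
    }

    // bla~* becomes a series of OR'd prefix queries: bla*, la*, bl*, blaa*, ...
    static Query expand(String field, String input) {
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        for (String variant : oneEditVariants(input)) {
            bq.add(new PrefixQuery(new Term(field, variant)), BooleanClause.Occur.SHOULD);
        }
        return bq.build();
    }
}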
Alternatively if the length of the prefix is fixed you could add a field with the substrings and perform the fuzzy search on that. That would give you what you want, but will only work if your use case is sufficiently narrow.
You don't specify exactly why you need this, perhaps doing so will elicit other solutions.
One scenario I can think of is dealing with different forms of words, e.g. finding car and cars. This is easy in English as there are word stemmers available. In other languages it can be quite difficult to implement word stemmers, if not impossible.
In this scenario you can however (assuming you have access to a good dictionary) look up the search term and expand the search programmatically to search for all forms of the word.
E.g. a search for cars is translated into car OR cars. This has been applied successfully for my language in at least one search engine, but is obviously non-trivial to implement.
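A toy sketch of that dictionary-driven expansion (the FORMS map is a stand-in for a real morphological dictionary, and the field name is a placeholder):

import java.util.List;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DictionaryExpansion {
    // Toy dictionary; a real one would map every inflected form to all forms of its lemma.
    static final Map<String, List<String>> FORMS = Map.of(
            "car", List.of("car", "cars"),
            "cars", List.of("car", "cars"));

    // "cars" -> car OR cars
    static Query expand(String field, String word) {
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        for (String form : FORMS.getOrDefault(word, List.of(word))) {
            bq.add(new TermQuery(new Term(field, form)), BooleanClause.Occur.SHOULD);
        }
        return bq.build();
    }
}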
It's for an address search service, where I want to suggest addresses based on partially typed and possibly mistyped streetnames/citynames/etc (any combination). (think ajax, users typing partial street addresses in a text field)
For this case the suggested query expansion is perhaps not so feasible, as the partial string (street address) may become longer than "short" :)
Normalization
One possibility I can think of is to use string "normalization", instead of fuzzy searches, and simply combine that with wildcard queries. A street address of "miklabraut 42, 101 reykjavík" would become "miklabrat 42 101 rekavik" when normalized.
So, building the index like this:
1) Build the index with records containing "normalized" versions of street names, city names etc., with one street address per document (1 or several fields).
And search the index like this:
2) Normalize the input strings (e.g. mikl reyk) used to form the queries (i.e. mik rek).
3) Use the wildcard op to perform the search (i.e. mik* AND rek*), leaving the fuzzy part out.
That would fly, provided the normalization algorithm is good enough :)
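A small sketch of that index/search flow, assuming the Lucene 5+ query API and a deliberately naive normalize() (the real rules that map "miklabraut"/"reykjavík" to "miklabrat"/"rekavik" would be language-specific and are not shown here):

import java.text.Normalizer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class NormalizedAddressSearch {
    // Naive normalization: lowercase, strip accents and punctuation, collapse spaces.
    // Step 1 would store normalize(fullAddress) in the indexed field; language-specific
    // rules (e.g. the collapsing seen in "miklabrat"/"rekavik") would be added here.
    static String normalize(String input) {
        return Normalizer.normalize(input.toLowerCase(), Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "")
                .replaceAll("[^a-z0-9 ]", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Steps 2 + 3: normalize the partial input and build "mik* AND rek*".
    static Query buildQuery(String field, String partialInput) {
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        for (String token : normalize(partialInput).split(" ")) {
            bq.add(new PrefixQuery(new Term(field, token)), BooleanClause.Occur.MUST);
        }
        return bq.build();
    }
}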
You mean you wish to combine a wildcard and fuzzy query? You could use a boolean query with an OR condition to combine, for example:
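A minimal sketch of that combination, assuming the Lucene 5+ API and a placeholder field name:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class WildcardOrFuzzy {
    static Query build(String field, String text) {
        Term term = new Term(field, text);                          // e.g. new Term("name", "bla")
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        bq.add(new PrefixQuery(term), BooleanClause.Occur.SHOULD);  // bla*
        bq.add(new FuzzyQuery(term), BooleanClause.Occur.SHOULD);   // bla~
        return bq.build();                                          // bla* OR bla~
    }
}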
In the development trunk of Lucene (not yet a release), there is code to support use cases like this, via AutomatonQuery. Warning: the APIs might/will change before it's released, but it gives you the idea.
Here is an example for your case:
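A sketch along those lines, written against the released org.apache.lucene.util.automaton API rather than the trunk version mentioned above (the field name and the edit distance of 1 are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.AutomatonQuery;
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.LevenshteinAutomata;
import org.apache.lucene.util.automaton.Operations;

public class FuzzyPrefixAutomaton {
    static AutomatonQuery build(String field, String text) {
        // Automaton accepting every string within edit distance 1 of the input term...
        Automaton fuzzy = new LevenshteinAutomata(text, false).toAutomaton(1);
        // ...followed by any suffix, i.e. "something similar to bla, then anything".
        Automaton fuzzyPrefix = Operations.concatenate(fuzzy, Automata.makeAnyString());
        return new AutomatonQuery(new Term(field, text), fuzzyPrefix);
    }
}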