Solr Dismax handler - whitespace and special character behaviour
I get strange results when my query contains special characters.
Here is my request:
q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
Parsed query:
<str name="parsedquery_toString">+((any:histoir any:franc)) ()</str>
I get 17,000 results because Solr is doing an OR (with mm=100% it should be an AND).
I have no problem when I use whitespace instead of a special character:
q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
<str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str>
This query returns 2,000 results.
Here is my schema.xml (relevant parts):
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
I even tried a PatternTokenizerFactory to tokenize on whitespace and special characters, but nothing changed.
My current workaround is to replace all special characters with whitespace before sending the query to Solr, but that is not satisfactory.
EDIT: Even with a charFilter (PatternReplaceCharFilterFactory) that replaces special characters with whitespace, it doesn't work.
First line of the analysis in the Solr admin, with verbose output, for query = 'histoire-france':
org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
text histoire france
The '-' is replaced by ' ', and the text is then tokenized by WhitespaceTokenizerFactory. However, I still get a different number of results for 'histoire-france' and 'histoire france'.
Did I miss something?
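For reference, the charFilter described in the EDIT would presumably be declared roughly as follows. This is a reconstruction from the analysis output above, not the original schema: the pattern and replacement are copied from that output (with `&` escaped as `&amp;` for XML), and its placement in the query analyzer is an assumption.

```xml
<analyzer type="query">
  <!-- Assumed reconstruction: replace each listed special character with a space before tokenizing -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([,;./\\'&amp;-])" replacement=" "/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- remaining filters as in the query analyzer of the schema above -->
</analyzer>
```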
Comments (4)
You get a different number of results for 'histoire-france' and 'histoire france' because the query parser creates a phrase query in the first case and a boolean query (two separate words) in the second.
This is not obvious behavior IMHO, but I believe it's hard to satisfy all use cases.
To make the search treat 'histoire-france' as simply two words, you can add solr.PositionFilterFactory to the end of the query analyzer.
Search results for 'histoire-france' and 'histoire france' will then be equal.
Note that the position filter can be undesirable for phrase searches (where both 'histoire' and 'france' have to be present). Consider using the query slop parameter qs > 0 instead if you have modified the term sequence with, say, an NGram filter.
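The configuration example that accompanied this answer did not survive on this page. A minimal sketch of the suggestion, assuming the Solr 3.x setup from the question (where solr.PositionFilterFactory is available) and reusing the question's query analyzer unchanged except for the last filter, might look like this:

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <!-- Added last: gives every token after the first a position increment of 0,
       so the query parser no longer sees the tokens as a sequence -->
  <filter class="solr.PositionFilterFactory"/>
</analyzer>
```

Only the query analyzer changes here; the index analyzer keeps its positions, so phrase matching at index time is unaffected.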
Using WhitespaceTokenizerFactory, Solr splits your query string into words. After tokenizing, however, Solr splits each word again into terms using solr.WordDelimiterFilterFactory. Look at the documentation, in particular the Wi-Fi example.
That could be one reason why 'histoire france' and 'histoire-france' are handled differently. Second, don't forget that DisMax (normally) handles the query term as a "term" and additionally as a parsed string again.
To solve your problem, you could try to avoid the word delimiter filter and handle tokenizing with PatternTokenizerFactory (as you tried before, but now without WordDelimiterFilterFactory). If that doesn't work, try to post the complete output of analysis.jsp.
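A rough sketch of that suggestion follows. The character class in the tokenizer pattern is an assumption, borrowed from the charFilter pattern shown in the question; the remaining filters are the question's own, minus WordDelimiterFilterFactory. The index analyzer would need the same tokenizer for the terms to line up. Whether this changes the result counts on the Solr version in question is not certain, since the asker reported no change with a pattern tokenizer.

```xml
<analyzer type="query">
  <!-- Split on runs of whitespace or the listed special characters (assumed character set) -->
  <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;./\\'&amp;-]+"/>
  <!-- No WordDelimiterFilterFactory here: the tokenizer does all the splitting -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
```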
It was a bug: https://issues.apache.org/jira/browse/SOLR-3589
It was fixed in Solr 4.1 (22 January 2013).
Set autoGeneratePhraseQueries to true; this makes Solr generate phrase queries.
So a search for histoire-france would generate a quoted query, which matches only documents that contain both words as a phrase.
Use query slop to specify the amount of slop allowed in a phrase query, e.g. qs=10.
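The working configuration this answer referred to is missing from this page. As a minimal sketch, assuming the fieldType from the question with only the autoGeneratePhraseQueries attribute flipped, it could look like this:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <!-- index and query analyzers unchanged from the question's schema.xml -->
</fieldType>
```

The request would then carry the slop alongside the question's other parameters, for example q=histoire-france&defType=dismax&qf=any^1.0&mm=100%&qs=10 (the value 10 is the answer's own example).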