Solr Dismax handler - whitespace and special character behavior

Posted on 2024-12-11 12:24:14

I get strange results when my query contains special characters.

Here is my request:

q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%

Parsed query:

<str name="parsedquery_toString">+((any:histoir any:franc)) ()</str>

I get 17000 results because Solr is doing an OR (with mm=100% it should be an AND).

I have no problem when I use whitespace instead of a special character:

q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%

<str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str>

This query returns 2000 results.

Here is my schema.xml (relevant parts):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

I even tried a PatternTokenizerFactory to tokenize on both whitespace and special characters, but nothing changed.

My current workaround is to replace all special characters with whitespace before sending the query to Solr, but that is not satisfying.
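For reference, that client-side workaround could look roughly like the sketch below. It assumes a Python client; the function name `normalize_query` is hypothetical, and the character class is copied from the PatternReplaceCharFilterFactory pattern shown further down. Note that it also strips a leading `-`, so it would break any query that relies on the dismax prohibit operator.

```python
import re

# Same character class as the charFilter pattern quoted in this question:
# comma, semicolon, dot, slash, backslash, apostrophe, ampersand, hyphen.
SPECIAL_CHARS = re.compile(r"[,;./\\'&-]")


def normalize_query(user_query: str) -> str:
    """Replace special characters with spaces, then collapse repeated whitespace."""
    spaced = SPECIAL_CHARS.sub(" ", user_query)
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(normalize_query("histoire-france"))  # prints: histoire france
```

The normalized string is then sent as the `q` parameter, so the dismax parser sees two whitespace-separated clauses and applies `mm` to them.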

EDIT: Even with a charFilter (PatternReplaceCharFilterFactory) that replaces special characters with whitespace, it doesn't work.

First line of the analysis in the Solr admin, with verbose output, for query = 'histoire-france':

org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
text    histoire france

The '-' is replaced by ' ', and the text is then tokenized by WhitespaceTokenizerFactory. However, I still get a different number of results for 'histoire-france' and 'histoire france'.

Did I miss something?

Comments (4)

酸甜透明夹心 2024-12-18 12:24:14

You get a different number of results for 'histoire-france' and 'histoire france' because the query parser creates a phrase query in the first case and a boolean query (two separate words) in the second.

This is not obvious behavior IMHO, but I believe it's hard to satisfy all use cases.

To make the search treat 'histoire-france' as simply two words, you can add solr.PositionFilterFactory to the end of the query analyzer, like this:

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory" />
  </analyzer>

The search results for 'histoire-france' and 'histoire france' will then be equal.

Note that the position filter may be undesirable for phrase searches (where both 'histoire' and 'france' must be present). If you have modified the term sequence with, say, an NGram filter, consider using the query slop parameter qs > 0 instead.

明媚殇 2024-12-18 12:24:14

With WhitespaceTokenizerFactory, Solr splits your query string into words.

But after tokenizing, Solr splits each word again into terms using solr.WordDelimiterFilterFactory. Look at the documentation, in particular the Wi-Fi example.

That could be one reason why 'histoire france' and 'histoire-france' are handled differently.

Second, don't forget that DisMax (normally) handles the query term as a "term" and additionally as a parsed string.

To solve your problem, you could try dropping the word delimiter filter and handling tokenization with PatternTokenizerFactory (as you tried before, but this time without WordDelimiterFilterFactory).

If that doesn't work, please post the complete output of analysis.jsp.

拥抱影子 2024-12-18 12:24:14

It was a bug: https://issues.apache.org/jira/browse/SOLR-3589

With edismax mm set to 100% if one of the tokens is split into two
tokens by the analyzer chain (i.e. "fire-fly" => fire fly), the mm
parameter is ignored and the equivalent of OR query for "fire OR fly"
is produced. This is particularly a problem for languages that do not
use white space to separate words such as Chinese or Japanese.

It was fixed in Solr 4.1 (22 January 2013).

与酒说心事 2024-12-18 12:24:14

Set autoGeneratePhraseQueries to true, and Solr will generate phrase queries.
A search for histoire-france then produces a quoted (phrase) query, which matches only documents that contain both words as a phrase.

<str name="parsedquery">(+DisjunctionMaxQuery(((any:histoire any:franc))))/no_coord</str>

Example working configuration:

<fieldType name="text_test" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Use the query slop parameter to set the slop of a phrase query, e.g. qs=10.

<str name="parsedquery">(+DisjunctionMaxQuery((any:"histoire france"~10)))/no_coord</str>
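As a rough illustration of passing qs along with the other dismax parameters from the question, the sketch below builds the request URL in Python. The base URL and the function name `build_dismax_url` are assumptions for the example, not taken from the thread; the parameter names and values (defType, qf, mm, qs, start, rows, sort) are the ones used above. Encoding the parameters also turns the literal `%` in `mm=100%` into `%25`, which a raw query string would otherwise leave ambiguous.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port and core path to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_dismax_url(user_query: str, query_slop: int = 0) -> str:
    """Build a dismax select URL with the parameters used in this thread."""
    params = {
        "q": user_query,
        "defType": "dismax",
        "qf": "any^1.0",
        "mm": "100%",      # urlencode escapes the percent sign as %25
        "qs": query_slop,  # slop applied to phrase queries
        "start": 0,
        "rows": 10,
        "sort": "score desc",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_dismax_url("histoire-france", query_slop=10))
```

Comparing the parsedquery output with debugQuery enabled for qs=0 and qs=10 should show whether the slop is actually applied to the generated phrase.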