Solr dismax phrase search
I'm building an application that uses Solr to match longer queries (typically complete sentences) against indexed documents that are almost always shorter (search terms). So my query looks like "should I buy a house now while the rates are low. We filed BR 2 yrs ago. Rent now, w/ some sch loan debt" and my indexed documents look like "buy a house" or "house loan rates".
I thought the right way to do this would be to use shingles, the dismax parser, and a highly boosted "pf" field. So I have a "normal" text field, kw_stopped (text_en in Solr 3.4), with a very aggressive stopword list, and a kw_phrases field that is meant to hold the phrase shingles. Its definition looks like this:
<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="8" outputUnigrams="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="8" outputUnigrams="false"/>
  </analyzer>
</fieldType>
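To make the shingle settings concrete, here is a rough Python sketch of the terms that `maxShingleSize="8"` with `outputUnigrams="false"` would emit for a short document. The `shingles` helper is hypothetical and only approximates ShingleFilterFactory: it assumes the default minimum shingle size of 2 and a single space as the token separator, and it skips the word delimiter, lowercase and synonym filters.

```python
def shingles(tokens, min_size=2, max_size=8):
    """Approximate ShingleFilterFactory with outputUnigrams="false".

    Assumption: minimum shingle size 2, tokens joined by one space.
    """
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # not enough tokens left for a shingle of this size
            out.append(" ".join(tokens[start:start + size]))
    return out


# A three-word document yields three shingle terms.
print(shingles("buy a house".split()))
# ['buy a', 'buy a house', 'a house']

# A single token yields no terms at all, because unigrams are suppressed.
print(shingles(["house"]))
# []
```

The second call is worth noting: any input that reaches this analyzer as one isolated word produces no terms for kw_phrases.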
My schema fields look like this:
<field name="kw_stopped" type="text_en" indexed="true" omitNorms="True" />
<!-- keywords almost as is - to provide truer match for full phrases -->
<field name="kw_phrases" type="shingle" indexed="true" omitNorms="True" />
My search handler config is this:
<requestHandler name="edismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.1</float>
    <str name="fl">
      keywords
    </str>
    <str name="mm">1</str>
    <str name="qf">
      kw_stopped^1.0 kw_phrases^5.0
    </str>
    <str name="pf">
      kw_phrases^50.0
    </str>
    <int name="ps">3</int>
    <int name="qs">3</int>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
When I turn on debugQuery, I notice that "kw_phrases" is never matched unless the query and the document are exactly the same. The parsedquery also shows that each token from the query appears as its own DisjunctionMaxQuery clause for "kw_stopped", but all the shingles are put into one giant clause for the kw_phrases field.
Where is the gap in my understanding? How can I make this work?
Thanks!
Vijay
Comments (1)
If you are using long sentences to search against shorter documents, you seem to be on the right track.
You will certainly need a good stopword list to keep general terms from matching, at both index time and query time.
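As a quick sanity check outside Solr, the rough Python sketch below compares the shingle terms of a long query with those of two short documents from the question. The `shingles` helper is hypothetical and only approximates ShingleFilterFactory with `outputUnigrams="false"` (assumed minimum size 2, maximum 8, space separator); the query is lowercased by hand, and nothing here models how edismax builds its qf or pf clauses.

```python
def shingles(tokens, min_size=2, max_size=8):
    """Hypothetical stand-in for ShingleFilterFactory without unigrams."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # ran out of tokens for this shingle size
            out.append(" ".join(tokens[start:start + size]))
    return out


# First sentence of the example query, lowercased and without punctuation.
query = "should i buy a house now while the rates are low"
docs = ["buy a house", "house loan rates"]

query_shingles = set(shingles(query.split()))

# For each document, list the shingle terms it shares with the query.
overlap = {
    doc: sorted(set(shingles(doc.split())) & query_shingles)
    for doc in docs
}

for doc, shared in overlap.items():
    print(doc, "->", shared)
# buy a house -> ['a house', 'buy a', 'buy a house']
# house loan rates -> []
```

Under these assumptions, "buy a house" shares every one of its shingle terms with the query, so a clause that only needs some of the query shingles to match could find it, while a clause that needs the whole query to line up with the document would not. "house loan rates" shares no shingles with this sentence at all, so only the single-word kw_stopped field could match it.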