Avoiding slow Solr highlighting caused by stemming
I am quite new to Solr and would like to ask for your help.
I am developing an application that should highlight the results of a query. For this I am using the regex fragmenter:
<highlighting>
<fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
<lst name="defaults">
<int name="hl.fragsize">500</int>
<float name="hl.regex.slop">0.5</float>
<str name="hl.pre"><![CDATA[<b>]]></str>
<str name="hl.post"><![CDATA[</b>]]></str>
<str name="hl.useFastVectorHighlighter">true</str>
<str name="hl.regex.pattern">[-\w ,/\n\"']{20,300}[.?!]</str>
<str name="hl.fl">dokumentum_syn_query</str>
</lst>
The field is indexed with term vectors and offsets:
<field name="dokumentum_syn_query" type="huntext_syn" indexed="true" stored="true" multiValued="true" termVectors="on" termPositions="on" termOffsets="on"/>
<fieldType name="huntext_syn" class="solr.TextField" stored="true" indexed="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="com.morphologic.solr.huntoken.HunTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_query.txt" enablePositionIncrements="true" />
<filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
cache="alma"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_query.txt" enablePositionIncrements="true" />
<filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
cache="alma"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_query.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The highlighting works well, except that it is really slow. I realized that this is because the highlighter/fragmenter stems all the result documents again.
Could you please help me understand why this happens and how I can avoid it? (I thought that using the FastVectorHighlighter would solve my problem, but it didn't.)
Comments (2)
The problem was that I had used the value "on" instead of "true". So the proper line in the schema is:
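The corrected line itself did not survive in this copy of the post. Assuming it is the field definition from the question with only the three term vector attribute values changed from "on" to "true" (an inference, not the poster's verbatim text), it would presumably look like this:

```xml
<!-- Assumed reconstruction: identical to the field in the question,
     except that termVectors, termPositions and termOffsets use "true" instead of "on". -->
<field name="dokumentum_syn_query" type="huntext_syn" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
```

After a schema change like this, the documents have to be reindexed before term vectors, positions and offsets are actually stored for the field.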
To avoid slow Solr responses caused by highlighting, I decided not to use Solr's highlighting at all.
I implemented the highlighting on the client side.
That works for me, but it is a bit tricky: you have to process the search phrase on the client in the same way Solr does on the server, so that you also find the tokenized and stemmed terms and can mark what Solr searched for and found.
That means you have to implement stemming on the client side.
Alternative:
I think the term vectors in the result set give you the positions of the terms you have to highlight on the client side. You could use that information to highlight the terms on the client without implementing a stemmer there. But in the end I don't think this is really an alternative: Solr still has to compute the positions of the words, so you would not save any time on the server side.
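For reference, Solr can return stored term vectors through its TermVectorComponent. The following is a minimal sketch of wiring it into solrconfig.xml, not the poster's actual setup: the component name `tvComponent`, the handler path `/tvrh`, and the chosen defaults are illustrative assumptions, and the field must be indexed with term vectors, positions and offsets.

```xml
<!-- Sketch only: names and defaults are assumptions chosen for illustration. -->
<searchComponent name="tvComponent"
                 class="org.apache.solr.handler.component.TermVectorComponent"/>

<requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <!-- Turn the term vector component on for requests to this handler -->
    <bool name="tv">true</bool>
    <!-- Return term positions and character offsets for each term -->
    <bool name="tv.positions">true</bool>
    <bool name="tv.offsets">true</bool>
    <!-- Restrict the output to the field used for highlighting in the question -->
    <str name="tv.fl">dokumentum_syn_query</str>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>
```

With a configuration along these lines, a query sent to `/tvrh` would return, for each matching document, the indexed (already stemmed) terms together with their offsets into the stored text. The client still has to work out which of those terms correspond to the query.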