High-performance Solr tag cloud
I am looking at how to implement a high-performance tag cloud in Solr.
I have a Solr index with 15 million records, and more are added every day. One field is the target of several copyField statements, so it can hold anywhere between 1 and 6 values. Each value is usually a sentence or two (string data). I've attempted to create a custom field type to optimize and tokenize the field for quick faceting, but I'm getting lackluster performance.
Here is the custom field type that I've created.
<fieldType name="KeywordCloud" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Any suggestions on how I can achieve at least reasonable performance when faceting this field? Or is there a totally different approach that I can take?
This approach works well when the index holds only a million documents or so, but at 15 million and higher faceting becomes too slow.
Thanks in advance!
Comments (1)
Have you played with the Solr caches? As the number of unique terms in a field grows, you need to grow the caches accordingly. See this link for details. Pay attention to the filter cache and the field cache.
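To make the cache suggestion a bit more concrete, here is a minimal sketch of a filter cache entry as it might appear in the `<query>` section of solrconfig.xml. The sizes are illustrative assumptions, not tuned values, and `solr.FastLRUCache` is the cache class shipped with Solr releases of this era (newer releases use different cache implementations), so check what your version supports.

```xml
<!-- solrconfig.xml: merge this into the existing <query> section.
     All sizes are placeholder assumptions; derive real values from the
     cache statistics (hit ratio, evictions) on the Solr admin page. -->
<query>
  <!-- Consulted when faceting with facet.method=enum: roughly one cached
       filter per unique term, so a tokenized field with many unique terms
       needs many entries to avoid constant evictions. -->
  <filterCache class="solr.FastLRUCache"
               size="16384"
               initialSize="4096"
               autowarmCount="1024"/>
</query>
```

Which cache matters depends on the facet method: `facet.method=enum` leans on the filter cache, while `facet.method=fc` builds an in-memory structure over the field, which is not sized through an entry like this and instead shows up as JVM heap usage. If evictions stay high after raising `size`, the field probably has too many unique terms for term-by-term caching to be practical.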