High-performance Solr tag cloud

Posted on 2024-12-01 04:38:51

I am looking at how to implement a high-performance tag cloud in Solr.

I have a Solr index with 15 million records, and more are added every day. Several copyField statements copy data into a single field, so that field can hold anywhere between 1 and 6 values. Each value is usually a sentence or two of string data. I've attempted to create a custom field type to optimize and tokenize the field for quick faceting, but I'm getting lackluster performance.

Here is the custom field type that I've created.

    <fieldType name="KeywordCloud" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
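
To make the setup concrete, here is a rough sketch of how a field of this type and its copy statements could be declared in schema.xml. The field names (`keyword_cloud`, `title`, `summary`, `notes`) and the `stored="false"` setting are placeholders assumed for illustration, not the actual schema; only the `KeywordCloud` type name comes from the definition above.

```xml
<!-- Hypothetical target field: multi-valued because several sources are copied into it. -->
<field name="keyword_cloud" type="KeywordCloud" indexed="true" stored="false" multiValued="true"/>

<!-- Hypothetical source fields; each copyField adds one more value to keyword_cloud. -->
<copyField source="title"   dest="keyword_cloud"/>
<copyField source="summary" dest="keyword_cloud"/>
<copyField source="notes"   dest="keyword_cloud"/>
```

With a layout like this, every whitespace-separated, lowercased, non-stopword token from each source value becomes a separate term in `keyword_cloud`, and those terms are what the facet counts are computed over.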

Any suggestions on how I can achieve at least reasonable performance when faceting this field? Or is there a totally different approach that I can take?

This approach works great when I only have an index of a million documents or so, but 15 million and higher is giving me issues.
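
For reference, a tag cloud request of this kind could be expressed as request handler defaults in solrconfig.xml, roughly as sketched below. The handler name `/tagcloud`, the field name `keyword_cloud`, and the limit of 50 are assumptions for illustration; the facet parameters themselves are standard Solr faceting parameters.

```xml
<!-- Hypothetical handler that returns only facet counts for the tag cloud. -->
<requestHandler name="/tagcloud" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">*:*</str>                          <!-- match every document -->
    <int name="rows">0</int>                         <!-- no documents needed, only counts -->
    <str name="facet">true</str>
    <str name="facet.field">keyword_cloud</str>      <!-- assumed field name -->
    <int name="facet.limit">50</int>                 <!-- top 50 terms for the cloud -->
    <int name="facet.mincount">1</int>               <!-- skip terms with no matches -->
  </lst>
</requestHandler>
```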

Thanks in advance!
