Speed up Solr indexing

Posted on 2024-12-01 08:20:12

I am working on speeding up my Solr indexing. I just want to know how many threads (if any) Solr uses for indexing by default. Is there a way to increase or decrease that number?

Comments (3)

终弃我 2024-12-08 08:20:12

When you index a document, several steps are performed:

  • the document is analyzed,
  • data is put in the RAM buffer,
  • when the RAM buffer is full, data is flushed to a new segment on disk,
  • if there are more than ${mergeFactor} segments, segments are merged (the buffer size and merge factor are sketched just after this list).
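The RAM buffer size and ${mergeFactor} are the two knobs behind those last steps. As a rough illustration of where they live at the Lucene level (Lucene 3.1+ style API; in Solr you would normally set the equivalent ramBufferSizeMB and mergeFactor elements in solrconfig.xml instead of writing code):

```java
// Illustrative sketch only (Lucene 3.1+ API); the values and index path are arbitrary.
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexingKnobs {
    public static void main(String[] args) throws Exception {
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMergeFactor(10);      // merge once ~10 segments have accumulated

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
        config.setRAMBufferSizeMB(64.0);     // flush the RAM buffer to a new segment at ~64 MB
        config.setMergePolicy(mergePolicy);

        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/tmp/index")), config);
        // ... add documents ...
        writer.close();
    }
}
```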

The first two steps run in as many threads as you have clients sending data to Solr, so if you want Solr to run three threads for these steps, all you need to do is send data to Solr from three threads.
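For illustration, a minimal SolrJ 3.x sketch of that pattern could look like the following; the Solr URL, field names and document counts are placeholders, and a single CommonsHttpSolrServer instance is shared across the threads:

```java
// Hedged sketch: a SolrJ 3.x client feeding Solr from several threads.
// The URL, field names and loop bounds are placeholders.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        final CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        final int nThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);

        for (int t = 0; t < nThreads; t++) {
            final int offset = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Each thread sends its own slice of the documents.
                        for (int id = offset; id < 100000; id += nThreads) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", Integer.toString(id));
                            doc.addField("text", "document body " + id);
                            server.add(doc);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit();
    }
}
```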

You can configure the number of threads used for the fourth step if you use a ConcurrentMergeScheduler (http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/ConcurrentMergeScheduler.html). However, there is no way to configure the maximum number of threads from the Solr configuration files, so you need to write a custom class that calls setMaxThreadCount in its constructor.
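A sketch of such a class could be as small as the following (the class name and the limit of two threads are arbitrary); you would then reference it from the mergeScheduler setting in solrconfig.xml:

```java
// Minimal sketch of a merge scheduler that caps the number of background merge threads.
// The class name and the limit of 2 are arbitrary choices.
import org.apache.lucene.index.ConcurrentMergeScheduler;

public class BoundedMergeScheduler extends ConcurrentMergeScheduler {
    public BoundedMergeScheduler() {
        setMaxThreadCount(2);  // allow at most two concurrent merge threads
    }
}
```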

My experience is that the main ways to improve indexing speed with Solr are:

  • buying faster hardware (especially I/O),
  • sending data to Solr from several threads (as many threads as cores is a good start),
  • using the Javabin format,
  • using faster analyzers.

Although StreamingUpdateSolrServer looks interesting for improving indexing performance, it doesn't support the Javabin format (https://issues.apache.org/jira/browse/SOLR-1565). Since Javabin parsing is much faster than XML parsing, I got better performance by sending bulk updates (800 in my case, but with rather small documents) using CommonsHttpSolrServer and the Javabin format.
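As an illustration, a batched Javabin update with SolrJ 3.x might look like the sketch below, assuming the Javabin update handler is enabled on the server; the URL, field names and loop bound are placeholders, and 800 is simply the batch size mentioned above:

```java
// Sketch of batched updates sent in the Javabin format (SolrJ 3.x).
// URL and field names are placeholders; 800 is the batch size used above.
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JavabinBulkIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.setRequestWriter(new BinaryRequestWriter());  // send updates as Javabin instead of XML

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int id = 0; id < 10000; id++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(id));
            doc.addField("text", "document body " + id);
            batch.add(doc);

            if (batch.size() == 800) {  // send one bulk update per 800 documents
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();
    }
}
```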

You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.

情深已缘浅 2024-12-08 08:20:12

This article describes an approach to scaling indexing with SolrCloud, Hadoop and Behemoth. This is for Solr 4.0, which hadn't been released at the time this question was originally posted.

楠木可依 2024-12-08 08:20:12

You can store the content in external storage, such as files.

For every field that holds a large amount of content, set stored="false" on the corresponding field in the schema and keep that field's content in an external file, using an efficient file-system hierarchy.

This reduced indexing time by 40 to 45%. Search, however, became somewhat slower: it took about 25% more time than a normal search.
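A rough sketch of this setup, assuming a schema in which a large "content" field is indexed but declared stored="false"; the paths and field names are made up:

```java
// Rough sketch: index a large field without storing it in Solr, and keep the
// raw content in an external file keyed by document id. Paths and field names
// are hypothetical; "content" is assumed to be stored="false" in schema.xml.
import java.io.File;
import java.io.FileWriter;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ExternalContentIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        String id = "doc-42";
        String content = "very large body of text ...";

        // 1. Keep the full content outside Solr, e.g. one file per document id.
        File external = new File("/data/content/" + id + ".txt");
        external.getParentFile().mkdirs();
        FileWriter out = new FileWriter(external);
        out.write(content);
        out.close();

        // 2. Index the content so it is searchable, but let Solr store only the id.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("content", content);  // stored="false" in the schema
        server.add(doc);
        server.commit();

        // 3. At display time, search Solr for ids, then read each body back from
        //    its external file (this extra read is the likely source of the ~25%
        //    search slowdown mentioned above).
    }
}
```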
