Solr numDocs overflow

Posted on 2024-12-28 01:26:58

We're running Solr to index large amounts of data and just ran into a very interesting problem for which I haven't been able to find any help anywhere.

It seems Solr is using a signed 32-bit integer to count the number of documents currently in the index. We just passed that limit, and our Solr statistics page shows the following:

numDocs : -2116382092
maxDoc : -2114669444 

Solr still seems to be able to index incoming data just fine, but when we search we get a NegativeArraySizeException (see the stack trace below).

We assume that the error and the overflow are connected (a fair assumption, I believe).
There is nothing about this in the Solr documentation, and so far I haven't been able to find any relevant help on the subject. The most likely solution is to drop a chunk of the index to bring the size back below MAX_INT, but we're not sure that will work, since Solr most likely has to perform a search in order to find the documents to delete.

I guess this is not so much a question as a statement of the facts, but I would like to know whether anyone else has ever come across this issue and, if so, how did you solve it?

 java.lang.NegativeArraySizeException
    at org.apache.solr.search.DocSetCollector.<init>(DocSetHitCollector.java:47)
    at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:627)
    at org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet(SolrIndexSearcher.java:563)
    at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:592)
    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:903)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
    at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
    at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
    at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:829)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:598)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:662)
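
To make the failure mode concrete, here is a minimal standalone sketch (an illustration only, not Solr's actual code): the document counter is a signed 32-bit int, so once it passes 2^31 - 1 it wraps negative, and any array a searcher tries to size from it can no longer be allocated.

    // Minimal sketch, not Solr code: a signed 32-bit doc counter wraps negative
    // once it passes Integer.MAX_VALUE, and sizing an array from it then fails.
    public class NumDocsOverflowDemo {
        public static void main(String[] args) {
            int maxDoc = Integer.MAX_VALUE; // index already at the 2^31 - 1 ceiling
            maxDoc += 30000000;             // keep indexing; the counter wraps negative
            System.out.println("maxDoc after overflow: " + maxDoc); // prints a negative value

            // Roughly what a hit collector does: allocate a structure sized by maxDoc.
            // With a negative size this throws java.lang.NegativeArraySizeException.
            int[] docs = new int[maxDoc];
        }
    }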

Comments (2)

捂风挽笑 2025-01-04 01:26:58

We have managed to solve the issue after a bit of pain.

What we did and how we solved it

First we ran CheckIndex to verify that all segments, and the index itself, were in good shape and hadn't become corrupt. This also let us know that we had enough RAM to do some real work on our larger segments later. As expected, the result of CheckIndex was OK. CheckIndex is available in the Lucene library.
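
For reference, a rough sketch of running CheckIndex programmatically; this assumes the Lucene 2.9/3.0-era CheckIndex API and a placeholder index path, and the same class can also be run from the command line via its main method.

    // Rough sketch (not the exact commands used), assuming the Lucene 2.9/3.0-era
    // CheckIndex API; the index path below is a placeholder.
    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    public class VerifyIndex {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/data/solr/data/index")); // placeholder
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(System.out);               // print per-segment diagnostics
            CheckIndex.Status status = checker.checkIndex(); // read-only health check
            System.out.println(status.clean ? "index is clean" : "index has problems");
        }
    }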

Step two was to split the index in two by segments (not in half).
In our case we had one huge segment consisting of about 1.7 billion documents; we simply isolated that segment from the source index and created two new indexes, one containing only the large segment and one containing the remaining 20-something segments we had.
To do this we used the IndexSplitter, also from the Lucene library.
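
A rough sketch of that step, assuming the IndexSplitter API from Lucene 3.0.x contrib-misc; the paths and the segment name below are placeholders, and listSegments() can be used first to find the name of the oversized segment.

    // Rough sketch, assuming Lucene 3.0.x contrib-misc IndexSplitter.
    // Paths and the segment name are placeholders, not from the original post.
    import java.io.File;
    import org.apache.lucene.index.IndexSplitter;

    public class SplitOutBigSegment {
        public static void main(String[] args) throws Exception {
            IndexSplitter splitter = new IndexSplitter(new File("/data/solr/data/index"));
            splitter.listSegments();  // prints segment names and their doc counts
            // Copy just the huge segment into its own index directory:
            splitter.split(new File("/data/solr/index-bigseg"), new String[] { "_x7" });
        }
    }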

Splitting the index requires Lucene 3.0.x and we only had Lucene 2.9.3 (bundled with Solr 1.4.1) installed. We downloaded a separate instance of Lucene 3.0.3 to get our hands on the IndexSplitter.
The two new indexes it created were incompatible with our Lucene version, so we ended up having to upgrade Lucene in our Solr installation to 2.9.4, which can read 3.0.x indexes.

We then booted Solr pointing at each of the new indexes, one at a time. This time numDocs was below MAX_INT and we could run our delete statements.
After doing this on both of the new indexes, we simply merged what was left back together using the IndexMergeTool (also in the Lucene library) and ended up with a healthy 1.5 billion document index that will last us another couple of months :)
The lesson learned here is to run the delete queries before we hit the ceiling.
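
For illustration, rough sketches of those two steps: a SolrJ delete-by-query (the URL, field name, and date range are placeholders), and then a merge of the two pruned indexes into a fresh one with IndexWriter, similar in spirit to the contrib IndexMergeTool.

    // Rough sketch, assuming SolrJ as shipped with Solr 1.4: delete a chunk of old
    // documents once numDocs is back below MAX_INT. URL, field, and range are placeholders.
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class PruneOldDocs {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            solr.deleteByQuery("timestamp:[* TO 2010-06-01T00:00:00Z]"); // hypothetical field/range
            solr.commit();
            solr.optimize(); // optional: merges segments and expunges deleted docs, lowering maxDoc
        }
    }

And the merge, once both halves had been pruned:

    // Rough sketch, assuming the Lucene 2.9/3.0 IndexWriter API: merge the two
    // pruned indexes back into one. All paths are placeholders.
    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeIndexes {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/data/solr/index-merged")),
                    new StandardAnalyzer(Version.LUCENE_29),
                    true,                                  // create a new, empty index
                    IndexWriter.MaxFieldLength.UNLIMITED);
            writer.addIndexesNoOptimize(new Directory[] {
                    FSDirectory.open(new File("/data/solr/index-bigseg")),
                    FSDirectory.open(new File("/data/solr/index-rest")) });
            writer.optimize();
            writer.close();
        }
    }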

A question to all Lucene experts:

What actually happens when you continue to index after MAX_INT has been hit? Are we overwriting data? If yes, what data is most likely to be overwritten?

惟欲睡 2025-01-04 01:26:58

Have you tried using distributed searching?

"If you have so many documents that you simply cannot fit them all on a single box for RAM or index size reasons, you can split an index into multiple pieces, called shards." Lucidworks
