Zend Lucene runs out of memory while indexing
An oldish site I'm maintaining uses Zend Lucene (ZF 1.7.2) as its search engine. I recently added two new tables to be indexed, together containing about 2000 rows of text data ranging between 31 bytes and 63kB.
The indexing worked fine a few times, but after the third run or so it started terminating with a fatal error due to exhausting its allocated memory. The PHP memory limit was originally set to 16M, which was enough to index all the other content, 200 rows of text at a few kilobytes each. I gradually increased the memory limit to 160M, but it still isn't enough and I can't increase it any higher.
When indexing, I first need to clear the previously indexed results, because the path scheme contains numbers which Lucene seems to treat as stopwords, returning every entry when I run this search:
$this->index->find('url:/tablename/12345');
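For completeness, clearing previously indexed entries with Zend_Search_Lucene is typically done by deleting each returned hit by its internal document id; a minimal sketch of that step using the example query above (not necessarily the exact code the site uses):

// Delete every document matched by the query, then flush the deletions.
$hits = $this->index->find('url:/tablename/12345');
foreach ($hits as $hit) {
    $this->index->delete($hit->id);   // marks the document as deleted
}
$this->index->commit();               // write the deletions to the index files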
After clearing all of the results I reinsert them one by one:
foreach ($urls as $v) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $v['data']));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $v['title']));
    $doc->addField(Zend_Search_Lucene_Field::Text('description', $v['description']));
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $v['path']));
    $this->index->addDocument($doc);
}
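Committing in batches can keep the number of buffered documents bounded instead of letting them accumulate until the end of the run; a hedged sketch of that variation (the $batchSize value is arbitrary and illustrative, not tuned):

// Same loop as above, but with a periodic commit() so buffered documents
// are flushed to the index files instead of piling up in memory.
$batchSize = 100;   // illustrative value
$count = 0;
foreach ($urls as $v) {
    $doc = new Zend_Search_Lucene_Document();
    // ... addField() calls exactly as in the loop above ...
    $this->index->addDocument($doc);
    if (++$count % $batchSize === 0) {
        $this->index->commit();   // flush the buffered documents
    }
}
$this->index->commit();           // flush whatever is left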
After about a thousand iterations the indexer runs out of memory and crashes. Strangely, doubling the memory limit only gets me a few dozen rows further.
I've already tried adjusting the MergeFactor and MaxMergeDocs parameters (to 5 and 100 respectively) and calling $this->index->optimize() every 100 rows, but neither provides consistent help.
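For context, these parameters are exposed as setters on the index object, together with MaxBufferedDocs, which controls how many added documents are held in memory before being flushed into a new segment; a short sketch with the values mentioned above (the MaxBufferedDocs value is only an illustrative guess):

// Tuning knobs on Zend_Search_Lucene; the first two values are the ones
// mentioned above, the MaxBufferedDocs value is illustrative only.
$this->index->setMergeFactor(5);
$this->index->setMaxMergeDocs(100);
$this->index->setMaxBufferedDocs(10);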
Clearing the whole search index and rebuilding it results in a successful index most of the time, but I'd prefer a more elegant and less CPU-intensive solution. Is there something I'm doing wrong? Is it normal for the indexing to hog so much memory?
Comments (1)
I had a similar problem with a site I had to maintain that had at least three different languages and had to re-index the same 10'000+ (and growing) localized documents for each locale separately (each using its own localized search engine). Suffice it to say that it usually failed within the second pass.
We ended up implementing an Ajax-based re-indexing process that called the script a first time to initialize and start re-indexing. That script stopped after a predefined number of processed documents and returned a JSON value indicating whether it was complete, along with other progress information. We then called the same script again with the progress variables until it returned a completed state.
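A rough sketch of what such a chunked endpoint can look like; the chunk size, the index path and the countDocumentsToIndex(), fetchDocuments() and indexDocument() helpers are hypothetical placeholders, not the actual code described above:

// Hypothetical chunked re-indexing endpoint, called repeatedly via Ajax.
// The offset is passed back and forth between calls as the progress state.
define('CHUNK_SIZE', 50);   // illustrative chunk size

$index  = Zend_Search_Lucene::open('/path/to/index'); // placeholder path
$offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
$total  = countDocumentsToIndex();                    // placeholder helper
$docs   = fetchDocuments($offset, CHUNK_SIZE);        // placeholder helper

foreach ($docs as $row) {
    indexDocument($index, $row);                      // placeholder helper
}
$index->commit();                                     // flush this chunk

$processed = $offset + count($docs);
header('Content-Type: application/json');
echo json_encode(array(
    'completed' => $processed >= $total,
    'offset'    => $processed,
    'total'     => $total,
));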
This also allowed us to show a progress bar for the process in the admin area.
For the cron job, we simply made a bash script doing the same task but with exit codes.
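The original was a bash script; the same idea sketched here as a PHP CLI driver instead (the endpoint URL is a placeholder), so that cron can act on the exit code:

// Hypothetical CLI driver: call the chunked endpoint until it reports
// completion, then exit 0; exit 1 on any failure so cron can flag it.
$offset = 0;
do {
    $json = @file_get_contents('http://example.com/reindex.php?offset=' . $offset);
    if ($json === false) {
        fwrite(STDERR, "reindex request failed\n");
        exit(1);
    }
    $state = json_decode($json, true);
    if (!is_array($state)) {
        fwrite(STDERR, "unexpected response: $json\n");
        exit(1);
    }
    $offset = (int) $state['offset'];
} while (empty($state['completed']));

exit(0);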
This was about 3 years ago and nothing has failed since then.