Zend Lucene runs out of memory while indexing
An oldish site I'm maintaining uses Zend Lucene (ZF 1.7.2) as its search engine. I recently added two new tables to be indexed, together containing about 2000 rows of text data ranging between 31 bytes and 63kB.
The indexing worked fine a few times, but after the third run or so it started terminating with a fatal error due to exhausting its allocated memory. The PHP memory limit was originally set to 16M, which was enough to index all the other content, 200 rows of text at a few kilobytes each. I gradually increased the memory limit to 160M, but it still isn't enough and I can't increase it any higher.
When indexing, I first need to clear the previously indexed results, because the path scheme contains numbers which Lucene seems to treat as stopwords, returning every entry when I run this search:
$this->index->find('url:/tablename/12345');
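For completeness, clearing previously indexed entries with Zend_Search_Lucene is typically done by deleting each returned hit by its internal document id; a minimal sketch of that step using the example query above (not necessarily the exact code the site uses):

// Delete every document matched by the query, then flush the deletions.
$hits = $this->index->find('url:/tablename/12345');
foreach ($hits as $hit) {
    $this->index->delete($hit->id);   // marks the document as deleted
}
$this->index->commit();               // write the deletions to the index files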
After clearing all of the results I reinsert them one by one:
foreach ($urls as $v) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $v['data']));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $v['title']));
    $doc->addField(Zend_Search_Lucene_Field::Text('description', $v['description']));
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $v['path']));
    $this->index->addDocument($doc);
}
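Committing in batches can keep the number of buffered documents bounded instead of letting them accumulate until the end of the run; a hedged sketch of that variation (the $batchSize value is arbitrary and illustrative, not tuned):

// Same loop as above, but with a periodic commit() so buffered documents
// are flushed to the index files instead of piling up in memory.
$batchSize = 100;   // illustrative value
$count = 0;
foreach ($urls as $v) {
    $doc = new Zend_Search_Lucene_Document();
    // ... addField() calls exactly as in the loop above ...
    $this->index->addDocument($doc);
    if (++$count % $batchSize === 0) {
        $this->index->commit();   // flush the buffered documents
    }
}
$this->index->commit();           // flush whatever is left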
After about a thousand iterations the indexer runs out of memory and crashes. Strangely, doubling the memory limit only gets me a few dozen rows further.
I've already tried adjusting the MergeFactor and MaxMergeDocs parameters (to 5 and 100 respectively) and calling $this->index->optimize() every 100 rows, but neither provides consistent help.
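For context, these parameters are exposed as setters on the index object, together with MaxBufferedDocs, which controls how many added documents are held in memory before being flushed into a new segment; a short sketch with the values mentioned above (the MaxBufferedDocs value is only an illustrative guess):

// Tuning knobs on Zend_Search_Lucene; the first two values are the ones
// mentioned above, the MaxBufferedDocs value is illustrative only.
$this->index->setMergeFactor(5);
$this->index->setMaxMergeDocs(100);
$this->index->setMaxBufferedDocs(10);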
Clearing the whole search index and rebuilding it results in a successful index most of the time, but I'd prefer a more elegant and less CPU-intensive solution. Is there something I'm doing wrong? Is it normal for the indexing to hog so much memory?
Comments (1)
I had a similar problem with a site I had to maintain that had at least three different languages and had to re-index the same 10'000+ (and growing) localized documents for each locale separately (each using its own localized search engine). Suffice it to say that it usually failed within the second pass.
We ended up implementing an Ajax-based re-indexing process that called the script a first time to initialize and start re-indexing. That script stopped after a predefined number of processed documents and returned a JSON value indicating whether it was complete, along with other progress information. We then called the same script again with the progress variables until it returned a completed state.
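A rough sketch of what such a chunked endpoint can look like; the chunk size, the index path and the countDocumentsToIndex(), fetchDocuments() and indexDocument() helpers are hypothetical placeholders, not the actual code described above:

// Hypothetical chunked re-indexing endpoint, called repeatedly via Ajax.
// The offset is passed back and forth between calls as the progress state.
define('CHUNK_SIZE', 50);   // illustrative chunk size

$index  = Zend_Search_Lucene::open('/path/to/index'); // placeholder path
$offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
$total  = countDocumentsToIndex();                    // placeholder helper
$docs   = fetchDocuments($offset, CHUNK_SIZE);        // placeholder helper

foreach ($docs as $row) {
    indexDocument($index, $row);                      // placeholder helper
}
$index->commit();                                     // flush this chunk

$processed = $offset + count($docs);
header('Content-Type: application/json');
echo json_encode(array(
    'completed' => $processed >= $total,
    'offset'    => $processed,
    'total'     => $total,
));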
This also allowed us to show a progress bar for the process in the admin area.
For the cron job, we simply made a bash script doing the same task but with exit codes.
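The original was a bash script; the same idea sketched here as a PHP CLI driver instead (the endpoint URL is a placeholder), so that cron can act on the exit code:

// Hypothetical CLI driver: call the chunked endpoint until it reports
// completion, then exit 0; exit 1 on any failure so cron can flag it.
$offset = 0;
do {
    $json = @file_get_contents('http://example.com/reindex.php?offset=' . $offset);
    if ($json === false) {
        fwrite(STDERR, "reindex request failed\n");
        exit(1);
    }
    $state = json_decode($json, true);
    if (!is_array($state)) {
        fwrite(STDERR, "unexpected response: $json\n");
        exit(1);
    }
    $offset = (int) $state['offset'];
} while (empty($state['completed']));

exit(0);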
This was about 3 years ago and nothing has failed since then.