Creating and updating a Zend_Search_Lucene index
I'm using Zend_Search_Lucene to index articles so that they can be searched on my website. Whenever an administrator creates, updates, or deletes an article in the admin area, the index is rebuilt:
    $config = Zend_Registry::get("config");
    $cache = $config->lucene->cache;
    $path = $cache . "/articles";

    try
    {
        $index = Zend_Search_Lucene::open($path);
    }
    catch (Zend_Search_Lucene_Exception $e)
    {
        $index = Zend_Search_Lucene::create($path);
    }

    $model = new Default_Model_Articles();
    $select = $model->select();
    $articles = $model->fetchAll($select);

    foreach ($articles as $article)
    {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::Text("title", $article->title));
        $index->addDocument($doc);
    }

    $index->commit();
My question is this: since I am reindexing all the articles, and handling deleted articles as well, why would I not just use "create" every time (instead of "open" and update)? With the method above, I think every article is added again with addDocument on each run, so there would be duplicates. How would I prevent that? Is there a way to check whether a Document already exists in the index?
Also, I don't think I fully understand how indexing works when you "open" an index and update it. Every run seems to add a new #.cfs file to the index folder (so I end up with _0.cfs, _1.cfs, _2.cfs), but when I use "create", the old file is replaced by a single new #.cfs file with the number incremented (so, for example, only _2.cfs remains). Can you please explain what these segment files are?
Comments (1)
Yes, you can check whether a Document is already in the index; have a look at this manual page. The termDocs method returns an array of internal document IDs for a given term, and you can delete each matching Document from the index via $index->delete($id);, where $id is one of those IDs. After that you can simply add the new version of the Document.
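A minimal sketch of that delete-then-add approach follows. It assumes Zend Framework 1 is on the include path and that every document carries the article's primary key in a Keyword field; the field name "article_id", the $article->id property, and the function name updateArticleInIndex are hypothetical and do not appear in the question's code, which only indexes the title.

```php
<?php
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document.php';
require_once 'Zend/Search/Lucene/Field.php';
require_once 'Zend/Search/Lucene/Index/Term.php';

/**
 * Replace one article in the index instead of re-adding everything.
 * Assumption: "article_id" is a Keyword field holding the article's key.
 */
function updateArticleInIndex(Zend_Search_Lucene_Interface $index, $article)
{
    // Look up every document whose article_id term matches this article.
    $term = new Zend_Search_Lucene_Index_Term((string) $article->id, 'article_id');

    foreach ($index->termDocs($term) as $docId) {
        // Skip documents that were already marked as deleted.
        if (!$index->isDeleted($docId)) {
            $index->delete($docId);
        }
    }

    // Add the current version of the article.
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Keyword('article_id', (string) $article->id));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $article->title));
    $index->addDocument($doc);

    $index->commit();
}
```

Called once per changed article (for example from the admin save action), this avoids duplicates without touching the rest of the index. For a deleted article, only the lookup-and-delete loop is needed.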
About the multiple index files that Lucene creates: every time you modify an existing index, Lucene does not really change the existing files; it writes a new segment (a partial index) for the changes, and each _N.cfs file is one such segment. Searching across many segments is slower than searching a single one, but there is a simple way around this. After you have changed the index, call $index->optimize(); this merges all the segments into a single one, which can improve search times considerably.
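If the whole index is rebuilt on every change anyway, as in the question, starting from create() and optimizing once at the end is one way to avoid both duplicates and a growing pile of segments. The sketch below assumes Zend Framework 1 is on the include path; the function name rebuildArticleIndex is hypothetical, and $articles is assumed to be the rowset from the question's Default_Model_Articles.

```php
<?php
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document.php';
require_once 'Zend/Search/Lucene/Field.php';

/**
 * Rebuild the article index from scratch and merge it into one segment.
 */
function rebuildArticleIndex($path, $articles)
{
    // create() starts a fresh index at $path, so documents from
    // earlier runs are not carried over.
    $index = Zend_Search_Lucene::create($path);

    foreach ($articles as $article) {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::Text('title', $article->title));
        $index->addDocument($doc);
    }

    // Flush pending documents, then merge all segments into one.
    $index->commit();
    $index->optimize();

    return $index;
}
```

Optimizing rewrites the index, so it is usually cheaper to call it once after a batch of changes than after each individual addDocument call.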