Creating and updating a Zend_Search_Lucene index

Posted on 2024-08-06 04:01:24

I'm using Zend_Search_Lucene to create an index of articles so that they can be searched on my website. Whenever an administrator updates/creates/deletes an article in the admin area, the index is rebuilt:

$config = Zend_Registry::get("config");
$cache = $config->lucene->cache;
$path = $cache . "/articles";

try
{
    $index = Zend_Search_Lucene::open($path);
}
catch (Zend_Search_Lucene_Exception $e)
{
    $index = Zend_Search_Lucene::create($path);
}

$model = new Default_Model_Articles();
$select = $model->select();
$articles = $model->fetchAll($select);

foreach ($articles as $article)
{
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text("title", $article->title));
    $index->addDocument($doc);
}

$index->commit();

My question is this. Since I am reindexing the articles and handling deleted articles as well, why would I not just use "create" every time (instead of "open" and update)? Using the above method, I think the articles would be added with addDocument every time (so there would be duplicates). How would I prevent that? Is there a way to check if a Document exists already in the index?

Also, I don't think I fully understand how the indexing works when you "open" and update it. It seems to create new #.cfs (so I have _0.cfs, _1.cfs, _2.cfs) files in the index folder every time, but when I use "create", it overwrites that file with a new #.cfs file with the # incremented (so, for example just _2.cfs). Can you please explain what these segmented files are?

绿光 2024-08-13 04:01:24

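Yes, you can check whether a document is already in the index; see the manual page on this. You can then remove that specific document from the index with $index->delete($id);, where $id is a document id returned by the termDocs method (it returns an array of ids). After that, simply add the new version of the document.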
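The delete-then-re-add approach described in this answer might look roughly like the sketch below. It assumes Zend Framework 1 is already loaded, that $index is an opened index, and that each article carries a unique id. The article_id field and the reindexArticle() helper are hypothetical names introduced for illustration; they are not in the question's code.

```php
<?php
// Sketch only. Assumptions: Zend Framework 1 is on the include path,
// $index was obtained from Zend_Search_Lucene::open() or ::create(),
// and $article->id uniquely identifies an article. The "article_id"
// field and this helper function are hypothetical.
function reindexArticle($index, $article)
{
    // Look up every stored document whose article_id term matches.
    $term = new Zend_Search_Lucene_Index_Term((string) $article->id, 'article_id');
    foreach ($index->termDocs($term) as $id) {
        // Remove the stale copy so the new one does not become a duplicate.
        $index->delete($id);
    }

    $doc = new Zend_Search_Lucene_Document();
    // Keyword fields are indexed as one untokenized term, so the id
    // can be matched exactly by the term lookup above.
    $doc->addField(Zend_Search_Lucene_Field::Keyword('article_id', (string) $article->id));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $article->title));
    $index->addDocument($doc);
    $index->commit();
}
```

With a helper like this, the admin area could reindex only the article that was saved instead of looping over every article on each change.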

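About the multiple index files Lucene creates: each time you modify an existing index, Lucene does not actually change the existing files. Instead it adds a partial index (a segment, such as _1.cfs) for each change you make. This is bad for search performance, but there is an easy fix. After each change to the index, call $index->optimize(); this merges all the partial files into the main index, which noticeably shortens search times.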
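As a minimal sketch of that advice, assuming Zend Framework 1 is loaded and that $path (taken from the question's code) points at an existing index directory:

```php
<?php
// Sketch only. Assumptions: Zend Framework 1 is on the include path and
// $path is the index directory used in the question.
$index = Zend_Search_Lucene::open($path);

// ... addDocument() / delete() calls for the changed articles go here ...

// Flush pending changes; this is what writes a new segment file.
$index->commit();

// Merge the accumulated segments into one so searches read fewer files.
$index->optimize();
```

Merging rewrites the index data, so on a large index it may be worth running optimize() less often than after every single edit; that trade-off is a judgment call and not something the answer specifies.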
