Sitecore search performance during re-indexing, and a custom IndexingProvider
We are on Sitecore 6.4 with the shared source Advanced Search module, and we see a big degradation in site search performance when the Sitecore re-index process kicks in and applies changes to the web database.
When we kick off a full site publish, the indexing manager picks up the changes and processes the history records, which in turn re-indexes each item that has been affected. As this happens per item, you can see the Lucene index on disk changing whilst looking at the directory (the number of files grows and changes as you watch).
If you try to search on the public website while this is happening, the search can take noticeably longer to complete; under heavy load it can take up to 15 seconds longer, until the re-index process has finished.
I can see this process is controlled by the IndexingProvider class. Is there any way in which to override this class and implement our own?
We have looked at the searching logic and can see that an IndexSearchContext object is created each time a search is requested, which in turn creates a new IndexSearcher. We have changed some of the logic so that the IndexSearchContext is preserved as a singleton, which of course means that multiple requests can be served by the same Lucene IndexSearcher. This has drastically reduced memory consumption, and reusing a single searcher is the recommended way to improve performance.
However, in doing this, changes to the index will not be picked up until a new IndexSearcher is created. We need a way in which to notify our code that the indexing process has finished and then we can reset our singleton IndexSearchContext object. How might we integrate this logic into the Sitecore configured code?
Rebuilding the index manually only takes about 5 seconds. Obviously this effectively deletes the index and then creates it all again, but why does the item-by-item update take so long? Is there a better way to run the update that avoids going item by item and does not affect the public website?
I would have expected others to be affected by this problem so I'm keen to hear how people have tackled the problem.
EDIT - additional info from Sitecore forum
The Sitecore.Search code does seem to make heavy use of creating/disposing new Lucene objects for a single operation. It does not seem overly scalable for large environments, which is why I was surprised when I saw the code. Especially if the indexes are large and there are a lot of content updates/publishes each day.
Looking at the classes via dotPeek I cannot see how we would override the IndexUpdateContext as it's created in a non virtual method. A custom DatabaseCrawler could get some access but only to the context object already created.
I notice that we can define our own Index implementation in the web.config for each index. We can also re-implement the crawler (we already have the advanced crawler in place from the shared module) and maybe get some control of the indexing process. I would be reluctant to pull out too much of the Sitecore code into our own implementation as it may affect future updates.
I have one question though regarding the IndexingProvider. In the following method:
    private void UpdateItem(HistoryEntry entry, Database database)
    {
        int count = database.Indexes.Count;
        if (count != 0 || this.OnUpdateItem != null)
        {
            Item obj = database.GetItem(entry.ItemId, entry.ItemLanguage, entry.ItemVersion);
            if (obj != null)
            {
                if (this.OnUpdateItem != null)
                    this.OnUpdateItem((object) this, (EventArgs) new SitecoreEventArgs("index:updateitem", new object[2]
                    {
                        (object) database,
                        (object) obj
                    }, new EventResult()));
                for (int index = 0; index < count; ++index)
                    database.Indexes[index].UpdateItem(obj);
            }
        }
    }
It fires the update event, which is handled by the DatabaseCrawler as it is attached to the IndexingProvider.OnUpdateItem event; but why does the method above also call the Sitecore.Data.Indexing.Index.UpdateItem method? I thought that namespace was being deprecated in version 6.5, so I'm surprised to see a link between the new and the old namespace.
So it looks like the DatabaseCrawler handles the update, deleting the item and then adding it to the index again, and then the old Sitecore.Data.Indexing.Index also tries to update it. Surely there is something wrong here? I don't know, though, so please correct me if I am wrong. This is just what it looks like when I trace through the decompiled code without any debugging.
Comments (2)
I would recommend two things:
1. Use the Advanced Database Crawler (v2 is the latest version), which wraps the Sitecore.Search namespace. This makes it super easy to use Lucene.NET with Sitecore.
2. Rebuild the indexes fully every day. This defragments the indexes, as fragmentation over time can reduce performance (which might be your issue here).
I've come across similar problems before. When I analysed what was going on, all of the time was being spent opening the index for every search.
The way we ended up solving it was by bypassing Sitecore's index classes and going direct to Lucene. Lucene provides a "Reopen" method which only opens the modified segment files, as opposed to all of the segment files like Sitecore does.
So what we did was:
- Have a look at the documentation for the Lucene.Net.Index.IndexReader.Reopen method.
- You can create an index reader from Sitecore.Search.Index.CreateReader().
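To make the idea concrete, here is a rough sketch of the "keep one reader, reopen it when the index changes" pattern. It is not the code we actually shipped, and SharedReader is a hypothetical helper, not a Sitecore or Lucene.NET type. It is written against plain delegates so that it compiles on its own; the comment at the bottom shows how I would expect it to be wired to Lucene, assuming Reopen() returns the same reader instance when nothing has changed, as the Lucene documentation describes.

```csharp
using System;

// Hypothetical helper (not part of Sitecore or Lucene.NET): holds one shared
// reader and swaps it only when the reopen delegate hands back a new instance.
public sealed class SharedReader<T> where T : class
{
    private readonly object _sync = new object();
    private readonly Func<T, T> _reopen; // expected to return the same instance if the index is unchanged
    private readonly Action<T> _close;   // releases a reader that has been replaced
    private T _current;

    public SharedReader(T initial, Func<T, T> reopen, Action<T> close)
    {
        if (initial == null) throw new ArgumentNullException("initial");
        if (reopen == null) throw new ArgumentNullException("reopen");
        if (close == null) throw new ArgumentNullException("close");
        _current = initial;
        _reopen = reopen;
        _close = close;
    }

    // The reader that searches should use right now.
    public T Current
    {
        get { lock (_sync) { return _current; } }
    }

    // Asks for a refreshed reader; returns true if a new one was swapped in.
    public bool Refresh()
    {
        lock (_sync)
        {
            T fresh = _reopen(_current);
            if (ReferenceEquals(fresh, _current))
                return false; // nothing changed, keep serving the old reader

            T old = _current;
            _current = fresh;
            // Simplification: this closes the old reader immediately. Real code
            // must first wait for in-flight searches (e.g. reference counting).
            _close(old);
            return true;
        }
    }
}

// Assumed wiring (unverified against the Lucene.NET build shipped with 6.4;
// "index" stands for whatever Sitecore.Search.Index instance you search):
//   var shared = new SharedReader<IndexReader>(
//       index.CreateReader(), r => r.Reopen(), r => r.Close());
//   var searcher = new IndexSearcher(shared.Current);
```

Refresh() could be called on a timer, or from whatever hook tells you indexing has moved on (a handler on IndexingProvider.OnUpdateItem might work, but I have not verified that). Either way, searches keep using Current and only pay for the segments that changed.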