I'm new to Lucene.NET, but I'm using an open source tool built for Sitecore CMS that uses Lucene.NET to index lots of content from the CMS. I confirmed yesterday that when I rebuild my indexes, the current index files are wiped clean, so anything that relies on the index gets no data for about 30-60 seconds (the time a full index rebuild takes). Is there a best practice or a way to make Lucene.NET not overwrite the current index files until the new index is completely rebuilt? Basically, I'd like it to write to new temporary index files and, once the rebuild is done, have those files replace the current index.
Example of what I'm talking about:
- Build fresh index (~30 seconds)
- Index has about 500 documents
- Use code to access data in index and display on website
- Rebuild index (~30 seconds)
- Any code that now reads the index for data returns nothing because the index files are being overwritten; results in website not showing any data
- Rebuild complete: data now available again, data back on website
Thanks in advance
Comments (2)
I have no experience with "Sitecore" itself but here's my story.
We've recently incorporated index-based search (using Lucene.Net) into our eCommerce sub-system. In our case the index update process can take about half an hour (~50,000 products plus lots of related information). To prevent "denial of service" responses during the update, we first create a "backup" version of the index (simply copying the index directory to another location), and all further requests are redirected to use this "backup" version. When the index update is completed, we delete the backup so that clients start using the updated (or "live") version of the index. This also helps if an unhandled exception occurs during the update, because otherwise you might end up with no index at all (in our case clients can always use the "backup" version).
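The backup-and-switch routine described above can be sketched roughly as follows. This is only a minimal illustration of the filesystem pattern, written in Python to keep it library-neutral; it is not the code we actually run and it does not touch any Lucene.Net API. The names `active_index_dir`, `rebuild_with_backup`, and the `rebuild` callback are hypothetical, and the sketch assumes searchers reopen the index on whatever directory `active_index_dir` returns.

```python
import shutil
import tempfile
from pathlib import Path


def active_index_dir(live: Path, backup: Path) -> Path:
    """Return the directory that searches should read from right now."""
    # A backup only exists while an update is running (or after one failed),
    # so readers stay on the backup copy in both cases.
    return backup if backup.exists() else live


def rebuild_with_backup(live: Path, backup: Path, rebuild) -> None:
    """Run rebuild(live) while readers are served from a copy of the index."""
    if live.exists() and not backup.exists():
        # Copy under a temporary name first, then rename, so readers never
        # see a half-copied backup directory.
        staging = backup.with_name(backup.name + ".tmp")
        shutil.rmtree(staging, ignore_errors=True)
        shutil.copytree(live, staging)
        staging.rename(backup)

    rebuild(live)  # hypothetical callback; may take a long time or raise

    # Only reached when the rebuild succeeded: dropping the backup sends
    # readers back to the freshly built live index.
    shutil.rmtree(backup, ignore_errors=True)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    live, backup = root / "index", root / "index.bak"
    live.mkdir()
    (live / "segments").write_text("v1")  # stand-in for real index files

    def rebuild(target: Path) -> None:
        (target / "segments").write_text("")  # index is wiped mid-rebuild
        print("readers use:", active_index_dir(live, backup).name)
        (target / "segments").write_text("v2")

    rebuild_with_backup(live, backup, rebuild)
    print("readers use:", active_index_dir(live, backup).name)
```

If the rebuild raises, the backup is left in place, so `active_index_dir` keeps pointing readers at the last good copy until a later rebuild succeeds.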
The API reference (Lucene 2.4) of the Lucene.Net.Index.IndexWriter object states the following:
So at least you shouldn't worry about the clients that are currently searching within your index.
Hope this helps you make the right decision.
I'm not familiar with that Sitecore tool, but I can answer how you would do it with pure Lucene.Net: you should use an NRT (near-real-time) setup, which means "have one index writer and never close it."
Basically, index writers have a "virtual" index in memory until it gets flushed to disk. So as long as you get your readers from the writer, you'll always see the latest stuff, even if it hasn't been flushed to disk yet.
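To make the idea concrete, here is a toy model in Python. It is not Lucene.Net code and none of these names exist in the library: `ToyWriter`, `get_reader`, and `open_reader_from_disk` are hypothetical stand-ins that only show why a reader obtained from the writer sees documents that have not been flushed yet, while a reader opened from disk does not.

```python
class ToyWriter:
    """Stand-in for an index writer that buffers documents in memory."""

    def __init__(self):
        self._committed = []  # documents already flushed to "disk"
        self._buffered = []   # documents added but not yet flushed

    def add(self, doc):
        self._buffered.append(doc)

    def flush(self):
        # Move everything buffered so far into the committed set.
        self._committed.extend(self._buffered)
        self._buffered.clear()

    def get_reader(self):
        # Near-real-time view: a snapshot of committed plus buffered docs.
        return tuple(self._committed + self._buffered)

    def open_reader_from_disk(self):
        # A reader opened independently of the writer sees only flushed docs.
        return tuple(self._committed)


writer = ToyWriter()  # one long-lived writer, never closed
writer.add("doc-1")
writer.flush()
writer.add("doc-2")   # not flushed yet

nrt_view = writer.get_reader()
disk_view = writer.open_reader_from_disk()
print(nrt_view)   # ('doc-1', 'doc-2')
print(disk_view)  # ('doc-1',)
```

In Lucene.Net itself the reader would come from the writer (I believe via `IndexWriter.GetReader()` in the 2.9/3.x line, but check the API of the version you are on).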