How do I maintain a Lucene index in an Azure cloud app?

Published 2024-09-26 20:16:04

I just started playing with the Azure Library for Lucene.NET (http://code.msdn.microsoft.com/AzureDirectory). Until now, I was using my own custom code to write Lucene indexes to Azure blob storage: I copied the blob to the local storage of the Azure web/worker role and read/wrote documents against that local index, using a custom locking mechanism to make sure reads and writes to the blob didn't clash. I am hoping the Azure library will take care of these issues for me.

However, while trying out the test app, I tweaked the code to use the compound-file option, and that created a new file every time I wrote to the index. Now my question is: if I have to maintain the index, i.e. keep a snapshot of the index files and use it if the main index gets corrupted, how do I go about doing this? Should I keep a backup of all the .cfs files that are created, or is handling only the latest one fine? Are there API calls to clean up the blob so that only the latest files are kept after each write to the index?
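
For reference, the tweak looks roughly like this. The catalog name, analyzer and fields are placeholders, and I am going from the AzureDirectory sample and the Lucene.NET 2.9-era API, so treat it as a sketch rather than exact code:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store.Azure;          // AzureDirectory from the MSDN sample
using Microsoft.WindowsAzure;

// Sketch only: write to an index kept in blob storage via AzureDirectory,
// with the compound-file option turned on. Names here are placeholders.
CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
AzureDirectory azureDirectory = new AzureDirectory(account, "MyCatalog");

IndexWriter writer = new IndexWriter(
    azureDirectory,
    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
    IndexWriter.MaxFieldLength.UNLIMITED);

// This is the tweak in question: pack each segment into a single .cfs file.
writer.SetUseCompoundFile(true);

Document doc = new Document();
doc.Add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("body", "some text", Field.Store.NO, Field.Index.ANALYZED));
writer.AddDocument(doc);

writer.Commit();   // each commit pushes the new segment files up to the blob
writer.Close();
```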

Thanks
Kapil

Comments (2)

残龙傲雪 2024-10-03 20:16:04

After I answered this, we ended up changing our search infrastructure and used Windows Azure Drive. We had a worker role which would mount a VHD from blob storage and host the Lucene.NET index on it. The code checked that the VHD was mounted first and that the index directory existed. If the worker role fell over, the VHD would automatically dismount after 60 seconds, and a second worker role could pick it up.
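
The mount-and-check step looked roughly like this. This is a sketch from memory against the old Azure SDK 1.x CloudDrive / StorageClient API, so the blob path, cache resource name, sizes and index folder are placeholders and the exact signatures may differ in your SDK:

```csharp
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

// Sketch: mount a Windows Azure Drive (a VHD in a page blob) inside a worker
// role and open the Lucene.NET index that lives on it. Names are placeholders.
CloudStorageAccount account = CloudStorageAccount.Parse(
    RoleEnvironment.GetConfigurationSettingValue("DataConnectionString"));

// The drive needs a local cache initialised before it can be mounted.
LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

CloudDrive drive = account.CreateCloudDrive("drives/lucene-index.vhd");
try
{
    drive.Create(1024);                 // size in MB; only succeeds the first time
}
catch (CloudDriveException)
{
    // already exists - fine, just mount it
}

string drivePath = drive.Mount(cache.MaximumSizeInMegabytes, DriveMountOptions.None);

// Make sure the index directory exists before handing it to Lucene.
string indexPath = Path.Combine(drivePath, "index");
Directory.CreateDirectory(indexPath);
var index = Lucene.Net.Store.FSDirectory.Open(new DirectoryInfo(indexPath));
```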

We have since changed our infrastructure again and moved to Amazon, with a Solr instance for search, but the VHD option worked well during development. It could have worked well in test and production too, but our requirements meant we needed to move to EC2.

你的背包 2024-10-03 20:16:04

I am using AzureDirectory for full-text indexing on Azure, and I am getting some odd results too... but hopefully this answer will be of some use to you.

Firstly, the compound-file option: from what I have been reading and figuring out, the compound file is a single large file with all the index data inside it. The alternative to this is having lots of smaller files (configured using the SetMaxMergeDocs(int) function of IndexWriter) written to storage. The problem with that is that once you get to lots of files (I foolishly set this to about 5000), it takes an age to download the indexes (on the Azure server it takes about a minute; on my dev box... well, it has been running for 20 minutes now and still hasn't finished...).
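
For what it's worth, these are the knobs I mean, on the Lucene.NET 2.9-era IndexWriter. The directory and the values are purely illustrative (a RAMDirectory stands in for whatever Directory you actually use); they are not recommendations:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Illustrative only: the compound-file / segment-size settings discussed above.
Directory dir = new RAMDirectory();     // stand-in for your AzureDirectory etc.
IndexWriter writer = new IndexWriter(
    dir,
    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
    true,                               // create a new index
    IndexWriter.MaxFieldLength.UNLIMITED);

// One big .cfs per segment: fewer, larger files to pull down from storage.
writer.SetUseCompoundFile(true);

// The alternative: plain multi-file segments, with a cap on documents per
// merged segment. A low cap (the ~5000 I mentioned) means thousands of files.
writer.SetUseCompoundFile(false);
writer.SetMaxMergeDocs(5000);
writer.SetMergeFactor(10);              // how many segments get merged at once
```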

As for backing up indexes, I have not come up against this yet, but given we have about 5 million records currently, and that number will grow, I am wondering about this too. If you are using a single compound file, maybe downloading the files to a worker role, zipping them and uploading them with today's date would work... If you have a smaller set of documents, you might get away with re-indexing the data if something goes wrong... but again, it depends on the numbers...
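
As a sketch of that backup idea (skipping the zipping and just copying the raw index files under a dated prefix, with the old Microsoft.WindowsAzure.StorageClient API; the connection, container and folder names are placeholders):

```csharp
using System;
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Sketch: snapshot whatever is in the local copy of the index to blob storage
// under a date-stamped prefix. All names and paths here are placeholders.
CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
CloudBlobClient client = account.CreateCloudBlobClient();

CloudBlobContainer backups = client.GetContainerReference("index-backups");
backups.CreateIfNotExist();

string localIndexPath = @"C:\Resources\index";            // local copy of the index
string prefix = DateTime.UtcNow.ToString("yyyy-MM-dd");   // one snapshot per day

foreach (string file in Directory.GetFiles(localIndexPath))
{
    // e.g. index-backups/2024-09-26/_0.cfs
    CloudBlob blob = backups.GetBlobReference(prefix + "/" + Path.GetFileName(file));
    blob.UploadFile(file);
}
```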
