Building a distributed index on Azure and Lucene.NET: should I learn Solr and Hadoop?
I need to base my search indexes on an Azure/Lucene.NET implementation. That said, I don't know much about Solr and Hadoop, or what they offer the Linux crowd.
Since I can't judge the learning curve ahead of me, I'll describe what I'm looking for, and perhaps you can tell me how I should spend my time.
I'm interested in indexing an ever-growing batch of emails from our system. Messages need to be searchable as soon as they are sent or received. That means the indexes could become huge, which is why we are looking at cloud storage. Because I'm familiar with Azure, management is suggesting that we use Lucene.NET.
What do you think is the best use of my time: studying how to make Lucene.NET index my documents, or looking at the Solr/Hadoop approach to the same problem?
Comments (2)
Without knowing the scale of your source corpus, I can only share some of our own experience (we operate on several TB in a near-real-time application). We are primarily a .NET shop, and we found Solr quite easy to use through tools such as SolrNet, with a very gentle learning curve for our developers.
The advantages of Solr are plenty, from the obvious ones, such as faceting and a simple, flexible API if you need one, to the fact that it has a far more active community and gets the latest features and fixes sooner (cf. Lucene.NET). Importantly, we could easily scale linearly using Solr on commodity machines. Sorry, I cannot make a dollar comparison with the cloud, but given the almost zero cost of the kind of machines we use for our shards, I cannot imagine that Azure or AWS would be cheaper.
Hope that helps.
If you can communicate with your index machines over HTTP, I would suggest using Solr. You can set up a Solr server quite easily, without any programming, just by changing configuration files. It scales nicely; see "Scaling Lucene and Solr". Solr Cloud, currently in development, will make scaling Solr easier and support some Hadoop-like features.
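To give a feel for what "talking to Solr over HTTP" looks like, here is a rough sketch in Python that only builds the requests, so it runs without a server. The host, the core name `emails`, and the field names `subject` and `body` are assumptions made for illustration; they would have to match your own Solr configuration and schema. It uses Solr's JSON update handler with `commitWithin` so new mail becomes searchable shortly after it is posted, and the `shards` parameter for a query spread over several index machines.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical host and core name; adjust to your own installation.
SOLR_BASE = "http://localhost:8983/solr/emails"


def build_update_request(docs, commit_within_ms=1000):
    """Build a POST that adds documents via Solr's JSON update handler.

    commitWithin asks Solr to make the documents searchable within the
    given number of milliseconds, instead of forcing a commit per message.
    """
    url = f"{SOLR_BASE}/update?" + urlencode({"commitWithin": commit_within_ms})
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_select_url(query, shards=None, rows=10):
    """Build a search URL; `shards` optionally lists the cores to query."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if shards:
        # Distributed search: comma-separated list of host:port/path entries.
        params["shards"] = ",".join(shards)
    return f"{SOLR_BASE}/select?" + urlencode(params)


if __name__ == "__main__":
    # Field names below are illustrative and must exist in your schema.
    email = {"id": "msg-0001", "subject": "Quarterly report", "body": "Numbers attached."}
    req = build_update_request([email])
    print(req.get_method(), req.full_url)
    print(build_select_url("subject:report"))
    # To actually send it: urllib.request.urlopen(req) against a running Solr.
```

From .NET you would more likely let SolrNet issue the equivalent requests for you; the point is only that indexing and searching are plain HTTP calls, so the index machines do not need to run your application code.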