Implementing Lucene on an existing .NET / SQL Server stack with multiple web servers
I want to look at using Lucene as a full-text search solution for a site that I currently manage. The site is built entirely on SQL Server 2008 / C# .NET 4 technologies. The data I'm looking to index is actually quite simple, with only a couple of fields per record and only one of those fields actually searchable.
It's not clear to me which toolset I should be using, or what the architecture should look like. Specifically:
Where should I put the index? I've seen people recommend putting it on the webserver, but that would seem wasteful for a large number of webservers. Surely centralising would be better here?
If the index is centralised, how would I query it, given that it just lives on the filesystem? Will I have to effectively put it on a network share that all the webservers can see?
Are there any pre-existing tools that will incrementally populate a Lucene index on a schedule, pulling the data from an SQL Server database? Would I be better off rolling my own service here?
When I query the index, should I just pull back a bunch of record IDs and then go back to the DB for the actual records, or should I be aiming to pull everything I need for the search straight out of the index?
Is there value in trying to implement something like Solr in this flavour of environment? If so, I'd probably give it its own *nix VM and run it within Tomcat on that. But I'm not sure what Solr would buy me in this case.
Comments (1)
I'll answer a bit based on how we chose to implement Lucene.Net here on Stack Overflow, and some lessons I learned along the way:
[...] (NIOFSDirectory, for example).
[...] n times for the web tier, but luckily we're not starved for network bandwidth, and SQL Server caching the results makes this a very fast delta indexing operation each time. With a large number of web servers, that alone may eliminate this option.
[...] (write.lock, the directory locking mechanism, will ensure this and error when you try multiple IndexWriters at once).
[...] IndexReaders to get the updated index with new documents. We use a redis messaging channel to alert whoever cares that the index has updated; any messaging mechanism would work here.
[...] n servers crawling a network share (competing for IO as well), they can hit a single server that only deals with requests and results over the network, not crawling the index, which is a lot more data going back and forth; this would be local on the Solr server(s). Also, you're not hitting your SQL Server as much, since fewer servers are indexing.
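Two of the points above (only one writer may hold the index at a time, and readers must be re-opened after an indexing pass once they have been told about it) can be illustrated with a small library-free sketch. It is written in plain Java, Lucene's native language, and deliberately does not call the Lucene, Lucene.Net, or redis APIs. Every name in it (`IndexSnapshot`, `SharedIndex`, `SearcherHolder`, `IndexRefreshDemo`) is hypothetical, the "index" is just an in-memory list, and the notification is a direct callback standing in for whatever messaging channel is used. Treat it as a model of the pattern under those assumptions, not as the setup described in the answer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a committed index: an immutable list plus a generation number. */
final class IndexSnapshot {
    final long generation;
    final List<String> documents;

    IndexSnapshot(long generation, List<String> documents) {
        this.generation = generation;
        this.documents = Collections.unmodifiableList(new ArrayList<>(documents));
    }
}

/** Hypothetical shared store playing the role of the index directory. */
final class SharedIndex {
    private volatile IndexSnapshot committed = new IndexSnapshot(0, new ArrayList<>());
    private final AtomicBoolean writerOpen = new AtomicBoolean(false);
    private final List<Runnable> subscribers = new CopyOnWriteArrayList<>();

    IndexSnapshot latest() { return committed; }

    /** Register a callback; stands in for subscribing to a messaging channel. */
    void subscribe(Runnable onUpdated) { subscribers.add(onUpdated); }

    /** Only one writer at a time: a second attempt fails fast, as a held lock file would. */
    Writer openWriter() {
        if (!writerOpen.compareAndSet(false, true)) {
            throw new IllegalStateException("index is locked by another writer");
        }
        return new Writer();
    }

    final class Writer implements AutoCloseable {
        private final List<String> pending = new ArrayList<>(committed.documents);

        void add(String document) { pending.add(document); }

        /** Commit the pending documents, release the writer slot, then notify subscribers. */
        @Override
        public void close() {
            committed = new IndexSnapshot(committed.generation + 1, pending);
            writerOpen.set(false);
            subscribers.forEach(Runnable::run);
        }
    }
}

/** What each web server would hold: a view of one snapshot, swapped when notified. */
final class SearcherHolder {
    private final SharedIndex index;
    private final AtomicReference<IndexSnapshot> current;

    SearcherHolder(SharedIndex index) {
        this.index = index;
        this.current = new AtomicReference<>(index.latest());
        index.subscribe(this::refresh);
    }

    /** Equivalent of re-opening the reader: pick up the latest committed snapshot. */
    void refresh() { current.set(index.latest()); }

    long generation() { return current.get().generation; }

    /** Naive substring match over the snapshot this holder currently sees. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String document : current.get().documents) {
            if (document.contains(term)) {
                hits.add(document);
            }
        }
        return hits;
    }
}

final class IndexRefreshDemo {
    public static void main(String[] args) {
        SharedIndex index = new SharedIndex();
        SearcherHolder webServer = new SearcherHolder(index);
        try (SharedIndex.Writer writer = index.openWriter()) {
            writer.add("lucene on sql server");
        }
        // The holder was notified when the writer closed, so the new document is visible.
        System.out.println(webServer.search("lucene"));
    }
}
```

The point of the sketch is the ordering: a searcher keeps serving its old snapshot until the writer has committed and the notification arrives, so searches never see a half-written index. In a real deployment the callback would be replaced by a message subscription and `refresh()` by the library's own reader re-open call.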