Implementing Lucene on an existing .NET / SQL Server stack with multiple web servers

Posted 2024-11-17 23:58:24

I want to look at using Lucene for a fulltext search solution for a site that I currently manage. The site is built entirely on SQL Server 2008 / C# .NET 4 technologies. The data I'm looking to index is actually quite simple, with only a couple of fields per record and only one of those fields actually searchable.

It's not clear to me what the best toolset to use is, or what architecture I should be using. Specifically:

  1. Where should I put the index? I've seen people recommend putting it on the webserver, but that would seem wasteful for a large number of webservers. Surely centralising would be better here?

  2. If the index is centralised, how would I query it, given that it just lives on the filesystem? Will I have to effectively put it on a network share that all the webservers can see?

  3. Are there any pre-existing tools that will incrementally populate a Lucene index on a schedule, pulling the data from an SQL Server database? Would I be better off rolling my own service here?

  4. When I query the index, should I be looking to just pull back a bunch of record IDs and then go back to the DB for the actual records, or should I be aiming to pull everything I need for the search straight out of the index?

  5. Is there value in trying to implement something like Solr in this flavour environment? If so, I'd probably give it its own *nix VM and run it within Tomcat on that. But I'm not sure what Solr would buy me in this case.


Comments (1)

与风相奔跑 2024-11-24 23:58:24

I'll answer a bit based on how we chose to implement Lucene.Net here on Stack Overflow, and some lessons I learned along the way:

Where should I put the index? I've seen people recommend putting it on the webserver, but that would seem wasteful for a large number of webservers. Surely centralising would be better here?

  • It depends on your goals. We had a severely under-utilized web tier (~10% CPU) and an overloaded database doing FullText searching (around 60% CPU, and we wanted it lower). Loading the same index on each web server lets us utilize those machines and gives us a ton of redundancy; we can still lose 9 out of 10 web servers and keep the Stack Exchange network up if need be. There is a downside: it's very IO (read) intensive for us, and the web tier was not bought with this in mind (this is often the case at most companies). While it works fine, we'll still be upgrading our web tier to SSDs and implementing some other bits left out of the .Net port to compensate for this hardware deficiency (NIOFSDirectory, for example).
  • The other downside is that we index all our databases n times, once per web server, but luckily we're not starved for network bandwidth, and SQL Server caching the results makes each delta indexing operation very fast. With a large number of web servers, that alone may eliminate this option.

If the index is centralised, how would I query it, given that it just lives on the filesystem? Will I have to effectively put it on a network share that all the webservers can see?

  • You can query it on a file share either way; just make sure only one process is indexing at a time (write.lock, the directory locking mechanism, will ensure this and throw an error if you try to open multiple IndexWriters at once).
  • Keep in mind my notes above: this is IO intensive when a lot of readers are flying around, so you need ample bandwidth to your store. Short of at least iSCSI or a fibre SAN, I'd be cautious of this approach for high-traffic use (hundreds of thousands of searches a day).
  • Another consideration is how you update/alert your web servers (or whatever tier is querying it). When you finish an indexing pass, you'll need to re-open your IndexReaders to pick up the updated index with the new documents. We use a redis messaging channel to alert whoever cares that the index has updated...any messaging mechanism would work here (a minimal sketch of this reader swap follows below).
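
To make the reader swap concrete, here is a minimal sketch of the pattern, assuming Lucene.Net 3.0.3's classic API and StackExchange.Redis as the pub/sub client (the answer doesn't name a specific client); the channel name, index path, and class names are hypothetical:

```csharp
// Sketch: swap in a fresh IndexSearcher when an "index updated" message arrives.
// Assumes Lucene.Net 3.0.3 and StackExchange.Redis; channel and path are made up.
using System.IO;
using Lucene.Net.Search;
using Lucene.Net.Store;
using StackExchange.Redis;

public static class SearcherHolder
{
    private static readonly FSDirectory IndexDir =
        FSDirectory.Open(new DirectoryInfo(@"\\fileshare\lucene\posts"));

    // Searchers are cheap to share across requests; swap the reference atomically.
    private static volatile IndexSearcher _current =
        new IndexSearcher(IndexDir, true); // true = read-only

    public static IndexSearcher Current { get { return _current; } }

    public static void ListenForUpdates(string redisHost)
    {
        var redis = ConnectionMultiplexer.Connect(redisHost);
        redis.GetSubscriber().Subscribe("index-updated", (channel, message) =>
        {
            // Open a new searcher over the freshly committed segments, then retire the old one.
            var fresh = new IndexSearcher(IndexDir, true);
            var old = _current;
            _current = fresh;
            old.Dispose(); // in-flight searches should hold their own reference before this point
        });
    }
}
```

The indexing side would publish to the same channel after committing, for example `redis.GetSubscriber().Publish("index-updated", "posts")`.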

Are there any pre-existing tools that will incrementally populate a Lucene index on a schedule, pulling the data from an SQL Server database? Would I be better off rolling my own service here?

  • Unfortunately there are none that I know of, but I can share with you how I approached this.
  • When indexing a specific table (akin to a document in Lucene), we added a rowversion column to that table. When we index, we select based on the last indexed rowversion (a timestamp datatype, pulled back as a bigint). I chose to store the last index date and last indexed rowversion on the file system via a simple .txt file for one reason: everything else in Lucene is stored there. This means that if there's ever a large problem, you can just delete the folder containing the index and the next indexing pass will recover and produce a fully up-to-date index; just add some code that treats nothing being there as "index everything" (a minimal sketch of this delta pass follows below).
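
Here is a minimal sketch of such a delta pass, assuming a hypothetical dbo.Posts table with Id, Title and Body columns plus a rowversion column named RowVersion, Lucene.Net 3.0.3, and System.Data.SqlClient; paths and names are illustrative, not the poster's actual code:

```csharp
// Sketch: incremental (delta) indexing keyed off SQL Server's rowversion column.
// dbo.Posts, its columns, and the file paths are hypothetical.
using System.Data.SqlClient;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class DeltaIndexer
{
    private const string IndexPath = @"D:\lucene\posts";
    private const string StateFile = @"D:\lucene\posts\last-rowversion.txt";

    public static void Run(string connectionString)
    {
        // No state file means "index everything" (rowversion watermark of 0).
        bool firstRun = !File.Exists(StateFile);
        long lastRowVersion = firstRun ? 0L : long.Parse(File.ReadAllText(StateFile));

        var dir = FSDirectory.Open(new DirectoryInfo(IndexPath));
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        using (var writer = new IndexWriter(dir, analyzer, firstRun,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            var cmd = new SqlCommand(
                @"SELECT Id, Title, Body, CAST(RowVersion AS bigint) AS Rv
                  FROM dbo.Posts
                  WHERE CAST(RowVersion AS bigint) > @last
                  ORDER BY Rv", conn);
            cmd.Parameters.AddWithValue("@last", lastRowVersion);

            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    var doc = new Document();
                    doc.Add(new Field("Id", reader.GetInt32(0).ToString(),
                                      Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("Title", reader.GetString(1),
                                      Field.Store.YES, Field.Index.ANALYZED));
                    doc.Add(new Field("Body", reader.GetString(2),
                                      Field.Store.NO, Field.Index.ANALYZED));

                    // UpdateDocument deletes any existing doc with this Id, then adds,
                    // so re-indexed rows don't end up duplicated.
                    writer.UpdateDocument(new Term("Id", doc.Get("Id")), doc);
                    lastRowVersion = reader.GetInt64(3);
                }
            }
            writer.Commit();
        }

        File.WriteAllText(StateFile, lastRowVersion.ToString());
    }
}
```

Because the watermark file lives next to the index, deleting the index folder also resets it, which gives you the "delete the folder and the next pass rebuilds everything" behaviour described above.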

When I query the index, should I be looking to just pull back a bunch of record IDs and then go back to the DB for the actual records, or should I be aiming to pull everything I need for the search straight out of the index?

  • This really depends on your data; for us it's not really feasible to store everything in the index (nor is it recommended). What I suggest is that you store the fields for your search results in the index, and by that I mean whatever you need to present your search results in a list, before the user clicks through to the full [insert type here] (see the query sketch after this list).
  • Another consideration is how often your data changes. If a lot of the fields you're not searching on change rapidly, you'll need to re-index those rows (documents) to keep the index up to date, not only when the field you're searching on changes.
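
As an illustration of the "IDs plus list fields only" approach, here is a minimal query sketch reusing the hypothetical field names from the indexing sketch above (again Lucene.Net 3.0.3; none of this is from the original answer):

```csharp
// Sketch: search the analyzed "Body" field, pull back only the stored Id and Title
// needed to render a results list; the full record is fetched from SQL Server on click-through.
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public class SearchHit
{
    public int Id { get; set; }
    public string Title { get; set; }
}

public static class PostSearcher
{
    public static List<SearchHit> Search(string queryText, int maxHits)
    {
        var dir = FSDirectory.Open(new DirectoryInfo(@"D:\lucene\posts"));
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var parser = new QueryParser(Version.LUCENE_30, "Body", analyzer);

        using (var searcher = new IndexSearcher(dir, true)) // read-only
        {
            var hits = searcher.Search(parser.Parse(queryText), maxHits);
            var results = new List<SearchHit>();
            foreach (var scoreDoc in hits.ScoreDocs)
            {
                var doc = searcher.Doc(scoreDoc.Doc);
                results.Add(new SearchHit
                {
                    Id = int.Parse(doc.Get("Id")),
                    Title = doc.Get("Title") // stored, so no DB round-trip for the list
                });
            }
            return results;
        }
    }
}
```

Only Id and Title are stored, so rendering the results list needs no database round-trip; the click-through to the full record goes back to SQL Server by Id.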

Is there value in trying to implement something like Solr in this flavour environment? If so, I'd probably give it its own *nix VM and run it within Tomcat on that. But I'm not sure what Solr would buy me in this case.

  • Sure there is; it's the centralized search you're talking about (though with a high number of searches you may again hit a limit with a VM setup, so keep an eye on this). We didn't do this because it introduced a lot of (we feel) unwarranted complexity into our technology stack and build process, but for a larger number of web servers it makes much more sense.
  • What does it buy you? Performance mainly, and dedicated indexing server(s). Instead of n servers crawling a network share (and competing for IO), they hit a single server that deals only with requests and results over the network, not with crawling the index, which is a lot more data going back and forth...that stays local on the Solr server(s). Also, you're not hitting your SQL server as much, since fewer servers are indexing.
  • What it doesn't buy you is as much redundancy, but it's up to you how important that is. If you can operate fine with degraded search or without it, simply have your app handle that. If you can't, then one or more backup Solr servers may also be a valid solution...though that is possibly another software stack to maintain.