Calling search gurus: numeric range search performance with Lucene?
I'm working on a system that matches large sets of records on strings, numeric ranges and date ranges. As far as I can tell, the string matches are mostly exact matches, as opposed to the less exact full-text-search results that I understand Lucene is generally designed for. Numeric precision is important because the data concerns prices.
I noticed that Lucene recently added some support for numeric range searching, but that is not what it was originally designed for.
Currently the system does the matching in procedural SQL, and it is reaching the limits of its scalability. I'm researching ways to scale the system horizontally, and search engine technology seems like a possibility, given that some of these technologies scale to very large data sets while still returning results very quickly. I'd like to find out whether I can take a lot of load off the database by doing the matching against the Lucene-generated metadata, without hitting the database for the full records until the matching rules have determined what should be retrieved. Eventually I would like to aim for near-real-time results, although we are a long way from that at this point.
My question is as follows: Is it likely that Lucene would perform many times faster and scale to greater data sets more cheaply than an RDBMS for this type of indexing and searching?
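To make the idea concrete, here is a minimal sketch of the two-stage approach described above: match against a compact numeric index first, then fetch full records only for the ids that matched. It uses only the Python standard library and is not Lucene's API. The names `PriceIndex`, `to_cents` and `fetch_full_records`, and the in-memory `database` dict, are all made up for illustration. It assumes prices have at most two decimal places, so that storing them as integer cents keeps range comparisons exact.

```python
from bisect import bisect_left, bisect_right
from decimal import Decimal


def to_cents(price: str) -> int:
    """Encode a decimal price string as an exact integer number of cents.

    Assumes at most two decimal places; extra digits would be truncated.
    """
    return int(Decimal(price) * 100)


class PriceIndex:
    """Sorted (cents, record_id) pairs; a stand-in for a numeric index."""

    def __init__(self, records):
        # records: iterable of (record_id, price_string)
        self._entries = sorted((to_cents(p), rid) for rid, p in records)
        self._keys = [cents for cents, _ in self._entries]

    def range_ids(self, low: str, high: str):
        """Return ids with low <= price <= high using two binary searches."""
        lo = bisect_left(self._keys, to_cents(low))
        hi = bisect_right(self._keys, to_cents(high))
        return [rid for _, rid in self._entries[lo:hi]]


def fetch_full_records(ids, database):
    """Hypothetical second stage: only matched ids reach the database."""
    return [database[i] for i in ids]


# Toy "database" of full records, keyed by id.
database = {
    1: {"id": 1, "name": "widget", "price": "9.99"},
    2: {"id": 2, "name": "gadget", "price": "10.00"},
    3: {"id": 3, "name": "gizmo", "price": "10.01"},
    4: {"id": 4, "name": "doohickey", "price": "25.50"},
}

index = PriceIndex((r["id"], r["price"]) for r in database.values())
matched = index.range_ids("10.00", "25.50")      # stage 1: index only
records = fetch_full_records(matched, database)  # stage 2: full records
```

The point of the integer encoding is that both range boundaries are inclusive and exact, which is what matters for prices. Whether a real search index beats the existing SQL for this workload is something only a benchmark on the actual data can answer.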
Comments (3)
In short, if you are doing a search like "select * where x = y", it doesn't matter which you use. The more clauses you add, as in (x = y OR (x = z AND y = x)...), the better Lucene looks by comparison.
They don't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing, etc.
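One way to picture why multi-clause exact-match queries suit an inverted index: each `field = value` clause resolves to a set of matching record ids, and AND/OR become set intersection and union. The sketch below is a simplified illustration using Python sets, not Lucene's API or its actual implementation, which as far as I know walks sorted postings lists with iterators. `ExactMatchIndex` and the field names are invented for the example.

```python
from collections import defaultdict


class ExactMatchIndex:
    """Maps (field, value) to the set of record ids holding that value."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, record_id, fields):
        # Index every field/value pair of the record.
        for field, value in fields.items():
            self._postings[(field, value)].add(record_id)

    def term(self, field, value):
        # Return a copy so callers cannot mutate the index.
        return set(self._postings.get((field, value), ()))


index = ExactMatchIndex()
index.add(1, {"currency": "USD", "region": "EU", "status": "open"})
index.add(2, {"currency": "EUR", "region": "EU", "status": "open"})
index.add(3, {"currency": "EUR", "region": "US", "status": "closed"})
index.add(4, {"currency": "GBP", "region": "EU", "status": "closed"})

# currency = USD OR (currency = EUR AND region = EU)
hits = index.term("currency", "USD") | (
    index.term("currency", "EUR") & index.term("region", "EU")
)
```

Adding another clause costs one more lookup and one more set operation, and no full record is read along the way.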
I suggest you read Marc Krellenstein's "Full Text Search Engines vs DBMS".
A relatively easy way to start using Lucene is by trying Solr. You can scale Lucene and Solr using replication and sharding.
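For a rough idea of what sharding means here, the sketch below partitions documents across several in-memory shards, sends each query to every shard, and merges the partial results. It is a toy illustration only: `Shard` and `ShardedIndex` are hypothetical names, routing by `doc_id % shard_count` is an assumption made for simplicity, and none of this is Solr's API. Replication (keeping copies of each shard for read throughput and failover) is not shown.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One partition of the index; here just a dict of id -> price in cents."""

    def __init__(self):
        self.docs = {}

    def search(self, low, high):
        # Inclusive range match over this shard's documents only.
        return [d for d, cents in self.docs.items() if low <= cents <= high]


class ShardedIndex:
    def __init__(self, shard_count):
        self.shards = [Shard() for _ in range(shard_count)]

    def add(self, doc_id, cents):
        # Route by id so each document lives on exactly one shard.
        self.shards[doc_id % len(self.shards)].docs[doc_id] = cents

    def search(self, low, high):
        # Scatter the query to every shard, then gather and merge.
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = pool.map(lambda s: s.search(low, high), self.shards)
            return sorted(d for part in partials for d in part)


index = ShardedIndex(shard_count=3)
for doc_id, cents in enumerate([999, 1000, 1001, 2550, 4000, 120], start=1):
    index.add(doc_id, cents)

result = index.search(1000, 3000)
```

Because each shard only scans its own slice, adding shards spreads both the data and the query work across machines, which is the horizontal scaling the question asks about.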
At its heart, and in its simplest form, Lucene is a word density search engine. It can scale to extremely large data sets and, when the data is indexed correctly, returns results at blistering speed. For text-based searching it is very probable that results will come back quicker from Lucene than from SQL Server, Oracle or MySQL. That said, it is unfair to compare Lucene to a traditional RDBMS, as the two have completely different uses.