Calling search gurus: Lucene's numeric range search performance?

Posted on 2024-09-28 06:32:19

I'm working on a system that matches large sets of records on strings, numeric ranges, and date ranges. As far as I can tell, the string matches are mostly exact, as opposed to the looser full-text search results that I understand Lucene is generally designed for. Numeric precision is important because the data concerns prices.

I noticed that Lucene recently added some support for numeric range searching, but that is not what it was originally designed for.

Currently the system uses procedural SQL to do the matching, and it is reaching the limits of its scalability. I'm researching ways to scale the system horizontally, and search engine technology seems like a possibility, given that some of these technologies scale to very large data sets while still returning results very quickly. I'd like to investigate whether it's possible to take a lot of load off the database by doing the matching against Lucene-generated metadata, without hitting the database for the full records until the matching rules have determined what should be retrieved. Eventually I would like to aim for near-real-time results, although we are a long way from that at this point.

My question is as follows: is it likely that Lucene would perform many times faster, and scale to larger data sets more cheaply, than an RDBMS for this type of indexing and searching?
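
To make the two-phase flow described above concrete, here is a minimal sketch in Python. It is not Lucene: the "index" is a plain dictionary plus a sorted list standing in for the search-engine metadata, and the table name, field names, and sample records are all hypothetical. Prices are held as integer cents, which is one common way to keep range comparisons exact.

```python
import bisect
import sqlite3

# Hypothetical records: (id, category, price in integer cents, description).
RECORDS = [
    (1, "book", 1299, "Paperback novel"),
    (2, "book", 2499, "Hardcover atlas"),
    (3, "toy", 1999, "Wooden train"),
    (4, "book", 1999, "Cookbook"),
]

# The database keeps the full records.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, category TEXT, price_cents INTEGER, description TEXT)"
)
db.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", RECORDS)

# The stand-in "index" keeps only the fields used for matching.
by_category = {}
by_price = []
for rec_id, category, price_cents, _ in RECORDS:
    by_category.setdefault(category, set()).add(rec_id)
    by_price.append((price_cents, rec_id))
by_price.sort()


def match_ids(category, low_cents, high_cents):
    """Phase 1: exact category match AND inclusive price range, ids only."""
    start = bisect.bisect_left(by_price, (low_cents, -1))
    end = bisect.bisect_right(by_price, (high_cents, float("inf")))
    in_range = {rec_id for _, rec_id in by_price[start:end]}
    return sorted(in_range & by_category.get(category, set()))


def fetch_records(ids):
    """Phase 2: hit the database only for the ids that matched."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    query = (
        "SELECT id, description FROM records "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return db.execute(query, ids).fetchall()


matched = match_ids("book", 1500, 2500)
rows = fetch_records(matched)
```

Whether this split actually relieves the database depends on how selective the matching rules are; the sketch only shows the shape of the flow.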

Comments (3)

我一向站在原地 2024-10-05 06:32:19
  1. Lucene stores its numeric values as a trie; a SQL implementation will probably store them in a B-tree or an R-tree. The way Lucene stores its trie and the way SQL uses an R-tree are pretty similar, so I would be surprised if you saw a huge difference (unless you leveraged some of the scalability that comes from Solr).
  2. On the general question of Lucene vs. SQL full-text performance, a good study I've found is: Jing, Y., C. Zhang, and X. Wang. "An Empirical Study on Performance Comparison of Lucene and Relational Database." In Communication Software and Networks, 2009. ICCSN '09. International Conference on, 336-340. IEEE, 2009.

First, when executing exact query, the performance of Lucene is much better than that of unindexed-RDB, while is almost same as that of indexed-RDB. Second, when the wildcard query is a prefix query, then the indexed-RDB and Lucene both perform very well still by leveraging the index... Third, for combinational query, Lucene performs smoothly and usually costs little time, while the query time of RDB is related to the combinational search conditions and the number of indexed fields. If some fields in the combinational condition haven't been indexed, search will cost much more time. Fourth, the query time of Lucene and unindexed-RDB has relations with the record complexity, but the indexed-RDB is nearly independent of it.

In short, if you are doing a search like "select * where x = y", it doesn't matter which you use. The more clauses you add, as in (x = y OR (x = z AND y = x)...), the better Lucene looks by comparison.
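
One way to see why extra clauses stay cheap for an inverted index: each clause resolves to a set of matching document ids, and the boolean operators just combine those sets. The sketch below uses plain Python sets and made-up documents and field names; a real engine works on sorted posting lists rather than sets, so treat this as an illustration of the idea only.

```python
from collections import defaultdict

# Hypothetical documents, keyed by id.
DOCS = {
    1: {"brand": "acme", "color": "red", "size": "m"},
    2: {"brand": "acme", "color": "blue", "size": "l"},
    3: {"brand": "zenith", "color": "red", "size": "m"},
    4: {"brand": "zenith", "color": "blue", "size": "m"},
}

# Inverted index: (field, value) -> set of document ids containing it.
postings = defaultdict(set)
for doc_id, fields in DOCS.items():
    for field, value in fields.items():
        postings[(field, value)].add(doc_id)


def term(field, value):
    """Ids matching one exact-match clause (empty set if the term is unseen)."""
    return postings.get((field, value), set())


# (brand = acme) OR (color = red AND size = m)
result = term("brand", "acme") | (term("color", "red") & term("size", "m"))
```

Every field is indexed the same way here, which is why there is no equivalent of the "unindexed column" case the study describes for the RDB.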

They don't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing, etc.
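
On the first point above, the trie idea can be sketched in a few lines. Each number is indexed under several prefix terms of decreasing precision, and a range query is covered by a handful of aligned prefix blocks instead of one term per value. This is a simplified illustration, not Lucene's actual algorithm: the function names, the 16-bit width, and the 4-bit step are assumptions chosen to keep the example small.

```python
BITS = 16  # assumed width of the indexed values
STEP = 4   # assumed precision step: bits dropped per trie level


def index_terms(value):
    """Prefix terms one value is indexed under: (shift, value >> shift)."""
    return {(shift, value >> shift) for shift in range(0, BITS, STEP)}


def split_range(lo, hi):
    """Cover the inclusive range [lo, hi] with aligned prefix blocks."""
    terms = []
    cur = lo
    while cur <= hi:
        shift = 0
        # Grow the block while the next level is still a valid shift, the
        # block stays aligned at cur, and it does not run past hi.
        while (
            shift + STEP < BITS
            and cur % (1 << (shift + STEP)) == 0
            and cur + (1 << (shift + STEP)) - 1 <= hi
        ):
            shift += STEP
        terms.append((shift, cur >> shift))
        cur += 1 << shift
    return terms


def matches(value, query_terms):
    """A value is in range if any of its indexed terms is a query term."""
    return not index_terms(value).isdisjoint(query_terms)


# Prices in integer cents: 9.95 to 50.00 inclusive.
query = set(split_range(995, 5000))
```

The range spans 4006 distinct values but is covered by far fewer terms, which is the same trade a B-tree makes by descending through interior nodes instead of scanning leaves.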

梦冥 2024-10-05 06:32:19

I suggest you read Marc Krellenstein's "Full Text Search Engines vs DBMS".

A relatively easy way to start using Lucene is by trying Solr. You can scale Lucene and Solr using replication and sharding.
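
As a rough picture of what sharding means for a range query, the sketch below splits documents across a few in-memory shards by id and merges the per-shard hits. It does not use Solr; the shard count, routing rule, and function names are hypothetical, and replication (serving the same shard from several copies to spread query load) is not modeled.

```python
NUM_SHARDS = 3  # assumed number of shards

# Each shard holds its own slice of the data: doc id -> price in cents.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(doc_id):
    """Route a document to a shard; a simple modulo rule for illustration."""
    return doc_id % NUM_SHARDS


def index(doc_id, price_cents):
    shards[shard_for(doc_id)][doc_id] = price_cents


def search_shard(shard, low, high):
    """Each shard answers the range query over its own documents only."""
    return [doc_id for doc_id, price in shard.items() if low <= price <= high]


def search(low, high):
    """Scatter the query to every shard, then gather and merge the hits."""
    hits = []
    for shard in shards:
        hits.extend(search_shard(shard, low, high))
    return sorted(hits)


for i in range(1, 11):
    index(i, i * 100)

found = search(250, 600)
```

In a real deployment the per-shard searches would run in parallel on separate machines, which is where the horizontal scaling comes from.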

这个俗人 2024-10-05 06:32:19

At its heart, and in its simplest form, Lucene is a word density search engine. Lucene can scale to handle extremely large data sets and, when indexed correctly, returns results at blistering speed. For text-based searching, it is very probable that results will come back quicker from Lucene than from SQL Server, Oracle, or MySQL. That being said, it is unfair to compare Lucene to a traditional RDBMS, as the two have completely different uses.
