What is the "maximal" size (complexity) of database queries that are actually tractable in your RDBMS?
As the size of the query grows, a query to a database can easily become computationally intractable for the RDBMS you use in practice. So, I suppose, in order to use DBs in practice (i.e., to program with a DB as a backend), you must know where the bound on the complexity/size of an admissible query lies.
If you write programs that need to issue complex queries to relational databases, what is the "maximal" size/complexity of the queries that the RDBMS you use can be expected to answer effectively?
And what is the usual size of the queries posed to relational database systems? How far below the maximal bound is it?
The motivation for asking this is the following theoretical speculation:
It seems to be known that to find an answer to a query Q over a database D, one needs time |D|^|Q| in the worst case, and that one cannot get rid of the exponent |Q|. (Looking for a clique is an example of such a worst-case query.) Since D can be very large in practice, we wonder why databases work at all.
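To make the clique example concrete, here is a small sketch (the graph data is invented for illustration): a k-clique query needs k(k-1)/2 self-joins of the edge table, so the query text grows with k, and a naive evaluation enumerates on the order of |D|^|Q| candidate tuples.

```python
import sqlite3

# Toy graph: vertices 1..4 form a 4-clique, plus a stray edge (4,5).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (a INTEGER, b INTEGER)")
pairs = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
# Store each undirected edge in both directions.
conn.executemany("INSERT INTO edge VALUES (?, ?)",
                 pairs + [(b, a) for a, b in pairs])

# k = 3 (triangles): three self-joins, one per edge of the clique.
# A k-clique query follows the same pattern with k*(k-1)/2 self-joins.
triangles = conn.execute("""
    SELECT e1.a, e1.b, e2.b
    FROM edge e1, edge e2, edge e3
    WHERE e1.b = e2.a AND e2.b = e3.a AND e3.b = e1.a
      AND e1.a < e1.b AND e1.b < e2.b   -- report each triangle once
""").fetchall()
print(sorted(triangles))  # → [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]
```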
As a side note, I'd point out an issue with your question: you're assuming you'll always want a precise answer to a query. This is not the case in practice. When mining large amounts of data, an approximation of the answer will be good enough.
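A minimal sketch of that point, with an invented table: instead of counting every matching row, count over a deterministic 1-in-10 sample and scale up — the estimate lands within a fraction of a percent of the exact answer here.

```python
import sqlite3

# Invented table: 100,000 events, two thirds of which are clicks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, "click" if i % 3 else "view") for i in range(1, 100001)])

exact = conn.execute(
    "SELECT COUNT(*) FROM events WHERE kind = 'click'").fetchone()[0]
# Count only a deterministic 1-in-10 sample, then scale back up.
sample = conn.execute(
    "SELECT COUNT(*) FROM events WHERE id % 10 = 0 AND kind = 'click'"
).fetchone()[0]
estimate = sample * 10
print(exact, estimate)  # → 66667 66670
```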
In the case of PostgreSQL, I'm not aware of any hard-coded limit to the number of joins, but depending on the transaction isolation level I'd expect to run out of locks long before it's reached.
Queries thrown at an RDBMS, in my experience, have a few joins at most and are written in such a way that they can use indexes. When they're not, the developer is usually doing something very wrong.
There arguably is the occasional report query that tends to be slower. These might involve much more complicated statements, with dozens of joins, unions, aggregates and what not. But in this case a genetic algorithm kicks in, for one. And upon reaching the collapse limits, the planner will respect the join order, making it possible to write the query in an optimal way given advance knowledge of how the data is distributed.
I've seen PostgreSQL swallow queries with two dozen joins without a hiccup... More typically, though, it's possible and more efficient to split such queries into smaller, bite-sized chunks, and/or to pre-aggregate some of the results they'll need.
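The pre-aggregation idea above can be sketched like this (the customers/orders schema is made up): aggregate the large table first, then join the much smaller intermediate result, rather than joining everything and aggregating at the end. Both forms return the same rows.

```python
import sqlite3

# Invented schema: a small customers table and a (potentially huge) orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# One big join-then-aggregate query...
big = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall()

# ...versus pre-aggregating the large table and joining the small result.
chunked = conn.execute("""
    SELECT c.name, t.total
    FROM (SELECT customer_id, SUM(amount) AS total
          FROM orders GROUP BY customer_id) t
    JOIN customers c ON c.id = t.customer_id
""").fetchall()
print(sorted(big), sorted(chunked))  # both: [('alice', 15.0), ('bob', 7.5)]
```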
For row counts on large queries or data sets, running
explain
and returning the planner's estimated number of rows is usually enough: there's little point in knowing there are exactly 9,992 matching rows.
This is a very good question, in my opinion. In a typical scenario, human queries seem to be small and simple (for instance, contain few cycles, if any), and RDBMSs are really efficient. Now imagine a situation where you formulate your query in a certain vocabulary available to the user, which has to be translated by a computer to the vocabulary of the relational databases (say, on the Web). This is a typical Semantic Web scenario, for which languages like OWL 2 have been designed. In this case, your original query may be small, but the resulting query, posed to an RDBMS, can be exponentially larger.