针对树结构优化 SQL
如何从数据库中获取具有最佳性能的树形结构数据? 例如,假设数据库中有一个文件夹层次结构。 其中文件夹数据库行具有 ID、名称 和 ParentID 列。
您会使用特殊的算法一次获取所有数据,最大限度地减少数据库调用量并在代码中处理它吗?
或者您会使用对数据库进行多次调用并直接从数据库获取结构吗?
也许根据数据库行数、层次结构深度或其他因素有不同的答案?
编辑:我使用 Microsoft SQL Server,但从其他角度来看的答案也很有趣。
How would you get tree-structured data from a database with the best performance? For example, say you have a folder-hierarchy in a database. Where the folder-database-row has ID, Name and ParentID columns.
Would you use a special algorithm to get all the data at once, minimizing the amount of database-calls and process it in code?
Or would you use do many calls to the database and sort of get the structure done from the database directly?
Maybe there are different answers based on x amount of database-rows, hierarchy-depth or whatever?
Edit: I use Microsoft SQL Server, but answers out of other perspectives are interesting too.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(11)
并不适用于所有情况,但例如给定评论结构:
您还可以存储代表最上面评论的
TopCommentID
:其中
TopCommentID
和ParentCommentID当它是最上面的评论时,
为null
或0
。 对于子评论,ParentCommentID
指向其上方的评论,TopCommentID
指向最上面的父评论。Not going to work for all situations, but for example given a comment structure:
You could also store
TopCommentID
which represents the top most comment:Where the
TopCommentID
andParentCommentID
arenull
or0
when it's the topmost comment. For child comments,ParentCommentID
points to the comment above it, andTopCommentID
points to the topmost parent.这篇文章很有趣,因为它也展示了一些检索方法作为将谱系存储为派生列的一种方式。 沿袭提供了一种无需太多连接即可检索层次结构的快捷方法。
This article is interesting as it shows some retrieval methods as well as a way to store the lineage as a derived column. The lineage provides a shortcut method to retrieve the hierarchy without too many joins.
在 Oracle 中,有 SELECT ... CONNECT BY 语句来检索树。
In Oracle there is SELECT ... CONNECT BY statement to retrieve trees.
我喜欢存储与其parentID 相关联的ID 的简单方法:
它易于维护,并且非常可扩展。
I am a fan of the simple method of storing an ID associated with its parentID:
It is easy to maintain, and very scalable.
如果数据库中有很多树,并且只能取出整个树,我会在数据库中存储每个节点的树 ID(或根节点 ID)和父节点 ID,获取一个树的所有节点。特定的树 ID 和内存中的进程。
但是,如果要获取子树,则只能获取特定父节点 ID 的子树,因此您要么需要存储每个节点的所有父节点才能使用上述方法,要么在深入到子树时执行多个 SQL 查询树(希望树中没有循环!),尽管您可以重用相同的预准备语句(假设节点具有相同类型并且全部存储在单个表中)以防止重新编译 SQL,因此可能不会更慢,实际上,将数据库优化应用于查询可能会更好。 可能需要进行一些测试来找出答案。
如果您只存储一棵树,您的问题将变成仅查询子树之一,并应用第二个答案。
If you have many trees in the database, and you will only ever get the whole tree out, I would store a tree ID (or root node ID) and a parent node ID for each node in the database, get all the nodes for a particular tree ID, and process in memory.
However if you will be getting subtrees out, you can only get a subtree of a particular parent node ID, so you either need to store all parent nodes of each node to use the above method, or perform multiple SQL queries as you descend into the tree (hope there are no cycles in your tree!), although you can reuse the same Prepared Statement (assuming that nodes are of the same type and are all stored in a single table) to prevent re-compiling the SQL, so it might not be slower, indeed with database optimisations applied to the query it could be preferable. Might want to run some tests to find out.
If you are only storing one tree, your question becomes one of querying subtrees only, and the second answer applied.
针对层次结构的查询有几种常见类型。 大多数其他类型的查询都是这些查询的变体。
从父项中找到所有子项。
a. 到特定的深度。 例如,给定我的直系父母,所有深度为 1 的孩子都将是我的兄弟姐妹。
b. 到树的底部。
从孩子开始,找到所有父母。
a. 到特定的深度。 例如,我的直接父母是深度为 1 的父母。
b. 无限深度。
(a) 情况(特定深度)在 SQL 中更容易。 特殊情况(深度=1)在 SQL 中是微不足道的。 非零深度更难。 有限但非零的深度可以通过有限数量的连接来完成。 (b) 情况的深度不确定(从顶部到底部),非常困难。
如果你的树巨大(数百万个节点),那么无论你尝试做什么,你都会陷入一个受伤的世界。
如果您的树少于一百万个节点,只需将其全部提取到内存中并在那里进行处理即可。 在面向对象的世界中,生活要简单得多。 只需获取行并在返回行时构建树即可。
如果你有一棵巨大树,你有两个选择。
递归游标来处理无限的获取。 这意味着结构的维护时间复杂度为 O(1)——只需更新一些节点即可完成。 然而,获取的时间复杂度为 O(n*log(n)),因为您必须为每个具有子节点的节点打开一个游标。
聪明的“堆编号”算法可以对每个节点的起源进行编码。 一旦每个节点被正确编号,一个简单的 SQL SELECT 就可以用于所有四种类型的查询。 然而,对树结构的更改需要对节点重新编号,与检索成本相比,更改的成本相当高。
There are several common kinds of queries against a hierarchy. Most other kinds of queries are variations on these.
From a parent, find all children.
a. To a specific depth. For example, given my immediate parent, all children to a depth of 1 will be my siblings.
b. To the bottom of the tree.
From a child, find all parents.
a. To a specific depth. For example, my immediate parent is parents to a depth of 1.
b. To an unlimited depth.
The (a) cases (a specific depth) are easier in SQL. The special case (depth=1) is trivial in SQL. The non-zero depth is harder. A finite, but non-zero depth, can be done via a finite number of joins. The (b) cases, with indefinite depth (to the top, to the bottom), are really hard.
If you tree is HUGE (millions of nodes) then you're in a world of hurt no matter what you try to do.
If your tree is under a million nodes, just fetch it all into memory and work on it there. Life is much simpler in an OO world. Simply fetch the rows and build the tree as the rows are returned.
If you have a Huge tree, you have two choices.
Recursive cursors to handle the unlimited fetching. This means the maintenance of the structure is O(1) -- just update a few nodes and you're done. However fetching is O(n*log(n)) because you have to open a cursor for each node with children.
Clever "heap numbering" algorithms can encode the parentage of each node. Once each node is properly numbered, a trivial SQL SELECT can be used for all four types of queries. Changes to the tree structure, however, require renumbering the nodes, making the cost of a change fairly high compared to the cost of retrieval.
这实际上取决于您将如何访问该树。
一种巧妙的技术是为每个节点提供一个字符串 id,其中父节点的 id 是子节点的可预测子字符串。 例如,父级可能是“01”,子级可能是“0100”、“0101”、“0102”等。这样,您可以使用以下命令立即从数据库中选择整个子树:
因为标准是初始子字符串,ID 列上的索引将加快查询速度。
It really depends on how you are going to access the tree.
One clever technique is to give every node a string id, where the parent's id is a predictable substring of the child. For example, the parent could be '01', and the children would be '0100', '0101', '0102', etc. This way you can select an entire subtree from the database at once with:
Because the criterion is an initial substring, an index on the ID column would speed the query.
在 RDMS 中存储树的所有方法中,最常见的是邻接表和嵌套集。 嵌套集针对读取进行了优化,并且可以在单个查询中检索整个树。 邻接列表针对写入进行了优化,并且可以在简单的查询中添加到 with 中。
对于邻接列表,每个节点都有一个引用父节点或子节点的列(其他链接也是可能的)。 使用它,您可以基于父子关系构建层次结构。 不幸的是,除非您限制树的深度,否则您无法在一个查询中提取整个内容,并且读取它通常比更新它慢。
对于嵌套集合模型,情况正好相反,读取快速且简单,但更新变得复杂,因为您必须维护编号系统。 嵌套集模型通过使用基于预序的编号系统枚举所有节点来对起源和排序顺序进行编码。
我使用了嵌套集模型,虽然读取优化大型层次结构很复杂,但这是值得的。 一旦你做了一些绘制树和对节点编号的练习,你就应该掌握它的窍门。
我对这种方法的研究始于这篇文章:Managing Hierarchical Data in MySQL。
Out of all the ways to store a tree in a RDMS the most common are adjacency lists and nested sets. Nested sets are optimized for reads and can retrieve an entire tree in a single query. Adjacency lists are optimized for writes and can added to with in a simple query.
With adjacency lists each node a has column that refers to the parent node or the child node (other links are possible). Using that you can build the hierarchy based on parent child relationships. Unfortunately unless you restrict your tree's depth you cannot pull the whole thing in one query and reading it is usually slower than updating it.
With the nested set model the inverse is true, reading is fast and easy but updates get complex because you must maintain the numbering system. The nested set model encodes both parentage and sort order by enumerating all of the nodes using a preorder based numbering system.
I've used the nested set model and while it is complex for read optimizing a large hierarchy it is worth it. Once you do a few exercises in drawing out the tree and numbering the nodes you should get the hang of it.
My research on this method started at this article: Managing Hierarchical Data in MySQL.
在我开发的产品中,我们在 SQL Server 中存储了一些树结构,并使用上面提到的技术在记录中存储节点的层次结构。 即,
维护层次结构当然是棘手的一点,并且需要使用触发器。 但是在插入/删除/移动时生成它永远不会递归,因为父级或子级的层次结构具有您需要的所有信息。
您可以这样获得所有节点的后代:
这是插入触发器:
这是更新触发器:
还有一点,防止树节点中循环引用的检查约束:
我还建议使用触发器来防止多个根节点(空)父)每棵树,并防止相关节点属于不同的 TreeID(但这些比上面的更琐碎。)
您需要检查您的特定情况,看看该解决方案的性能是否可接受。 希望这可以帮助!
In the product I work on we have some tree structures stored in SQL Server and use the technique mentioned above to store a node's hierarchy in the record. i.e.
Maintaining the the hierarchy is the tricky bit of course and makes use of triggers. But generating it on an insert/delete/move is never recursive, because the parent or child's hierarchy has all the information you need.
you can get all of node's descendants thusly:
Here's the insert trigger:
and here's the update trigger:
one more bit, a check constraint to prevent a circular reference in tree nodes:
I would also recommend triggers to prevent more than one root node (null parent) per tree, and to keep related nodes from belonging to different TreeIDs (but those are a little more trivial than the above.)
You'll want to check for your particular case to see if this solution performs acceptably. Hope this helps!
Celko 对此写过文章(2000):
http://www.dbmsmag.com/9603d06.html
http://www.intelligententerprise.com/001020/celko1_1 .jhtml;jsessionid=3DFR02341QLDEQSNDLRSKHSCJUNN2JVN?_requestid=32818
和其他人问:
在oracle树查询中加入其他表
如何使用 SQL 计算树中值的总和
如何在数据库中存储目录/层次/树结构?
在 MYSQL 中执行递归存储过程来获取分层数据
将平面表解析为树的最有效/优雅的方法是什么?
最后,你可以查看rails“acts_as_tree”(读重)和“acts_as_nested_set”(写重)插件。 我没有比较它们的良好链接。
Celko wrote about this (2000):
http://www.dbmsmag.com/9603d06.html
http://www.intelligententerprise.com/001020/celko1_1.jhtml;jsessionid=3DFR02341QLDEQSNDLRSKHSCJUNN2JVN?_requestid=32818
and other people asked:
Joining other tables in oracle tree queries
How to calculate the sum of values in a tree using SQL
How to store directory / hierarchy / tree structure in the database?
Performance of recursive stored procedures in MYSQL to get hierarchical data
What is the most efficient/elegant way to parse a flat table into a tree?
finally, you could look at the rails "acts_as_tree" (read-heavy) and "acts_as_nested_set" (write-heavy) plugins. I don't ahve a good link comparing them.
谷歌搜索“物化路径”或“遗传树”......
Google for "Materialized Path" or "Genetic Trees"...