在 SQL 中管理层次结构:MPTT/嵌套集 vs 邻接表 vs 存储路径
一段时间以来,我一直在思考如何最好地处理 SQL 中的层次结构。由于对邻接列表的限制和 MPTT/嵌套集的复杂性感到沮丧,我开始考虑简单地将关键路径存储为简单的 node_key/node_key/...
字符串。我决定编译这三种技术的优缺点:
创建/删除/移动节点所需的调用次数:
- 邻接 = 1
- MPTT = 3
- 路径 = 1 (将包含该节点的所有节点的旧节点路径替换为新节点路径) path)
获取树所需的调用次数:
- Adjacency = [子级别数]
- MPTT = 1
- Path = 1
获取节点/祖先的路径所需的调用次数:
- Adjacency = [number上级数]
- MPTT = 1
- Path = 0
获取子节点数所需的调用次数:
- Adjacency = [子级数]
- MPTT = 0 (可以根据右/左值计算)
- Path = 1
所需的调用数获取节点深度:
- Adjacency = [超级级别数]
- MPTT = 1
- Path = 0
所需的数据库字段:
- Adjacency = 1(父)
- MPTT = 3(父、右、左)
- Path = 1(路径)
结论
在除一个用例之外的每个用例中,存储路径技术都使用与其他技术相同或更少的调用。通过此分析,存储路径显然是赢家。更不用说,它实现起来更简单,人类可读等等。
所以问题是,存储路径不应该被认为是比 MPTT 更强的技术吗?为什么存储路径不是更常用的技术,以及为什么在给定实例中不通过 MPTT 使用它们?
另外,如果您认为此分析不完整,请告诉我。
更新:
以下是 MPTT 至少可以开箱即用执行存储路径解决方案无法执行的两件事:
- 允许计算每个节点的子节点计数,而无需任何其他查询(如上所述)。
- 对给定级别的节点施加命令。其他解决方案是无序的。
For a while now I've been wrestling with how best to handle hierarchies in SQL. Frustrated by the limitations of adjacency lists and the complexity of MPTT/nested sets, I began thinking about simply storing key paths instead, as a simple node_key/node_key/...
string. I decided to compile the pros and cons of the three techniques:
Number of calls required to create/delete/move a node:
- Adjacency = 1
- MPTT = 3
- Path = 1 (Replace old node path with new node path across all nodes that contain that path)
Number of calls required to get a tree:
- Adjacency = [number of sub-levels]
- MPTT = 1
- Path = 1
Number of calls required to get path to a node / ancestry:
- Adjacency = [number of super-levels]
- MPTT = 1
- Path = 0
Number of calls required to get number of subnodes:
- Adjacency = [number of sub-levels]
- MPTT = 0 (Can be calculated from right/left values)
- Path = 1
Number of calls required to get depth of node:
- Adjacency = [number of super-levels]
- MPTT = 1
- Path = 0
DB fields required:
- Adjacency = 1 (parent)
- MPTT = 3 (parent,right,left)
- Path = 1 (path)
Conclusion
The stored path technique uses the same or less calls than the other techniques in every use case except one. By this analysis, storing paths is a clear winner. Not to mention, it's a lot simpler to implement, human readable, etc.
So the question is, shouldn't stored paths be considered a stronger technique than MPTT? Why are stored paths not a more commonly used technique, and why would you not use them over MPTT in a given instance?
Also, if you think this analysis is incomplete please let me know.
UPDATE:
Here are at least 2 things MPTT can do out of the box that a stored path solution won't:
- Allows calculation of subnode count for each node without any additional queries (mentioned above).
- Imposes an order on nodes at a given level. The other solutions are unordered.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您还可以考虑我在对 将平面表解析为树的最有效/优雅的方法是什么?
需要调用创建/删除/移动节点:
获取树所需的调用:
获取节点/祖先的路径所需的调用:
获取子节点数量所需的调用:
获取所需的调用节点深度:
所需的数据库字段:
还有其他一些考虑因素:
支持无限深度:
支持引用完整性:
我还在演示中介绍了 Closure Table 使用 SQL 和的分层数据模型PHP,以及我的书 SQL 反模式卷 1:避免数据库编程的陷阱。
You might also consider the Closure Table design I describe in my answer to What is the most efficient/elegant way to parse a flat table into a tree?
Calls required to create/delete/move a node:
Calls required to get a tree:
Calls required to get path to a node / ancestry:
Calls required to get number of subnodes:
Calls required to get depth of node:
DB fields required:
There are a couple of other considerations:
Supports unlimited depth:
Supports referential integrity:
I also cover Closure Table in my presentation Models for Hierarchical Data with SQL and PHP, and my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
您的结论的问题在于它忽略了与树木相关的大多数问题。
通过将技术的有效性降低到“调用次数”,您实际上忽略了众所周知的数据结构和算法试图解决的所有问题;也就是说,最快的执行速度以及较低的内存和资源占用量。
对 SQL 服务器的“调用次数”可能看起来是一个很好的度量标准(“看起来代码更少”),但是如果结果是一个永远不会完成、运行缓慢或占用太多空间的程序,那么它就是实际上是一个无用的指标。
通过存储每个节点的路径,您并没有创建树数据结构。相反,您正在创建一个列表。树旨在优化的任何操作都会丢失。
对于小日期集,这可能很难看到(在许多小树的情况下,列表更好),在大小为 500、1000、10k 的数据集上尝试一些示例——您很快就会明白为什么不存储整个路径好主意。
It problem with your conclusion is that it ignores most of the issues involved in working with trees.
By reducing the validity of a technique to the "number of calls" you effectively ignore all of the issues which well understood data structures and algorithms attempt to solve; that is, fastest execution and low memory and resource foot print.
The "number of calls" to an SQL server may seem like a good metric to use ("look ma less code"), but if the result is a program which never finishes, runs slowly, or takes up to much space, it is in fact a useless metric.
By storing the path with every node you are not creating a tree data structure. Instead you are creating a list. Any operation which a tree is designed to optimize is lost.
This might be hard to see with small date sets (and in many cases of small trees a list is better), try some examples on data sets of size 500, 1000, 10k -- You will quickly see why storing the whole path is not a good idea.