当前位置：文江博客话题详情

SQL PostgreSQL materialized-path-pattern

用物化路径对树进行排序？

发布于 2024-08-31 11:27:28 字数 1686 浏览 7 评论 0 原文

我的表中有一个树结构，它使用物化路径让我可以快速找到孩子。但是，我还需要对结果进行深度优先排序，正如人们对线程论坛回复所期望的那样。

 id | parent_id | matpath |          created           
----+-----------+---------+----------------------------
  2 |         1 | 1       | 2010-05-08 15:18:37.987544
  3 |         1 | 1       | 2010-05-08 17:38:14.125377
  4 |         1 | 1       | 2010-05-08 17:38:57.26743
  5 |         1 | 1       | 2010-05-08 17:43:28.211708
  7 |         1 | 1       | 2010-05-08 18:18:11.849735
  6 |         2 | 1.2     | 2010-05-08 17:50:43.288759
  9 |         5 | 1.5     | 2010-05-09 14:02:43.818646
  8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695

所以最终结果实际上应该这样排序：

 id | parent_id | matpath |          created
----+-----------+---------+----------------------------
  2 |         1 | 1       | 2010-05-08 15:18:37.987544
  6 |         2 | 1.2     | 2010-05-08 17:50:43.288759
  8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695
  3 |         1 | 1       | 2010-05-08 17:38:14.125377
  4 |         1 | 1       | 2010-05-08 17:38:57.26743
  5 |         1 | 1       | 2010-05-08 17:43:28.211708
  9 |         5 | 1.5     | 2010-05-09 14:02:43.818646
  7 |         1 | 1       | 2010-05-08 18:18:11.849735

我将如何解决这个问题？我可以直接使用 SQL（这是 PostgreSQL 8.4）执行此操作吗？还是应该向此表中添加其他信息？

更新：尝试更好地解释排序标准。

想象一下，id“1”是论坛的根帖子，并且“matpath”以“1”开头的所有内容都是该帖子的子帖子。因此 ids 2 到 5 是对 1 的直接回复，并获得“1”的 matpath。但是，id 6 是回复 2，而不是直接回复 1，因此它的 matpath 为 1.2。这意味着，对于具有适当嵌套的线程论坛，表中显示所有 id，论坛的结构将如下所示，因此排序要求是：

* id 1 (root post)
    * id 2
        * id 6
            * id 8
    * id 3
    * id 4
    * id 5
        * id 9
    * id 7

原文

I have a tree structure in a table and it uses materialized paths to allow me to find children quickly. However, I also need to sort the results depth-first, as one would expect with threaded forum replies.

 id | parent_id | matpath |          created           
----+-----------+---------+----------------------------
  2 |         1 | 1       | 2010-05-08 15:18:37.987544
  3 |         1 | 1       | 2010-05-08 17:38:14.125377
  4 |         1 | 1       | 2010-05-08 17:38:57.26743
  5 |         1 | 1       | 2010-05-08 17:43:28.211708
  7 |         1 | 1       | 2010-05-08 18:18:11.849735
  6 |         2 | 1.2     | 2010-05-08 17:50:43.288759
  9 |         5 | 1.5     | 2010-05-09 14:02:43.818646
  8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695

So the final results should actually be sorted like this:

 id | parent_id | matpath |          created
----+-----------+---------+----------------------------
  2 |         1 | 1       | 2010-05-08 15:18:37.987544
  6 |         2 | 1.2     | 2010-05-08 17:50:43.288759
  8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695
  3 |         1 | 1       | 2010-05-08 17:38:14.125377
  4 |         1 | 1       | 2010-05-08 17:38:57.26743
  5 |         1 | 1       | 2010-05-08 17:43:28.211708
  9 |         5 | 1.5     | 2010-05-09 14:02:43.818646
  7 |         1 | 1       | 2010-05-08 18:18:11.849735

How would I work that out? Can I do that in straight SQL (this is PostgreSQL 8.4) or should additional information be added to this table?

Update: trying to explain sort criteria better.

Imagine that id '1' is the root post to a forum and everything with a 'matpath' beginning with '1' is a child of that post. So ids 2 through 5 are direct replies to 1 and get matpaths of '1'. However, id 6 is a reply 2, not directly to 1, so it gets a matpath of 1.2. This means that for a threaded forum with proper nesting, with all ids shown in the tables, the structure of the forum would look like this, hence the ordering requirement:

* id 1 (root post)
    * id 2
        * id 6
            * id 8
    * id 3
    * id 4
    * id 5
        * id 9
    * id 7

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

荒岛晴空 2024-09-07 11:27:28

我相信你的物质化道路是不正确的。

你有什么逻辑可以对这样的事情进行排序

为什么第二个 1 不与第一个 1 在一起？

如果你有

这将是微不足道的。

编辑：
我查看了您的示例，您没有存储行的物化路径，而是存储父行的物化路径。
下面是该行的具体化路径的实际外观。如果您将其存储为不超过 9 个分支，则直接在 matpath 上排序是可行的：

 id | parent_id | matpath   |          created
----+-----------+-----------+----------------------------
  2 |         1 | 1.2       | 2010-05-08 15:18:37.987544
  6 |         2 | 1.2.6     | 2010-05-08 17:50:43.288759
  8 |         6 | 1.2.6.8   | 2010-05-09 14:01:17.632695
  3 |         1 | 1.3       | 2010-05-08 17:38:14.125377
  4 |         1 | 1.4       | 2010-05-08 17:38:57.26743
  5 |         1 | 1.5       | 2010-05-08 17:43:28.211708
  9 |         5 | 1.5.9     | 2010-05-09 14:02:43.818646
  7 |         1 | 1.7       | 2010-05-08 18:18:11.849735

否则（> 9）您必须将 matpath 转换为类似的内容，

001.002.006
001.002.006.008

最多支持 999分支机构。

请注意

，即使使用 4 个固定数字的方法，例如 0001.0002.0006 也会给您一个比接受的答案中更短的字段，
您可以使用用户函数动态解析 matpath 并生成排序值
您可以直接以这种格式存储 matpath （它还有其他一些不错的属性）

I believe your materialized path is not right.

What logic do you get to sort things like this

Why is the second 1 not together with the first one?

If you had

This would be trivial.

EDIT:
I have looked at your example and you are not storing materialized path of a row, but you are storing a materialized path of the parent row.
Here's how the materialized path of the row actually should look like. Sorting directly on matpath would work if you would not have more than 9 branches if you stored it as:

 id | parent_id | matpath   |          created
----+-----------+-----------+----------------------------
  2 |         1 | 1.2       | 2010-05-08 15:18:37.987544
  6 |         2 | 1.2.6     | 2010-05-08 17:50:43.288759
  8 |         6 | 1.2.6.8   | 2010-05-09 14:01:17.632695
  3 |         1 | 1.3       | 2010-05-08 17:38:14.125377
  4 |         1 | 1.4       | 2010-05-08 17:38:57.26743
  5 |         1 | 1.5       | 2010-05-08 17:43:28.211708
  9 |         5 | 1.5.9     | 2010-05-09 14:02:43.818646
  7 |         1 | 1.7       | 2010-05-08 18:18:11.849735

otherwise (>9) you would have to turn the matpath into something like

001.002.006
001.002.006.008

that would support up to 999 branches.

Please note

even the approach with 4 fixed digits, such as 0001.0002.0006 would give you a field that is shorter then in the accepted answer
you could parse matpath an produce sorting value on the fly with a user function
you could directly store matpath in this format (it has some other nice properties, too)

回复收藏 0 原文

累赘 2024-09-07 11:27:28

我通常为此创建一个附加列，称为 SortPath。它将包含您需要排序并连接在一起的数据。该列的类型为 varchar，并作为字符串进行排序。像这样的事情：

id | parent_id | matpath |          created            |                   sortpath
---+-----------+---------+-----------------------------+--------------------------------------------------------------------------------------
 2 |         1 | 1       | 2010-05-08 15:18:37.987544  | 2010-05-08 15:18:37.987544-2
 6 |         2 | 1.2     | 2010-05-08 17:50:43.288759  | 2010-05-08 15:18:37.987544-2.2010-05-08 17:50:43.288759-6
 8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695  | 2010-05-08 15:18:37.987544-2.2010-05-08 17:50:43.288759-6.2010-05-09 14:01:17.632695-8
 3 |         1 | 1       | 2010-05-08 17:38:14.125377  | 2010-05-08 17:38:14.125377-3
 4 |         1 | 1       | 2010-05-08 17:38:57.26743   | 2010-05-08 17:38:57.267430-4 
 5 |         1 | 1       | 2010-05-08 17:43:28.211708  | 2010-05-08 17:43:28.211708-5
 9 |         5 | 1.5     | 2010-05-09 14:02:43.818646  | 2010-05-08 17:43:28.211708-5.2010-05-09 14:02:43.818646-9
 7 |         1 | 1       | 2010-05-08 18:18:11.849735  | 2010-05-08 18:18:11.849735-7

这里需要注意一些事情：

sortpath 将作为字符串进行排序，因此所有日期都具有相同的长度才能正确排序，这一点很重要。例如，观察 2010-05-08 17:38:57.26743 如何在 sortpath 列中添加一个额外的零。
我已将每个节点的 PK 附加到其日期末尾。这样，如果您碰巧有两行具有完全相同的日期，由于我们附加了附加数据，它们将始终以相同的顺序返回。
对我来说，数据看起来与我编写的方式不对称，因为我们在 sortpath 中显示当前节点的日期，但它不在 matpath 中。我更愿意在两者中看到它。
您可能还想将节点 ID 1 的日期放在每个 sortcolumn 的开头。这样一来，如果您想一次查询多个论坛（您可能不会），那么它仍然会正确排序。

I typically create an additional columnn for this, called something like SortPath. It would contain the data that you need to sort by, concatenated together. That column would be of type varchar, and get sorted as a string. Something like this:

id | parent_id | matpath |          created            |                   sortpath
---+-----------+---------+-----------------------------+--------------------------------------------------------------------------------------
 2 |         1 | 1       | 2010-05-08 15:18:37.987544  | 2010-05-08 15:18:37.987544-2
 6 |         2 | 1.2     | 2010-05-08 17:50:43.288759  | 2010-05-08 15:18:37.987544-2.2010-05-08 17:50:43.288759-6
 8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695  | 2010-05-08 15:18:37.987544-2.2010-05-08 17:50:43.288759-6.2010-05-09 14:01:17.632695-8
 3 |         1 | 1       | 2010-05-08 17:38:14.125377  | 2010-05-08 17:38:14.125377-3
 4 |         1 | 1       | 2010-05-08 17:38:57.26743   | 2010-05-08 17:38:57.267430-4 
 5 |         1 | 1       | 2010-05-08 17:43:28.211708  | 2010-05-08 17:43:28.211708-5
 9 |         5 | 1.5     | 2010-05-09 14:02:43.818646  | 2010-05-08 17:43:28.211708-5.2010-05-09 14:02:43.818646-9
 7 |         1 | 1       | 2010-05-08 18:18:11.849735  | 2010-05-08 18:18:11.849735-7

A couple of things to note here:

sortpath will be sorted as a string, so it is important all dates have the same length for it to correctly sort. E.g., observe how 2010-05-08 17:38:57.26743 has an extra zero added in the sortpath column.
I have appended the PK of each node to the end of its date. This is so that if you happen to have two rows with the exact same date, they will always get returned in the same order due to the additional data we are appending.
To me, the data looks asymmetrical the way I have written it, because we are showing the current node's date in sortpath, but it is not in matpath. I would prefer to see it in both.
You may want to put the date of node ID 1 at the beginning of each sortcolumn as well. This is so that if you ever want to query for more than one forum at a time (you probably won't), then it will still sort correctly.

回复收藏 0 原文

我三岁 2024-09-07 11:27:28

我不确定我是否理解为什么接受的解决方案有任何意义。它可以工作，但它比 @Unreason 的解决方案（仅在物化路径中填充 ID）标准化程度更低，效率更低（更多磁盘空间，更多索引等）。

OP 面临的整个场景似乎源于这样一个事实：正如@Unreason 正确指出的那样，物化路径（MP）的实现是不正确的。 OP 已将 MP 提供给父节点，而不是当前节点。在接受的解决方案中，SortPath 列通过提供当前节点的具体化路径来纠正此问题（这次使用日期 - 为什么？ - 而不是主键）。

作为参考，请考虑以下摘录：

物化路径

在这种方法中，每个记录都存储到根的整个路径。在我们的
在前面的示例中，我们假设 KING 是根节点。然后，
ename = 'SCOTT' 的记录通过路径连接到根
斯科特->琼斯->金。现代数据库允许表示一系列
节点作为单个值，但由于物化路径已
在那之前很久就发明了，惯例坚持简单的字符
用一些分隔符连接的节点字符串；最常见的是“.”或者
'/'。

回复收藏 0 原文

错爱 2024-09-07 11:27:28

虽然@Unreason关于填充的答案有效，但我想提供另一种解决方案，我相信这是我自己针对这个问题的发明。

您正在寻找创建字节流的函数，一个 f(x)=b_1b_2..b_i （抱歉，SO 上没有 MatJax），其中 b_i 是一个字节。我们知道两个字节流与第一个不同的字节进行比较。我们想要这样一个函数：f(x)。

用 0 填充到相同的长度肯定可以轻松实现此目标。您取两个数字，查看第一个非零字节，就可以了。

Steven Wittens (acko.net) 大约八年前向 Drupal 核心引入了一个不同的技巧：将字符串前面的数字作为另一个数字。因此，数字 97685 变为字符 5 9 7 6 8 5。这也是可行的：先看长度字节，如果不相同那么越大的肯定越大。除此之外，您知道这两个数字的长度相等。他还使用 36 进制数字，其中 0-9a-z 为数字，就像每个字母都使用十六进制一样。此编码需要两个字节用于前 36 个节点，三个字节用于接下来的 1260 个节点...

请注意，填充和这种狡猾的可变长度编码都不需要物化路径的分隔符，尽管为了可读性通常包含它们。

numconv 支持 base85 编码，但需要区分大小写的排序规则。有 94 个 ASCII 字符，如果删除小写字母，您仍然拥有 Base68。

但是，如果您使用“二进制”字段，那么您可以执行 base256：而不是巧妙的编码，只需将数字写为一系列字节，然后将字节流的长度作为单个字节作为整个内容的前缀。这将允许您对任何小于 2^2048 左右的树进行编码。对于前 256 个节点，您使用两个字节，对于接下来的 65280 个节点，您使用三个字节。这已经是相当高效了。

我提名 utf8encode(x) 函数。考虑一下！您需要进行位排序而不是字节排序，但这不会改变结果：找到最左边的零位。如果另一个字符串有 1，那么它将是一个更长的 UTF-8 编码，所以肯定会更大。如果它们的第一个零位于同一位置，那么我们就有相同长度的位串，这对我们来说比较好。

这很好，但是分隔符呢？当 UTF-8 算法纯粹被视为创建比特流的算法时，它可以处理 31 位数字，因此它适用于包含少于 20 亿个节点的树。您的具体化路径将是 UTF-8 编码数字的比特流，其比较良好：丢弃最左边相同的 UTF-8 编码数字，我们回到上一段。 QED

因为我们不需要分隔符或前缀字节，所以我们可以将前 128 个节点编码为一个字节，然后将接下来的 1920 个节点编码为两个字节，最多 65535 个为三个字节。对于四个字节，base256 将获胜。对于真正的大树，您可以将 UTF-8 视为 0-2147483647 的编码器到字节流中。所以你可以用它作为base2147483647编码：D

总结一下：UTF-8最适合小树，在20亿节点以下并不比base256差多少。除此之外，直到 2^2048 左右 prefixed-with-length-base256 获胜。除了 prefixed-with-length-base2147483647 获胜之外，就没有什么了。

While @Unreason's answer about padding works, I would like to offer another solution which I believe is my own invention to this problem.

You are looking for function creating a bytestream, an f(x)=b_1b_2..b_i (sorry no MatJax on SO) where b_i is a byte. We know two bytestream compares the same as the first differing byte. We want such a function that f(x)<f(y) iff x<y.

Padding to the same length with 0 definitely obtains this goal, easily. You take two numbers, look at the first nonzero byte and there you are.

Steven Wittens (acko.net) introduced a different trick to Drupal core some eight years ago: put the number of digits in front of the string as another digit. So, the number 97685 becomes the characters 5 9 7 6 8 5. This also works: look at the length byte first, if they are not the same then the larger will definitely be the larger. Beyond that, you know the two numbers are equal length. He also used base 36 numbers with 0-9a-z being the digits, much like hex just for every letter. This encoding needs two bytes for the first 36 nodes, three for the next 1260...

Note that neither padding nor this cunning variable length encoding need separators for the materialized path although for readability they are often included.

numconv supports a base85 encoding but that requires a case sensitive collation. There are 94 ASCII characters if you remove lower case letters you still have base68.

But if you use a 'binary' field then you can do base256: instead of a cunning encoding just write the number as a series of bytes and then prefix the whole thing with the length of the bytestream as a single byte. This will allow you to encode any tree smaller than 2^2048 or so. For the first 256 nodes, you are using two bytes, for the next 65280 you are looking at three bytes. This is already quite efficient.

I nominate the utf8encode(x) function. Consider that! You need to descend into bitsorting instead of bytesorting but that doesn't change the outcome: find the leftmost zero bit. If the other string has a 1 there, it'll be a longer UTF-8 encoding so definitely that's larger. If they have the first zero at the same place then we have same length bit strings which compare well for us.

That's nice but what about separators. The UTF-8 algorithm when looking at it as purely an algorithm creating bitstreams can handle 31 bit numbers -- so it'll work for trees containing less than two billion nodes. Your materialized path will be a bitstream of UTF-8 encoded numbers which compare well: Discard the leftmost identical UTF-8 encoded numbers and we are back at the previous paragraph. Q.E.D.

Because we don't need separators or prefix bytes, we can encode the first 128 nodes into a single byte then the next 1920 into two bytes, and up to 65535, three bytes. For four bytes, base256 will win. For really big trees, you can treat UTF-8 as an encoder of 0-2147483647 into a byte stream. So you can use it as base2147483647 encoding :D

To summarize: UTF-8 is best for small trees and not much worse than base256 below two billion nodes. Beyond that until 2^2048 or so prefixed-with-length-base256 wins. Beyond that prefixed-with-length-base2147483647 wins and there's nothing beyond that.

回复收藏 0 原文

情定在深秋 2024-09-07 11:27:28

我想不出一种简单的方法可以直接使用 SQL 来完成此操作。我将在这里使用node_path，而不是matpath。 node_path 是 matpath||'.'||id

 id | parent_id | node_path |          created           
----+-----------+---------+----------------------------
  2 |         1 | 1.2       | 2010-05-08 15:18:37.987544
  3 |         1 | 1.3       | 2010-05-08 17:38:14.125377
  4 |         1 | 1.4       | 2010-05-08 17:38:57.26743
  5 |         1 | 1.5       | 2010-05-08 17:43:28.211708
  7 |         1 | 1.7       | 2010-05-08 18:18:11.849735
  6 |         2 | 1.2.6     | 2010-05-08 17:50:43.288759
  9 |         5 | 1.5.9     | 2010-05-09 14:02:43.818646
  8 |         6 | 1.2.6.8   | 2010-05-09 14:01:17.632695

现在您想要根据 node_path 对树进行排序，排序字段由运行排序的次数定义。

plpgsql 中的自定义递归函数对 split_part(node_path, '.', recursion_depth) 进行排序将起作用。您必须检查 split_part 中的 NULL 值（并忽略它们）。

I can't think of a simple way to do this in straight SQL. Rather than matpath, I will use node_path here. node_path is matpath||'.'||id

 id | parent_id | node_path |          created           
----+-----------+---------+----------------------------
  2 |         1 | 1.2       | 2010-05-08 15:18:37.987544
  3 |         1 | 1.3       | 2010-05-08 17:38:14.125377
  4 |         1 | 1.4       | 2010-05-08 17:38:57.26743
  5 |         1 | 1.5       | 2010-05-08 17:43:28.211708
  7 |         1 | 1.7       | 2010-05-08 18:18:11.849735
  6 |         2 | 1.2.6     | 2010-05-08 17:50:43.288759
  9 |         5 | 1.5.9     | 2010-05-09 14:02:43.818646
  8 |         6 | 1.2.6.8   | 2010-05-09 14:01:17.632695

Now you want to order the tree based on node_path, with the sorting field defined by the number of times you have run the sort.

A custom recursive function in plpgsql sorting on split_part(node_path, '.', recursion_depth) will work. You will have to check for NULL values from split_part (and ignore those).

回复收藏 0 原文

~没有更多了~