用物化路径对树进行排序?
我的表中有一个树结构,它使用物化路径让我可以快速找到孩子。但是,我还需要对结果进行深度优先排序,正如人们对线程论坛回复所期望的那样。
id | parent_id | matpath | created
----+-----------+---------+----------------------------
2 | 1 | 1 | 2010-05-08 15:18:37.987544
3 | 1 | 1 | 2010-05-08 17:38:14.125377
4 | 1 | 1 | 2010-05-08 17:38:57.26743
5 | 1 | 1 | 2010-05-08 17:43:28.211708
7 | 1 | 1 | 2010-05-08 18:18:11.849735
6 | 2 | 1.2 | 2010-05-08 17:50:43.288759
9 | 5 | 1.5 | 2010-05-09 14:02:43.818646
8 | 6 | 1.2.6 | 2010-05-09 14:01:17.632695
所以最终结果实际上应该这样排序:
id | parent_id | matpath | created
----+-----------+---------+----------------------------
2 | 1 | 1 | 2010-05-08 15:18:37.987544
6 | 2 | 1.2 | 2010-05-08 17:50:43.288759
8 | 6 | 1.2.6 | 2010-05-09 14:01:17.632695
3 | 1 | 1 | 2010-05-08 17:38:14.125377
4 | 1 | 1 | 2010-05-08 17:38:57.26743
5 | 1 | 1 | 2010-05-08 17:43:28.211708
9 | 5 | 1.5 | 2010-05-09 14:02:43.818646
7 | 1 | 1 | 2010-05-08 18:18:11.849735
我将如何解决这个问题?我可以直接使用 SQL(这是 PostgreSQL 8.4)执行此操作吗?还是应该向此表中添加其他信息?
更新:尝试更好地解释排序标准。
想象一下,id“1”是论坛的根帖子,并且“matpath”以“1”开头的所有内容都是该帖子的子帖子。因此 ids 2 到 5 是对 1 的直接回复,并获得“1”的 matpath。但是,id 6 是回复 2,而不是直接回复 1,因此它的 matpath 为 1.2。这意味着,对于具有适当嵌套的线程论坛,表中显示所有 id,论坛的结构将如下所示,因此排序要求是:
* id 1 (root post)
* id 2
* id 6
* id 8
* id 3
* id 4
* id 5
* id 9
* id 7
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
我相信你的物质化道路是不正确的。
你有什么逻辑可以对这样的事情进行排序
为什么第二个 1 不与第一个 1 在一起?
如果你有
这将是微不足道的。
编辑:
我查看了您的示例,您没有存储行的物化路径,而是存储父行的物化路径。
下面是该行的具体化路径的实际外观。如果您将其存储为不超过 9 个分支,则直接在 matpath 上排序是可行的:
否则(> 9)您必须将
matpath
转换为类似的内容,最多支持 999分支机构。
请注意
0001.0002.0006
也会给您一个比接受的答案中更短的字段,I believe your materialized path is not right.
What logic do you get to sort things like this
Why is the second 1 not together with the first one?
If you had
This would be trivial.
EDIT:
I have looked at your example and you are not storing materialized path of a row, but you are storing a materialized path of the parent row.
Here's how the materialized path of the row actually should look like. Sorting directly on matpath would work if you would not have more than 9 branches if you stored it as:
otherwise (>9) you would have to turn the
matpath
into something likethat would support up to 999 branches.
Please note
0001.0002.0006
would give you a field that is shorter then in the accepted answer我通常为此创建一个附加列,称为
SortPath
。它将包含您需要排序并连接在一起的数据。该列的类型为varchar
,并作为字符串进行排序。像这样的事情:这里需要注意一些事情:
sortpath
将作为字符串进行排序,因此所有日期都具有相同的长度才能正确排序,这一点很重要。例如,观察2010-05-08 17:38:57.26743
如何在sortpath
列中添加一个额外的零。sortpath
中显示当前节点的日期,但它不在matpath
中。我更愿意在两者中看到它。sortcolumn
的开头。这样一来,如果您想一次查询多个论坛(您可能不会),那么它仍然会正确排序。I typically create an additional columnn for this, called something like
SortPath
. It would contain the data that you need to sort by, concatenated together. That column would be of typevarchar
, and get sorted as a string. Something like this:A couple of things to note here:
sortpath
will be sorted as a string, so it is important all dates have the same length for it to correctly sort. E.g., observe how2010-05-08 17:38:57.26743
has an extra zero added in thesortpath
column.sortpath
, but it is not inmatpath
. I would prefer to see it in both.sortcolumn
as well. This is so that if you ever want to query for more than one forum at a time (you probably won't), then it will still sort correctly.我不确定我是否理解为什么接受的解决方案有任何意义。它可以工作,但它比 @Unreason 的解决方案(仅在物化路径中填充 ID)标准化程度更低,效率更低(更多磁盘空间,更多索引等)。
OP 面临的整个场景似乎源于这样一个事实:正如@Unreason 正确指出的那样,物化路径(MP)的实现是不正确的。 OP 已将 MP 提供给父节点,而不是当前节点。在接受的解决方案中,
SortPath
列通过提供当前节点的具体化路径来纠正此问题(这次使用日期 - 为什么? - 而不是主键)。作为参考,请考虑以下摘录:
I'm not sure I understand why the accepted solution makes any sense at all. It works, but it is even less normalized and less efficient (more disk space, more indexes, etc) than @Unreason's solution (to just pad the ID's in the materialized path).
The whole scenario that the OP faces seems to stem from the fact that, as @Unreason correctly points out, the implementation of the materialized path (MP) is incorrect. The OP has provided the MP to the parent, not to the current node. In the accepted solution the
SortPath
column corrects this by providing the materialized path to the current node (this time using dates -- why? -- instead of the primary key).For reference consider the following excerpt:
虽然@Unreason关于填充的答案有效,但我想提供另一种解决方案,我相信这是我自己针对这个问题的发明。
您正在寻找创建字节流的函数,一个
f(x)=b_1b_2..b_i
(抱歉,SO 上没有 MatJax),其中b_i
是一个字节。我们知道两个字节流与第一个不同的字节进行比较。我们想要这样一个函数:f(x)。
用 0 填充到相同的长度肯定可以轻松实现此目标。您取两个数字,查看第一个非零字节,就可以了。
Steven Wittens (acko.net) 大约八年前向 Drupal 核心引入了一个不同的技巧:将字符串前面的数字作为另一个数字。因此,数字 97685 变为字符
5 9 7 6 8 5
。这也是可行的:先看长度字节,如果不相同那么越大的肯定越大。除此之外,您知道这两个数字的长度相等。他还使用 36 进制数字,其中 0-9a-z 为数字,就像每个字母都使用十六进制一样。此编码需要两个字节用于前 36 个节点,三个字节用于接下来的 1260 个节点...请注意,填充和这种狡猾的可变长度编码都不需要物化路径的分隔符,尽管为了可读性通常包含它们。
numconv 支持 base85 编码,但需要区分大小写的排序规则。有 94 个 ASCII 字符,如果删除小写字母,您仍然拥有 Base68。
但是,如果您使用“二进制”字段,那么您可以执行 base256:而不是巧妙的编码,只需将数字写为一系列字节,然后将字节流的长度作为单个字节作为整个内容的前缀。这将允许您对任何小于 2^2048 左右的树进行编码。对于前 256 个节点,您使用两个字节,对于接下来的 65280 个节点,您使用三个字节。这已经是相当高效了。
我提名
utf8encode(x)
函数。考虑一下!您需要进行位排序而不是字节排序,但这不会改变结果:找到最左边的零位。如果另一个字符串有 1,那么它将是一个更长的 UTF-8 编码,所以肯定会更大。如果它们的第一个零位于同一位置,那么我们就有相同长度的位串,这对我们来说比较好。这很好,但是分隔符呢?当 UTF-8 算法纯粹被视为创建比特流的算法时,它可以处理 31 位数字,因此它适用于包含少于 20 亿个节点的树。您的具体化路径将是 UTF-8 编码数字的比特流,其比较良好:丢弃最左边相同的 UTF-8 编码数字,我们回到上一段。 QED
因为我们不需要分隔符或前缀字节,所以我们可以将前 128 个节点编码为一个字节,然后将接下来的 1920 个节点编码为两个字节,最多 65535 个为三个字节。对于四个字节,base256 将获胜。对于真正的大树,您可以将 UTF-8 视为 0-2147483647 的编码器到字节流中。所以你可以用它作为base2147483647编码:D
总结一下:UTF-8最适合小树,在20亿节点以下并不比base256差多少。除此之外,直到 2^2048 左右 prefixed-with-length-base256 获胜。除了 prefixed-with-length-base2147483647 获胜之外,就没有什么了。
While @Unreason's answer about padding works, I would like to offer another solution which I believe is my own invention to this problem.
You are looking for function creating a bytestream, an
f(x)=b_1b_2..b_i
(sorry no MatJax on SO) whereb_i
is a byte. We know two bytestream compares the same as the first differing byte. We want such a function thatf(x)<f(y) iff x<y
.Padding to the same length with 0 definitely obtains this goal, easily. You take two numbers, look at the first nonzero byte and there you are.
Steven Wittens (acko.net) introduced a different trick to Drupal core some eight years ago: put the number of digits in front of the string as another digit. So, the number 97685 becomes the characters
5 9 7 6 8 5
. This also works: look at the length byte first, if they are not the same then the larger will definitely be the larger. Beyond that, you know the two numbers are equal length. He also used base 36 numbers with 0-9a-z being the digits, much like hex just for every letter. This encoding needs two bytes for the first 36 nodes, three for the next 1260...Note that neither padding nor this cunning variable length encoding need separators for the materialized path although for readability they are often included.
numconv supports a base85 encoding but that requires a case sensitive collation. There are 94 ASCII characters if you remove lower case letters you still have base68.
But if you use a 'binary' field then you can do base256: instead of a cunning encoding just write the number as a series of bytes and then prefix the whole thing with the length of the bytestream as a single byte. This will allow you to encode any tree smaller than 2^2048 or so. For the first 256 nodes, you are using two bytes, for the next 65280 you are looking at three bytes. This is already quite efficient.
I nominate the
utf8encode(x)
function. Consider that! You need to descend into bitsorting instead of bytesorting but that doesn't change the outcome: find the leftmost zero bit. If the other string has a 1 there, it'll be a longer UTF-8 encoding so definitely that's larger. If they have the first zero at the same place then we have same length bit strings which compare well for us.That's nice but what about separators. The UTF-8 algorithm when looking at it as purely an algorithm creating bitstreams can handle 31 bit numbers -- so it'll work for trees containing less than two billion nodes. Your materialized path will be a bitstream of UTF-8 encoded numbers which compare well: Discard the leftmost identical UTF-8 encoded numbers and we are back at the previous paragraph. Q.E.D.
Because we don't need separators or prefix bytes, we can encode the first 128 nodes into a single byte then the next 1920 into two bytes, and up to 65535, three bytes. For four bytes, base256 will win. For really big trees, you can treat UTF-8 as an encoder of 0-2147483647 into a byte stream. So you can use it as base2147483647 encoding :D
To summarize: UTF-8 is best for small trees and not much worse than base256 below two billion nodes. Beyond that until 2^2048 or so prefixed-with-length-base256 wins. Beyond that prefixed-with-length-base2147483647 wins and there's nothing beyond that.
我想不出一种简单的方法可以直接使用 SQL 来完成此操作。我将在这里使用node_path,而不是matpath。 node_path 是 matpath||'.'||id
现在您想要根据 node_path 对树进行排序,排序字段由运行排序的次数定义。
plpgsql 中的自定义递归函数对 split_part(node_path, '.', recursion_depth) 进行排序将起作用。您必须检查 split_part 中的 NULL 值(并忽略它们)。
I can't think of a simple way to do this in straight SQL. Rather than matpath, I will use node_path here. node_path is matpath||'.'||id
Now you want to order the tree based on node_path, with the sorting field defined by the number of times you have run the sort.
A custom recursive function in plpgsql sorting on split_part(node_path, '.', recursion_depth) will work. You will have to check for NULL values from split_part (and ignore those).