如何在 SQL Server 中有效地合并两个层次结构?

发布于 2024-11-29 12:39:52 字数 1966 浏览 1 评论 0 原文

我有两个带有 Hierarchyid 字段的表,其中一个是临时表,其中包含需要合并到另一个中的新数据(即,需要添加到主树中的一组节点,其中一些可能已经是那里)。

除了定义树结构(父/子关系)的hierarchyid 列之外。每个表都有一个单独的列,其中包含唯一标识每个节点的节点标识符。也就是说,判断暂存表中的节点是否已在主表中的方法是通过节点 ID,而不是通过 hierarchyid 列。

必须执行的处理看起来像这样:

For each row, RS, in the staging table:
    If there is not already a row with the same Id as RS in the main table:
         Find the parent, PS, of the staging row
         Find the row, PM, in the main table that has the same node ID as PS
         Create a new child, RM of row PM
         Set PM's ID equal to the ID of RS

重要的是,只有当临时表中的树按广度优先顺序排序/遍历时,这种方法才有效 - 这样当遇到 RS 时,它是保证其父 PS 在主表中已经有相应的行。

到目前为止,我能看到在 SQL Server 中实现此目的的唯一方法是在临时表(已排序)上使用游标,并为每一行调用一个存储过程,该存储过程基本上完全执行上述操作,并使用 SELECT 完成MAX() 查找已作为 PM 的子级存在的最高层级 ID,以便可以唯一地添加该子级。

不过,这是一种效率极低的方法,而且对于我的目的来说太慢了。还有更好的办法吗?

作为背景,这是我正在做的可行性检查。我需要弄清楚是否可以在 SQL Server 中快速执行此操作。如果事实证明我不能,我将不得不在数据库之外以另一种方式进行。树的合并是问题域所固有的(实际上,在某种意义上),因此以不同的方式构建数据或采取更广泛的视角并试图以某种方式完全避免执行此操作是不可行的一个选项。

更新

根据要求,这是一个具体示例。

表“staging”和“main”都具有相同的两列:

   hierarchy_id of type hierarchyid
   node_id of type bigint

初始内容

main:

 hierarchy_id    node_id
 /1/             1
 /1/1/           2
 /1/2/           3
 /1/3/           4

staging:

 hierarchy_id    node_id
 /1/             1
 /1/1/           3
 /1/2/           5
 /1/1/1/         6

所需内容

main:

 hierarchy_id    node_id
 /1/             1
 /1/1/           2
 /1/2/           3
 /1/3/           4
 /1/4/           5
 /1/2/1/         6

请注意,staging 表中的节点为 Hierarchy_id / 1/1/ 与目标表中的 hiearchy_id /1/2/ 相对应(这就是为什么 node_id 很重要 - 不能只复制 hierarchy_id 值)。另请注意,node_id 6 的新节点将作为正确父节点(node_id 3 的节点)的子节点添加,这就是 hierarchy_id 很重要的原因 - 它定义任何新节点的树结构(父/子关系)。任何解决方案都需要考虑这两方面。

I have two tables with hierarchyid fields, one of which is a staging table with new data that needs to be merged into the other (that is, a set of nodes that need to be added to the main tree, some of which might already be there).

In addition to the hierarchyid column that defines the tree structure (parent/child relationships). each table has a separate column which contains a node identifier that uniquely identifies each node. That is, the way to tell if a node from the staging table is already in the main table is via the node ID, not via the hierarchyid columns.

Imperatively, the processing that needs to be performed would look something like this:

For each row, RS, in the staging table:
    If there is not already a row with the same Id as RS in the main table:
         Find the parent, PS, of the staging row
         Find the row, PM, in the main table that has the same node ID as PS
         Create a new child, RM of row PM
         Set PM's ID equal to the ID of RS

Importantly, this approach will only work if the tree in the staging table is sorted/traversed in breadth-first order - this is so that when RS is encountered, it is guaranteed that its parent PS already has a corresponding row in the main table.

So far the only way I can see to achieve this in SQL server is to use a cursor over the staging table (which is already sorted) and call a stored procedure for each row that essentially does exactly what is described above, complete with a SELECT MAX() to find the highest hierarchyid that already exists as a child of PM, so that the child can be added uniquely.

This is an awfully inefficient approach, though, and waaaay too slow for my purposes. Is there any better way?

For background, this is kind of a feasibility check I'm doing. I need to figure out whether I can perform this operation quickly inside SQL Server. If it turns out I can't I will have to do it another way, outside the database. The merging of the trees is inherent to (actually, in a sense kind of is) the problem domain, so structuring the data differently or taking a wider view and trying to somehow avoid performing this operation altogether is not an option.

Update

As requested, here's a concrete example.

Tables "staging" and "main" both have the same two columns:

   hierarchy_id of type hierarchyid
   node_id of type bigint

Initial contents

main:

 hierarchy_id    node_id
 /1/             1
 /1/1/           2
 /1/2/           3
 /1/3/           4

staging:

 hierarchy_id    node_id
 /1/             1
 /1/1/           3
 /1/2/           5
 /1/1/1/         6

Desired contents

main:

 hierarchy_id    node_id
 /1/             1
 /1/1/           2
 /1/2/           3
 /1/3/           4
 /1/4/           5
 /1/2/1/         6

Note that the node in the staging table with hierarchy_id /1/1/ corresponds to that with hiearchy_id /1/2/ in the target table (which is why the node_id is important - can't just copy hierarchy_id values across). Also note that the new node with node_id 6 is added as a child of the correct parent, the one with node_id 3, which is why the hierarchy_id is important - it defines the tree structure (parent/child relationships) for any new nodes. Any solution needs to take account of both aspects.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

娇俏 2024-12-06 12:39:52

以这种方式建模层次结构会导致问题。 hierarchy_id 列违反了第一范式,如果您不进行序列化/瓶颈访问,合并过程将很容易出现更新异常。

您应该考虑一个只有node_id和parent_id的表,看看它如何简化您的合并问题

node_id   parent_id
1         NULL
2         1
3         2
4         3

node_id   parent_id
1         NULL
3         1
5         2
6         1

您可以对此使用递归查询,您可能会惊讶于执行计划的效率有多高。如果您必须拥有扁平化层次结构列,您可能可以使用递归查询创建索引视图。

Modeling your hierarchy in this way will lead to problems. The hierarchy_id column violates first normal form and the process for merging is going to be prone to update anomalies if you don't serialize/bottleneck access.

You should consider a table with just node_id and parent_id, see how it trivializes your merge problem

node_id   parent_id
1         NULL
2         1
3         2
4         3

node_id   parent_id
1         NULL
3         1
5         2
6         1

You would use recursive queries with this and you might be surprised how efficient the execution plans turn out. If you must have the flattened hierarchy column you can probably create an indexed view with a recursive query.

与风相奔跑 2024-12-06 12:39:52

我们一直在开发需要类似解决方案的产品。经过对此方法和其他方法的大量研究后,我们得出结论:hierarchyID 方法不适合我们。

因此,作为对您问题的直接回答:使用这种方法没有更好的方法来做到这一点。

查看嵌套集模型邻接列表模型

对于这一特定的设计挑战,这两种方法都是更加优雅和高效的解决方案。

编辑:
事后想想,如果您没有使用 SQL,那么使用非关系数据库可以更好地解决这个问题。
我们不能走这条路,因为没有人在设计非关系数据库方面拥有足够的专业知识,但如果 SQL 是可选的,那么您可以在 MongoDB 中以更好、更高效的方式使用当前的方法。

We have been working on a product that required a similar solution. After a lot of research on this and other approaches, we concluded that hierarchyID method is not for us.

So as a direct answer to your question: There is no better way to do this using this approach.

Take a look at Nested Set Models and at Adjacency List Model.

Both of these are far more elegant and efficient solutions to this particular design challenge.

Edit:
As an after-thought, in-case you are not married to SQL - this problem can be much better solved using a non-relational database.
We could not go that way since no one has enough expertise in designing non-relational databases, but in case SQL is optional then you can use your current approach in a much nicer, more efficient way in MongoDB for example.

梦忆晨望 2024-12-06 12:39:52

下面是一种将行从源 @S 一次移动一级到目标 @T 的解决方案。为了简化一点,我添加了一个根节点,以便始终有一个父节点存在,在创建新的 HierarcyID 时使用该父节点。

我从未使用过 HierarchyID,因此可能绝对有更有效的方法来执行此操作,但它至少应该比一次一行执行更有效。

-- Target table
declare @T table 
(
  hierarchy_id hierarchyid primary key,
  node_id bigint
)

insert into @T values
('/',             0), -- Needed for simplicity
('/1/',           1),
('/1/1/',         2),
('/1/2/',         3),
('/1/3/',         4)

-- Source table
declare @S table 
(
  hierarchy_id hierarchyid primary key,
  node_id bigint
)

insert into @S values
('/',               0),
('/1/',             1),
('/1/1/',           3),
('/1/2/',           5),
('/1/1/1/',         6)

declare @lvl int = 1

-- Move rows from @S to @T for each level
while exists(select *
             from @S
             where hierarchy_id.GetLevel() = @lvl)
begin

  insert into @T
  select T.hierarchy_id.GetDescendant(C.MaxID, null),
         S.node_id
  from (select S1.node_id, S2.node_id as ParentID
        from @S as S1
          inner join @S as S2
            on S1.hierarchy_id.GetAncestor(1) = S2.hierarchy_id
        where S1.hierarchy_id.GetLevel() = @lvl and
              S1.node_id not in (select node_id
                                 from @T)
       ) as S
    inner join @T as T
      on S.ParentID = T.node_id
    outer apply (select max(hierarchy_id) as MaxID
                 from @T as T2
                 where T.hierarchy_id = T2.hierarchy_id.GetAncestor(1)) as C       

    set @lvl = @lvl + 1
end

select *, hierarchy_id.ToString()
from @T
where hierarchy_id <> hierarchyid::GetRoot()  

结果:

hierarchy_id  node_id  (No column name)
------------  -------  ----------------
0x58          1        /1/
0x5AC0        2        /1/1/
0x5B40        3        /1/2/
0x5B56        6        /1/2/1/
0x5BC0        4        /1/3/
0x5C20        5        /1/4/

Here is a solution that moves the rows from source @S to target @T one level at a time. To simplify a bit I added a root node just to always have a parent present that is used when creating the new HierarcyID.

I have never used HierarchyID so there might absolutely be more efficient ways to do this but it should at least be more efficient than doing it one row at a time.

-- Target table
declare @T table 
(
  hierarchy_id hierarchyid primary key,
  node_id bigint
)

insert into @T values
('/',             0), -- Needed for simplicity
('/1/',           1),
('/1/1/',         2),
('/1/2/',         3),
('/1/3/',         4)

-- Source table
declare @S table 
(
  hierarchy_id hierarchyid primary key,
  node_id bigint
)

insert into @S values
('/',               0),
('/1/',             1),
('/1/1/',           3),
('/1/2/',           5),
('/1/1/1/',         6)

declare @lvl int = 1

-- Move rows from @S to @T for each level
while exists(select *
             from @S
             where hierarchy_id.GetLevel() = @lvl)
begin

  insert into @T
  select T.hierarchy_id.GetDescendant(C.MaxID, null),
         S.node_id
  from (select S1.node_id, S2.node_id as ParentID
        from @S as S1
          inner join @S as S2
            on S1.hierarchy_id.GetAncestor(1) = S2.hierarchy_id
        where S1.hierarchy_id.GetLevel() = @lvl and
              S1.node_id not in (select node_id
                                 from @T)
       ) as S
    inner join @T as T
      on S.ParentID = T.node_id
    outer apply (select max(hierarchy_id) as MaxID
                 from @T as T2
                 where T.hierarchy_id = T2.hierarchy_id.GetAncestor(1)) as C       

    set @lvl = @lvl + 1
end

select *, hierarchy_id.ToString()
from @T
where hierarchy_id <> hierarchyid::GetRoot()  

Result:

hierarchy_id  node_id  (No column name)
------------  -------  ----------------
0x58          1        /1/
0x5AC0        2        /1/1/
0x5B40        3        /1/2/
0x5B56        6        /1/2/1/
0x5BC0        4        /1/3/
0x5C20        5        /1/4/
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文