分析大图 - 检索簇并计算最强路径

发布于 2024-10-01 07:57:15 字数 4582 浏览 6 评论 0原文

我尝试使用广度优先算法通过以下方式检索所选用户所属的整个连接集群:

CREATE PROCEDURE Breadth_First (@StartNode varchar(50), @LinkStrength decimal(10,7) = 0.1, @EndNode varchar(50) = NULL)
    AS
    BEGIN
        -- Automatically rollback the transaction if something goes wrong.   
        SET XACT_ABORT ON   
        BEGIN TRAN

        -- Increase performance and do not intefere with the results.
        SET NOCOUNT ON;

        -- Create a temporary table for storing the discovered nodes as the algorithm runs
        CREATE TABLE #Discovered
        (
             DiscoveredUser varchar(50) NOT NULL,    -- The Node Id
            Predecessor varchar(50) NULL,    -- The node we came from to get to this node.
            LinkStrength decimal(10,7) NULL, -- positive - from predecessor to  DiscoveredUser, negative - from  DiscoveredUser to predecessor
            OrderDiscovered int -- The order in which the nodes were discovered.
        )

        -- Initially, only the start node is discovered.
        INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
        VALUES (@StartNode, NULL, NULL, 0)

        -- Add all nodes that we can get to from the current set of nodes,
        -- that are not already discovered. Run until no more nodes are discovered.
        WHILE @@ROWCOUNT > 0
        BEGIN
            -- If we have found the node we were looking for, abort now.
            IF @EndNode IS NOT NULL
                IF EXISTS (SELECT TOP 1 1 FROM #Discovered WHERE  DiscoveredUser = @EndNode)
                    BREAK   

            -- We need to group by ToNode and select one FromNode since multiple
            -- edges could lead us to new same node, and we only want to insert it once.
            INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
            (SELECT mc.called_party, mc.calling_party, mc.link_strength, d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.calling_party
            WHERE mc.called_party NOT IN (SELECT  DiscoveredUser From #Discovered) AND mc.link_strength > @LinkStrength
            UNION
            SELECT mc.calling_party, mc.called_party, mc.link_strength * (-1), d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.called_party
            WHERE mc.calling_party NOT IN (SELECT  DiscoveredUser FROM #Discovered) AND mc.link_strength > @LinkStrength
            )
        END;

        -- Select the results. We use a recursive common table expression to
        -- get the full path from the start node to the current node.
        WITH BacktraceCTE(Predecessor,  DiscoveredUser, LinkStrength, OrderDiscovered, Path)
        AS
        (
            -- Anchor/base member of the recursion, this selects the start node.
            SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered, 
                CAST(n. DiscoveredUser AS varchar(MAX))
            FROM #Discovered d JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
            WHERE d. DiscoveredUser = @StartNode

            UNION ALL

            -- Recursive member, select all the nodes which have the previous
            -- one as their predecessor. Concat the paths together.
            SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
                CAST(cte.Path + ',' + CAST(n. DiscoveredUser as varchar(30)) as varchar(MAX))
            FROM #Discovered d JOIN BacktraceCTE cte ON d.Predecessor = cte. DiscoveredUser
            JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
        )

        SELECT Predecessor,  DiscoveredUser, LinkStrength, OrderDiscovered, Path FROM BacktraceCTE
        WHERE  DiscoveredUser = @EndNode OR @EndNode IS NULL -- This kind of where clause can potentially produce
        ORDER BY OrderDiscovered                -- a bad execution plan, but I use it for simplicity here.

        DROP TABLE #Discovered
        COMMIT TRAN
        RETURN 0
    END

我当前正在分析的图(社交网络)有 28M 连接,并且不忽略弱连接(使用 @LinkStrength 设置阈值) )执行运行了很长时间(到目前为止我还没有得到任何结果,并将尝试让它运行过夜)。

不管怎样,下一步是计算两个用户(大约有 300 万用户)之间的最短(最强)链接。我正在考虑使用 Djikstra 算法,但不确定是否可以在我当前使用的 PC(Quad CPU 2.66 GHz,4GB RAM)上分析此类网络,并且数据存储在 MS SQL Server 2008 数据库中。

总而言之,我希望获得以下问题的一些答案/建议:

  1. 是否可以执行 与上面的查询一样复杂的查询 描述图(28M 连接,3M 用户)在所描述的机器上 (2.66 GHz,4GB 内存)?
  2. 如果不可能的话有没有什么办法 其他可能的方法 执行时间可以减少 (例如创建部分表 结果)?
  3. 你有推荐其他的吗 检测簇的算法 并计算最短路径 所描述的图表?

谢谢你!

I tried to use Breadth-First algorithm to retrieve the whole cluster of connections to which the selected user belongs the following way:

CREATE PROCEDURE Breadth_First (@StartNode varchar(50), @LinkStrength decimal(10,7) = 0.1, @EndNode varchar(50) = NULL)
    AS
    BEGIN
        -- Automatically rollback the transaction if something goes wrong.   
        SET XACT_ABORT ON   
        BEGIN TRAN

        -- Increase performance and do not intefere with the results.
        SET NOCOUNT ON;

        -- Create a temporary table for storing the discovered nodes as the algorithm runs
        CREATE TABLE #Discovered
        (
             DiscoveredUser varchar(50) NOT NULL,    -- The Node Id
            Predecessor varchar(50) NULL,    -- The node we came from to get to this node.
            LinkStrength decimal(10,7) NULL, -- positive - from predecessor to  DiscoveredUser, negative - from  DiscoveredUser to predecessor
            OrderDiscovered int -- The order in which the nodes were discovered.
        )

        -- Initially, only the start node is discovered.
        INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
        VALUES (@StartNode, NULL, NULL, 0)

        -- Add all nodes that we can get to from the current set of nodes,
        -- that are not already discovered. Run until no more nodes are discovered.
        WHILE @@ROWCOUNT > 0
        BEGIN
            -- If we have found the node we were looking for, abort now.
            IF @EndNode IS NOT NULL
                IF EXISTS (SELECT TOP 1 1 FROM #Discovered WHERE  DiscoveredUser = @EndNode)
                    BREAK   

            -- We need to group by ToNode and select one FromNode since multiple
            -- edges could lead us to new same node, and we only want to insert it once.
            INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
            (SELECT mc.called_party, mc.calling_party, mc.link_strength, d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.calling_party
            WHERE mc.called_party NOT IN (SELECT  DiscoveredUser From #Discovered) AND mc.link_strength > @LinkStrength
            UNION
            SELECT mc.calling_party, mc.called_party, mc.link_strength * (-1), d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.called_party
            WHERE mc.calling_party NOT IN (SELECT  DiscoveredUser FROM #Discovered) AND mc.link_strength > @LinkStrength
            )
        END;

        -- Select the results. We use a recursive common table expression to
        -- get the full path from the start node to the current node.
        WITH BacktraceCTE(Predecessor,  DiscoveredUser, LinkStrength, OrderDiscovered, Path)
        AS
        (
            -- Anchor/base member of the recursion, this selects the start node.
            SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered, 
                CAST(n. DiscoveredUser AS varchar(MAX))
            FROM #Discovered d JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
            WHERE d. DiscoveredUser = @StartNode

            UNION ALL

            -- Recursive member, select all the nodes which have the previous
            -- one as their predecessor. Concat the paths together.
            SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
                CAST(cte.Path + ',' + CAST(n. DiscoveredUser as varchar(30)) as varchar(MAX))
            FROM #Discovered d JOIN BacktraceCTE cte ON d.Predecessor = cte. DiscoveredUser
            JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
        )

        SELECT Predecessor,  DiscoveredUser, LinkStrength, OrderDiscovered, Path FROM BacktraceCTE
        WHERE  DiscoveredUser = @EndNode OR @EndNode IS NULL -- This kind of where clause can potentially produce
        ORDER BY OrderDiscovered                -- a bad execution plan, but I use it for simplicity here.

        DROP TABLE #Discovered
        COMMIT TRAN
        RETURN 0
    END

The Graph (social network) which I am currently analyzing has 28M connections and without ignoring weak connections (set treshold with @LinkStrength) the execution is running very long time (so far I haven't got any results and will try to leave it running over night).

Anyway the next step is to calculate the shortest (strongest) links between two users (there are around 3M users). I was thinking about using Djikstra algorithm but am not sure whether it is possible to analyze such network on the PC I am currently using (Quad CPU 2.66 GHz, 4GB RAM) and the data is stored in MS SQL Server 2008 database.

To summarize I would appreciate to get some answers/suggestions for the following questions:

  1. Is it possible to execute the
    queries as complex as one above for
    described graph (28M connections, 3M
    users) on the described machine
    (2.66 GHz, 4GB RAM)?
  2. If it is not possible are there any
    other possible approaches by which
    the execution time could be reduced
    (e.g. creating tables with partial
    results)?
  3. Do you recommend any other
    algorithms for detecting clusters
    and calculating shortest paths for
    the described graph?

Thank you!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

演出会有结束 2024-10-08 07:57:15

首先,使用索引

其次,您需要减少内存需求。这意味着首先为 VARCHAR(50) 提供一个短别名,例如 int,它是 4 个字节而不是 50 个字节。这将使您的速度提高 10 倍。

declare @tmpPeople table(
  ixPerson int identity primary key,
  UserNodeID varchar(50),
  unique(UserNodeID, ix) -- this creates an index
)
Insert @tmpPeople(UserNodeID) select UserNodeID from NormalPeopleTable
declare @relationships table(
  ixParent int,
  ixChild int,
  unique(ixParent, ixChild),
  unique(ixChild, ixParent)
)
insert @relationships(ixParent, ixChild)
select distinct p.ixPerson, c.ixPerson
from NormalRelationshipsTable nr
inner join @tmpPeople p on p.UserNodeID = nr.ParentUserNodeID
inner join @tmpPeople c on c.UserNodeID = nr.ChildUserNodeID

-- OK now got a copy of all relationships, but it is a fraction of the size
-- because we are using integers for the keys.
-- if we need to we can insert the reverse relationships too.

您需要编写一个执行您想要的操作的查询,而不是“通用”的查询。

如果你想找到两个节点之间的最短距离,你可以通过同时从两端搜索来减少搜索时间。

declare @p1 table(
ix int identity primary key,
ixParent int,
ixChild int,
nDeep int,
-- Need indexes
unique(ixParent, ixChild, nDeep, ix),
unique(ixChild, ixParent, nDeep, ix)
)
-- That's now indexed both ways. 
-- If you only need one, you can comment the other out.
-- define @p2 the same

insert @p1 (ixParent, ixChild, nDeep) select @ixUserFrom, @ixUserFrom, 0
insert @p2 ..... @ixUserTo, @ixUserTo, 0

-- Breadth first goes like this.
-- Just keep repeating it till you get as far as you want.
insert @p1 (ixParent, ixChild, nDeep)
select
p1.ixChild, r.ixChild, p1.nDeep+1
from @p1 p1 inner join @relationships r on r.ixParent = p1.ixChild
-- may want to exclude those already in the table
where not exists (
    select 1 from @p1 p1 
    where p1.ixParent = p.ixChild and p1.ixChild = r.ixChild
)

对于“从 Alice 到 Bob 的距离”,您并行执行两次搜索,并在 Alice 的搜索包含 Bob 的搜索中也包含的任何人时停止。这还会将您的时间缩短 n^2 倍,其中 n 是平均连接数。

如果深度变得太多,请不要忘记停止。

First, use indexes.

Second, you need to reduce your memory requirements. That means first provide a short alias for your VARCHAR(50), such as int which is 4 bytes instead of 50. That will get you a 10x speedup.

declare @tmpPeople table(
  ixPerson int identity primary key,
  UserNodeID varchar(50),
  unique(UserNodeID, ix) -- this creates an index
)
Insert @tmpPeople(UserNodeID) select UserNodeID from NormalPeopleTable
declare @relationships table(
  ixParent int,
  ixChild int,
  unique(ixParent, ixChild),
  unique(ixChild, ixParent)
)
insert @relationships(ixParent, ixChild)
select distinct p.ixPerson, c.ixPerson
from NormalRelationshipsTable nr
inner join @tmpPeople p on p.UserNodeID = nr.ParentUserNodeID
inner join @tmpPeople c on c.UserNodeID = nr.ChildUserNodeID

-- OK now got a copy of all relationships, but it is a fraction of the size
-- because we are using integers for the keys.
-- if we need to we can insert the reverse relationships too.

You need to write a query which does what you want, not something "generic".

If you want to find the shortest distance between two nodes, you can cut your search time by searching from both ends at once.

declare @p1 table(
ix int identity primary key,
ixParent int,
ixChild int,
nDeep int,
-- Need indexes
unique(ixParent, ixChild, nDeep, ix),
unique(ixChild, ixParent, nDeep, ix)
)
-- That's now indexed both ways. 
-- If you only need one, you can comment the other out.
-- define @p2 the same

insert @p1 (ixParent, ixChild, nDeep) select @ixUserFrom, @ixUserFrom, 0
insert @p2 ..... @ixUserTo, @ixUserTo, 0

-- Breadth first goes like this.
-- Just keep repeating it till you get as far as you want.
insert @p1 (ixParent, ixChild, nDeep)
select
p1.ixChild, r.ixChild, p1.nDeep+1
from @p1 p1 inner join @relationships r on r.ixParent = p1.ixChild
-- may want to exclude those already in the table
where not exists (
    select 1 from @p1 p1 
    where p1.ixParent = p.ixChild and p1.ixChild = r.ixChild
)

For a "distance from Alice to Bob" you do two searches in parallel, and stop when Alice's search contains anyone also contained in Bob's search. That also cuts your time down by a factor of n^2 where n is the average number of connections.

Don't forget to stop if the depth gets too much.

梅窗月明清似水 2024-10-08 07:57:15

如果你想要一个确切的答案,无论你追求广度优先还是深度优先并不重要。确切的答案需要详尽的搜索,这会很慢。

正如 fmark 所建议的,启发式方法可以帮助您找到具有合理确定性的潜在最大解决方案。它会节省你很多时间,但并不准确。

您必须选择速度或精确度,不能真正两者兼得。这就像照片(或声音或视频)的图像压缩:对于大多数自然场景照片,png 是无损的,但压缩程度不高,jpeg 压缩得很好,但有一些损失。

编辑1:我能想到的唯一可以帮助您进行精确搜索的是稀疏矩阵的数学理论。您的问题类似于将社会关系强度矩阵提升到一系列不同的幂(A 和 B 之间的幂 n = n 步)并找出每个 (A, B) 对的哪些单元格具有最高值。这就是您对查询所做的事情,仅数据库查询可能不是实现此目的的最快方法。

不过,我无法为你提供更多帮助。您可能需要查看稀疏矩阵维基百科

编辑2:我刚刚又想到了一件事。我不明白你如何剔除你知道使用 SQL 查询肯定会很弱的分支,而使用定制的算法来处理你的稀疏矩阵,应该很容易剔除你知道可以消除的分支,基于在你的力量模型上。

Whether you go for breadth first or depth first does not really matter if you want an exact answer. Exact answers will require exhaustive search, which will be slow.

Like fmark suggests, a heuristic could help you find a potentially maximal solution with a reasonable degree of certainty. It will save you a lot of time, but it will not be exact.

You have to choose speed or exactitude, you can't really have both. It's like image compression for photographs (or sound or video): with most photographs of natural scenes png is lossless but does not compress much, jpeg compresses quite well but with some loss.

EDIT 1: The only thing I can think of that could help you with the exact search is the mathematical theory of sparse matrices. Your problem is akin to elevating the social relations strength matrix to a range of different powers (power n = n steps between person A and B) and finding out which cells have the highest values for each (A, B) pair. This is what you are doing with your query, only a DB query might not be the fastest way to achieve this.

I can't help you more with this though. You might want to check out Wikipedia for Sparse Matrix.

EDIT 2: I just thought about one more thing. I can't see how you can cull out branches that you know will be definitely weak with an SQL query whereas with a tailored algorithm to work on your sparse matrix, it should be easy to cull out branches that you know can be eliminated, based on your strength model.

淡淡的优雅 2024-10-08 07:57:15

在进行分析之前,首先迁移到图形数据库也许会有所帮助。我个人没有使用过它们,但有人建议我尝试 neo4j

华泰

Perhaps it would help to first migrate to a Graph DB before doing your analytics. I've not used them personally, but had been recommended that I try neo4j.

HTH

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文