修剪大图的杂散节点
我有一个由大约 35,000 个以纯文本表示的节点组成的图表:
node1 -> node35000
node29420 -> node35000
node2334 -> node4116
...
我想通过删除不属于至少三个长的链的节点来修剪它。因此,如果我只有
1 -> 2;
2 -> 3;
3 -> 4;
0 -> 4;
1、2、3 和 4(因为 1 -> 2 -> 3 -> 4
是四个节点长)但丢弃 0,即删除 0 -> 4.
.
有什么好方法可以做到这一点吗?我尝试了 Perl 和 shell 函数的组合,但我认为我需要更好的方法。除非已经有工具可以做到这一点?数据采用 graphviz 格式,但我没有在该套件中看到任何与手头任务相关的工具。
哦,如果有一种简单的方法来做这样的事情,我愿意接受建议——它不需要完全是我建议的任务。我只是在寻找一种方法来消除大团块周围的大部分噪音(这种团块很少见,而且大多只是一些相交的链)。
I have a graph consisting of about 35,000 nodes represented in plain text:
node1 -> node35000
node29420 -> node35000
node2334 -> node4116
...
I'd like to trim it down by removing nodes that are not part of a chain at least three long. So if I had only
1 -> 2;
2 -> 3;
3 -> 4;
0 -> 4;
I'd like to keep 1, 2, 3, and 4 (since 1 -> 2 -> 3 -> 4
is four nodes long) but discard 0, that is, remove 0 -> 4
.
Any idea of a good way to do this? I tried a combination of Perl and shell functions but I think I need a better approach. Unless maybe there are tools to do this already? The data is in graphviz format but I didn't see any tools in that suite relevant to the task at hand.
Oh, and if there's an easy way to do something like this I'm open to suggestions -- it doesn't need to be exactly the task I suggested. I'm just looking for a way to remove most of the noise surrounding the big clumps (which are rare and mostly just a few intersecting chains).
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
该工具 gvpr 是 graphviz 工具 允许将规则应用于图形并输出修改后的图形。
从描述来看:
看起来您想要删除所有入度为 0 的节点,并且仅具有出度为 0 的链接节点(后继节点)。
这是我的
gvpr script
nostraynodes.gv
:脚本的作用如下:
循环所有边一次并填充两个列表:
那么什么会添加到哪个列表中?
此解决方案不是通用,仅适用于问题中所述的问题,即至少仅保留链3 个节点长。它也不会删除短循环(两个节点相互连接)。
您可以使用以下行调用它:
使用示例图的输出是:
请注意,这是我的第一个
gvpr
脚本 - 可能有更好的方法来编写此脚本,但我不确定如何编写这可以处理 35000 个节点,尽管我相信这不是什么大问题。另请参阅Graphviz/Dot - 如何用独特的颜色标记树中的所有叶子? 一个更简单的图形转换示例。
The tool gvpr which is part of the graphviz tools allows to apply rules to a graph and output the modified graph.
From the description:
It looks like you want to remove all nodes having an indegree of 0 and having only linked nodes (successors) with an outdegree of 0.
Here's my version of a
gvpr
scriptnostraynodes.gv
:Here's what the script does:
Loop through all edges one time and populate two lists:
So what gets added to which list?
This solution is not generic and only works for the problem stated in the question, that is keeping only chains at least 3 nodes long. It also won't delete short loops (two nodes connected to each other).
You can call this using the following line:
The output using your sample graph is:
Please note that this is my first
gvpr
script - there are probably better ways to write this, and I'm not sure how this handles 35000 nodes, though I'm confident this should not be a big deal.See also Graphviz/Dot - how to mark all leaves in a tree with a distinctive color? for a simpler example of graph transformation.
Gephi 是一个优秀的开源 GUI 工具,用于可视化和操作图形,您可能会找到某种在那里过滤这类事情......也许度数过滤器可以做到:它会删除只有一条边的节点。您还可以过滤入度、出度,可以计算 PageRank 等。它还有一些非常好的尺寸/标签/颜色选项,并且很容易放大/缩小。
Gephi is an excellent open-source GUI tool for visualizing and manipulating graphs, and you will probably be able to find some kind of filter in there for this sort of thing... Maybe a degree filter would do: it would remove nodes which only have one edge. You can also filter on in-degree, out-degree, you can compute PageRank etc. It's also got some really nice size/label/colour options and is easy to zoom in/out of.
假设任何给定节点可以有任意多个前驱或后继,则节点的入度和出度与解决问题无关。
以下是在路径长度 3 准则下针对 N 个节点和 E 个边的所有图的简单 O(N+E) 算法。该算法可以很容易地用 Perl 或 C 实现。该方法基于定义和断言:将“生成节点”定义为具有父节点和子节点(前驱节点和后继节点)的任何节点。将保留的每个节点都是已创建节点,或者是已创建节点的父节点或子节点。
将状态数组 S[Nmax] 初始化为全零。 Nmax 是最大节点数。如果一开始不知道Nmax,则读取所有数据并找出它。
读入给定的边列表。每个输入项指定从节点 p 到节点 q 的有向边 (p, q)。对于读入的每个 (p, q) 项:将 S[p] 设置为 S[p] | 1 表示 p 有一个孩子,并将 S[q] 设置为 S[q] | 2 表示 q 有一个父对象。 (在此步骤之后,每个生成的节点 n 都有 S[n] == 3。)
再次读取边列表。对于读入的每个 (p, q) 项: If (S[p]==3) or (S[q] == 3) 输出边 (p,q)。
要将此方法扩展到除 3 之外的路径长度 K,请将边列表保留在内存中,使用父链和子链的长度维护 Sp[] 和 Sc[],并执行 K/2 次额外传递。也许可以在 O(N+K*E) 时间内完成。
该问题没有指定图是否是 DAG(有向无环图),但给出的示例是 DAG。对于K>3,这可能会产生影响。
更新 1 这是 K>3 算法的更精确的表述,其中 H[i].p 和 H[i].q 是边 #i 的端点,pc[j], cc[ j] 是关于节点 j 的前驱链和后继链的长度。另外,令 E = 边数; N = 节点数; K = 保持边缘所需的最小链条长度。
将 E 边沿数据条目读入 H[ ] 数组。将所有 pc[j]、cc[j] 条目设置为 0。
对于 i = 1 到 E,设置 cc[H[i].p]=1 和 pc[H[i].q]=1。
对于 j = 1 到 K+1,{ 对于 i = 1 到 E,{ 令 p=H[i].p 且 q=H[i].q。设置 cc[p] = max(cc[p], 1+cc[q]) 和 pc[q] = max(pc[q], 1+pc[p])。 } }
对于 i = 1 到 E,{ 令 p=H[i].p 且 q=H[i].q。如果 pc[p]+cc[p]+1 >= K 且 pc[q]+cc[q]+1 >= K,则输出边缘 (p,q)。}
如果图不是,则此方法可能会出错一个 DAG 并包含短循环路径。例如,如果图边包括 (1,2) 和 (2,1),并且没有其他节点链接到节点 1 或 2,则不应输出这些边;但我们最终得到这些节点的 cc[] 和 pc[] 的 K+2,因此它们无论如何都会得到输出。
Supposing that any given node can have arbitrarily many predecessors or successors, then in-degree and out-degree of nodes is irrelevant to solving the problem.
Following is a simple O(N+E) algorithm for all graphs of N nodes and E edges, under the path-length-3 criterion. This algorithm can be easily implemented in Perl or C. The method is based on a definition and an assertion: Define a "made node" as any node that has a parent and child (predecessor and successor). Every node that will be kept is a made node or is a parent or child of a made node.
Initialize a status array S[Nmax] to all zeroes. Nmax is the maximum node number. If Nmax is not known at outset, read all the data and find it out.
Read in the given list of edges. Each input item specifies a directed edge (p, q) from node p to node q. For each (p, q) item that is read in: Set S[p] to S[p] | 1 to denote that p has a child, and set S[q] to S[q] | 2 to denote that q has a parent. (After this step, every made node n has S[n] == 3.)
Read the list of edges again. For each (p, q) item that is read in: If (S[p]==3) or (S[q] == 3) output edge (p,q).
To extend this method to path length K other than 3, keep the edge list in memory, maintain Sp[] and Sc[] with lengths of parent chains and child chains, and perform K/2 extra passes. Might be possible to do in time O(N+K*E).
The problem does not specify whether the graph is a DAG (directed acyclic graph) but the example given is a DAG. For K>3, it may make a difference.
Update 1 Here's a more precise statement of a K>3 algorithm, with H[i].p and H[i].q being endpoints of edge #i, and pc[j], cc[j] being lengths of predecessor and successor chains about node j. Also, let E = # of edges; N = # of nodes; and K = desired min chain length for keeping an edge.
Read E edge data entries into H[ ] array. Set all pc[j], cc[j] entries to 0.
For i = 1 to E, set cc[H[i].p]=1 and pc[H[i].q]=1.
For j = 1 to K+1, { for i = 1 to E, { Let p=H[i].p and q=H[i].q. Set cc[p] = max(cc[p], 1+cc[q]) and pc[q] = max(pc[q], 1+pc[p]). } }
For i = 1 to E, { Let p=H[i].p and q=H[i].q. Output edge (p,q) if pc[p]+cc[p]+1 >= K and pc[q]+cc[q]+1 >= K.}
This method can make mistakes if graph is not a DAG and contains short looped paths. For example, if graph edges include (1,2) and (2,1) and no other nodes link to nodes 1 or 2, then neither of those edges should be output; but we end up with K+2 for cc[] and pc[] of those nodes, so they get output anyway.