修剪大图的杂散节点

发布于 2024-12-03 16:14:01 字数 570 浏览 1 评论 0原文

我有一个由大约 35,000 个以纯文本表示的节点组成的图表：

node1 -> node35000
node29420 -> node35000
node2334 -> node4116
...

我想通过删除不属于至少三个长的链的节点来修剪它。因此，如果我只有

1 -> 2;
2 -> 3;
3 -> 4;
0 -> 4;

1、2、3 和 4（因为 1 -> 2 -> 3 -> 4 是四个节点长）但丢弃 0，即删除 0 -> 4..

有什么好方法可以做到这一点吗？我尝试了 Perl 和 shell 函数的组合，但我认为我需要更好的方法。除非已经有工具可以做到这一点？数据采用 graphviz 格式，但我没有在该套件中看到任何与手头任务相关的工具。

哦，如果有一种简单的方法来做这样的事情，我愿意接受建议——它不需要完全是我建议的任务。我只是在寻找一种方法来消除大团块周围的大部分噪音（这种团块很少见，而且大多只是一些相交的链）。

原文

I have a graph consisting of about 35,000 nodes represented in plain text:

node1 -> node35000
node29420 -> node35000
node2334 -> node4116
...

I'd like to trim it down by removing nodes that are not part of a chain at least three long. So if I had only

1 -> 2;
2 -> 3;
3 -> 4;
0 -> 4;

I'd like to keep 1, 2, 3, and 4 (since 1 -> 2 -> 3 -> 4 is four nodes long) but discard 0, that is, remove 0 -> 4.

Any idea of a good way to do this? I tried a combination of Perl and shell functions but I think I need a better approach. Unless maybe there are tools to do this already? The data is in graphviz format but I didn't see any tools in that suite relevant to the task at hand.

Oh, and if there's an easy way to do something like this I'm open to suggestions -- it doesn't need to be exactly the task I suggested. I'm just looking for a way to remove most of the noise surrounding the big clumps (which are rare and mostly just a few intersecting chains).

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

永言不败 2024-12-10 16:14:01

该工具 gvpr 是 graphviz 工具允许将规则应用于图形并输出修改后的图形。

从描述来看：

它将输入图复制到其输出，可能会改变它们的
结构和属性，创建新图，...

看起来您想要删除所有入度为 0 的节点，并且仅具有出度为 0 的链接节点（后继节点）。

这是我的 gvpr script nostraynodes.gv ：

BEGIN {node_t n; int candidates[]; int keepers[];}
E{
  if (tail.indegree == 0 && head.outdegree == 0)
  {
    candidates[tail] = 1;
    candidates[head] = 1;
  }
  else if (tail.indegree == 0)
  {
    keepers[tail] = 1;
  }
  else if (head.outdegree == 0)
  {
    keepers[head] = 1;
  }
}

END_G {
  for (candidates[n]){
    if (n in keepers == 0)
    {
       delete(NULL, n);
    }
  }
}

脚本的作用如下：

循环所有边一次并填充两个列表：
- 候选 - 可能需要删除的节点列表，以及
- keepers - 可能最终成为候选者但不应被删除的节点列表。
那么什么会添加到哪个列表中？
- 任意两个相互连接的节点，其中尾节点没有任何传入边并且头节点没有任何传出边，形成仅包含 2 个节点的链，因此是候选节点被删除；也就是说，除非相同的节点是长度超过 2 个节点的其他链的一部分：
- 没有任何传入边但连接到本身具有传出边的头节点的尾节点是一个守护者；和
- 没有任何传出边但连接到本身具有传入边的尾节点的头节点也是一个守护者。
删除所有不在keepers中的候选人

此解决方案不是通用，仅适用于问题中所述的问题，即至少仅保留链3 个节点长。它也不会删除短循环（两个节点相互连接）。

您可以使用以下行调用它：

gvpr -c -f .\nostraynodes.gv .\graph.dot

使用示例图的输出是：

digraph g {
    1 -> 2;
    2 -> 3;
    3 -> 4;
}

请注意，这是我的第一个 gvpr 脚本 - 可能有更好的方法来编写此脚本，但我不确定如何编写这可以处理 35000 个节点，尽管我相信这不是什么大问题。

另请参阅Graphviz/Dot - 如何用独特的颜色标记树中的所有叶子？一个更简单的图形转换示例。

The tool gvpr which is part of the graphviz tools allows to apply rules to a graph and output the modified graph.

From the description:

It copies input graphs to its output, possibly transforming their
structure and attributes, creating new graphs, ...

It looks like you want to remove all nodes having an indegree of 0 and having only linked nodes (successors) with an outdegree of 0.

Here's my version of a gvpr script nostraynodes.gv :

BEGIN {node_t n; int candidates[]; int keepers[];}
E{
  if (tail.indegree == 0 && head.outdegree == 0)
  {
    candidates[tail] = 1;
    candidates[head] = 1;
  }
  else if (tail.indegree == 0)
  {
    keepers[tail] = 1;
  }
  else if (head.outdegree == 0)
  {
    keepers[head] = 1;
  }
}

END_G {
  for (candidates[n]){
    if (n in keepers == 0)
    {
       delete(NULL, n);
    }
  }
}

Here's what the script does:

Loop through all edges one time and populate two lists:
- candidates - a list of nodes which may have to be removed, and
- keepers - a list of nodes which may end up in candidates but should not be removed.
So what gets added to which list?
- Any two nodes connected to each other, where the tail node does not have any incoming edges and the head node does not have any outgoing edges, form a chain of only 2 nodes and are therefore candidates to be deleted; that is, unless the same nodes are part of an other chain longer than 2 nodes:
- A tail node without any incoming edges, but connected to a head node which itself has outgoing edges, is a keeper; and
- A head node without any outgoing edges, but connected to a tail node which itself has incoming edges, is also a keeper.
Delete all candidates not in keepers

This solution is not generic and only works for the problem stated in the question, that is keeping only chains at least 3 nodes long. It also won't delete short loops (two nodes connected to each other).

You can call this using the following line:

gvpr -c -f .\nostraynodes.gv .\graph.dot

The output using your sample graph is:

digraph g {
    1 -> 2;
    2 -> 3;
    3 -> 4;
}

Please note that this is my first gvpr script - there are probably better ways to write this, and I'm not sure how this handles 35000 nodes, though I'm confident this should not be a big deal.

See also Graphviz/Dot - how to mark all leaves in a tree with a distinctive color? for a simpler example of graph transformation.

回复收藏 0 原文

等风也等你 2024-12-10 16:14:01

Gephi 是一个优秀的开源 GUI 工具，用于可视化和操作图形，您可能会找到某种在那里过滤这类事情......也许度数过滤器可以做到：它会删除只有一条边的节点。您还可以过滤入度、出度，可以计算 PageRank 等。它还有一些非常好的尺寸/标签/颜色选项，并且很容易放大/缩小。

回复收藏 0 原文

清秋悲枫 2024-12-10 16:14:01

假设任何给定节点可以有任意多个前驱或后继，则节点的入度和出度与解决问题无关。

以下是在路径长度 3 准则下针对 N 个节点和 E 个边的所有图的简单 O(N+E) 算法。该算法可以很容易地用 Perl 或 C 实现。该方法基于定义和断言：将“生成节点”定义为具有父节点和子节点（前驱节点和后继节点）的任何节点。将保留的每个节点都是已创建节点，或者是已创建节点的父节点或子节点。

将状态数组 S[Nmax] 初始化为全零。 Nmax 是最大节点数。如果一开始不知道Nmax，则读取所有数据并找出它。
读入给定的边列表。每个输入项指定从节点 p 到节点 q 的有向边 (p, q)。对于读入的每个 (p, q) 项：将 S[p] 设置为 S[p] | 1 表示 p 有一个孩子，并将 S[q] 设置为 S[q] | 2 表示 q 有一个父对象。（在此步骤之后，每个生成的节点 n 都有 S[n] == 3。）
再次读取边列表。对于读入的每个 (p, q) 项： If (S[p]==3) or (S[q] == 3) 输出边 (p,q)。

要将此方法扩展到除 3 之外的路径长度 K，请将边列表保留在内存中，使用父链和子链的长度维护 Sp[] 和 Sc[]，并执行 K/2 次额外传递。也许可以在 O(N+K*E) 时间内完成。
该问题没有指定图是否是 DAG（有向无环图），但给出的示例是 DAG。对于K＞3，这可能会产生影响。

更新 1 这是 K>3 算法的更精确的表述，其中 H[i].p 和 H[i].q 是边 #i 的端点，pc[j], cc[ j] 是关于节点 j 的前驱链和后继链的长度。另外，令 E = 边数； N = 节点数； K = 保持边缘所需的最小链条长度。

将 E 边沿数据条目读入 H[ ] 数组。将所有 pc[j]、cc[j] 条目设置为 0。
对于 i = 1 到 E，设置 cc[H[i].p]=1 和 pc[H[i].q]=1。
对于 j = 1 到 K+1，{ 对于 i = 1 到 E，{ 令 p=H[i].p 且 q=H[i].q。设置 cc[p] = max(cc[p], 1+cc[q]) 和 pc[q] = max(pc[q], 1+pc[p])。 } }
对于 i = 1 到 E，{ 令 p=H[i].p 且 q=H[i].q。如果 pc[p]+cc[p]+1 >= K 且 pc[q]+cc[q]+1 >= K，则输出边缘 (p,q)。}

如果图不是，则此方法可能会出错一个 DAG 并包含短循环路径。例如，如果图边包括 (1,2) 和 (2,1)，并且没有其他节点链接到节点 1 或 2，则不应输出这些边；但我们最终得到这些节点的 cc[] 和 pc[] 的 K+2，因此它们无论如何都会得到输出。

Supposing that any given node can have arbitrarily many predecessors or successors, then in-degree and out-degree of nodes is irrelevant to solving the problem.

Following is a simple O(N+E) algorithm for all graphs of N nodes and E edges, under the path-length-3 criterion. This algorithm can be easily implemented in Perl or C. The method is based on a definition and an assertion: Define a "made node" as any node that has a parent and child (predecessor and successor). Every node that will be kept is a made node or is a parent or child of a made node.

Initialize a status array S[Nmax] to all zeroes. Nmax is the maximum node number. If Nmax is not known at outset, read all the data and find it out.
Read in the given list of edges. Each input item specifies a directed edge (p, q) from node p to node q. For each (p, q) item that is read in: Set S[p] to S[p] | 1 to denote that p has a child, and set S[q] to S[q] | 2 to denote that q has a parent. (After this step, every made node n has S[n] == 3.)
Read the list of edges again. For each (p, q) item that is read in: If (S[p]==3) or (S[q] == 3) output edge (p,q).

To extend this method to path length K other than 3, keep the edge list in memory, maintain Sp[] and Sc[] with lengths of parent chains and child chains, and perform K/2 extra passes. Might be possible to do in time O(N+K*E).
The problem does not specify whether the graph is a DAG (directed acyclic graph) but the example given is a DAG. For K>3, it may make a difference.

Update 1 Here's a more precise statement of a K>3 algorithm, with H[i].p and H[i].q being endpoints of edge #i, and pc[j], cc[j] being lengths of predecessor and successor chains about node j. Also, let E = # of edges; N = # of nodes; and K = desired min chain length for keeping an edge.

Read E edge data entries into H[ ] array. Set all pc[j], cc[j] entries to 0.
For i = 1 to E, set cc[H[i].p]=1 and pc[H[i].q]=1.
For j = 1 to K+1, { for i = 1 to E, { Let p=H[i].p and q=H[i].q. Set cc[p] = max(cc[p], 1+cc[q]) and pc[q] = max(pc[q], 1+pc[p]). } }
For i = 1 to E, { Let p=H[i].p and q=H[i].q. Output edge (p,q) if pc[p]+cc[p]+1 >= K and pc[q]+cc[q]+1 >= K.}

This method can make mistakes if graph is not a DAG and contains short looped paths. For example, if graph edges include (1,2) and (2,1) and no other nodes link to nodes 1 or 2, then neither of those edges should be output; but we end up with K+2 for cc[] and pc[] of those nodes, so they get output anyway.

回复收藏 0 原文

~没有更多了~