子图枚举
用于枚举父图的所有子图的有效算法是什么?在我的特定情况下,父图是分子图,因此它将被连接并且通常包含少于 100 个顶点。
编辑:我只对连接的子图感兴趣。
What is an efficient algorithm for the enumeration of all subgraphs of a parent graph. In my particular case, the parent graph is a molecular graph, and so it will be connected and typically contain fewer than 100 vertices.
Edit: I am only interested in the connected subgraphs.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
这个问题在这个问题的接受答案中有更好的答案。它避免了@ninjagecko 的答案中标记为“您填写上述函数”的计算复杂步骤。它可以有效地处理有多个环的化合物。
有关完整详细信息,请参阅链接的问题,但这是摘要。 (N(v) 表示顶点 v 的邻居集合。在“选择顶点”步骤中,您可以选择任意顶点。)
This question has a better answer in the accepted answer to this question. It avoids the computationally complex step marked "you fill in above function" in @ninjagecko's answer. It can deal efficiently with compounds where there are several rings.
See the linked question for the full details, but here's the summary. (N(v) denotes the set of neighbors of vertex v. In the "choose a vertex" step, you can choose any arbitrary vertex.)
与数学子图的比较:
您可以为每个元素指定一个从 0 到 N 的数字,然后将每个子图枚举为长度为 N 的任意二进制数。您根本不需要扫描该图。
如果您真正想要的是具有不同属性(完全连接等)的子图,那么您需要更新您的问题。正如评论者指出的,2^100 非常大,所以你绝对不想(像上面那样)枚举数学上正确但物理上无聊的断开子图。假设每秒进行 10 亿次枚举,实际上需要至少 40 万亿年才能将它们全部枚举出来。
连通子图生成器:
如果您想要某种枚举来保留某些指标下子图的 DAG 属性,例如 (1,2,3)->(2,3)-> (2), (1,2,3)->(1,2)->(2),您只需要一个可以将所有连接的子图生成为迭代器的算法(产生每个 元素)。这可以通过一次递归地删除一个元素(可选地从“边界”)、检查剩余的元素集是否在缓存中(否则添加它)、产生它并递归来完成。如果你的分子非常呈链状且循环很少,那么这种方法就很有效。例如,如果您的元素是由 N 个元素组成的 5 角星形,则它只会有大约 (100/5)^5 = 320 万个结果(不到一秒)。但如果你开始添加多个环,例如芳香族化合物和其他化合物,你可能会遇到困难。
例如在 python 中
演示:
结果:
Comparison with mathematical subgraphs:
You could give each element a number from 0 to N, then enumerate each subgraph as any binary number of length N. You wouldn't need to scan the graph at all.
If what you really want is subgraphs with a certain property (fully connected, etc.) that is different, and you'd need to update your question. As a commentor noted, 2^100 is very large, so you definitely don't want to (like above) enumerate the mathematically-correct-but-physically-boring disconnected subgraphs. It would literally take you, assuming a billion enumerations per second, at least 40 trillion years to enumerate them all.
Connected-subgraph-generator:
If you want some kind of enumeration that retains the DAG property of subgraphs under some metric, e.g. (1,2,3)->(2,3)->(2), (1,2,3)->(1,2)->(2), you'd just want an algorithm that could generate all CONNECTED subgraphs as an iterator (yielding each element). This can be accomplished by recursively removing a single element at a time (optionally from the "boundary"), checking if the remaining set of elements is in a cache (else adding it), yielding it, and recursing. This works fine if your molecule is very chain-like with very few cycles. For example if your element was a 5-pointed star of N elements, it would only have about (100/5)^5 = 3.2million results (less than a second). But if you start adding in more than a single ring, e.g. aromatic compounds and others, you might be in for a rough ride.
e.g. in python
Demonstration:
Result:
有一种称为 gspan [1] 的算法,已用于计算频繁子图,它也可用于枚举所有子图。您可以在这里找到它的实现 [2]。
这个想法如下:图由所谓的 DFS 代码表示。 DFS 代码对应于图 G 上的深度优先搜索,并具有以下形式的条目
(i, j, l(v_i), l(v_i, v_j), l(v_j)),对于图的每条边 (v_i, v_j),其中顶点下标对应于发现顶点的顺序DFS。可以在所有 DFS 代码的集合上定义全序(如 [1] 中所做的那样),并因此通过计算表示该图的所有 DFS 代码的最小值来获得给定图的规范图标签。这意味着如果两个图具有相同的最小 DFS 代码,则它们是同构的。现在,从长度为 1 的所有可能的 DFS 代码(每条边一个)开始,可以通过随后一次向代码添加一条边来枚举图的所有子图,从而生成一棵枚举树,其中每个节点对应于一个图。如果仔细地进行枚举(即,与DFS代码上的顺序兼容),首先遇到最小的DFS代码。因此,每当遇到非最小的 DFS 代码时,就可以修剪其相应的子树。请参阅[1]了解更多详细信息。
[1] https://sites.cs.ucsb.edu/~xyan /papers/gSpan.pdf
[2] http://www.nowozin.net/sebastian/gboost/
There is this algorithm called gspan [1] that has been used to count frequent subgraphs it can also be used to enumerate all subgraphs. You can find an implementation of it here [2].
The idea is the following: Graphs are represented by so called DFS codes. A DFS code corresponds to a depth first search on a graph G and has an entry of the form
(i, j, l(v_i), l(v_i, v_j), l(v_j)), for each edge (v_i, v_j) of the graph, where the vertex subscripts correspond to the order in which the vertices are discovered by the DFS. It is possible to define a total order on the set of all DFS codes (as is done in [1]) and as a consequence to obtain a canonical graph label for a given graph by computing the minimum over all DFS codes representing this graph. Meaning that if two graphs have the same minimum DFS code they are isomorphic. Now, starting from all possible DFS codes of length 1 (one per edge), all subgraphs of a graph can be enumerated by subsequently adding one edge at a time to the codes which gives rise to an enumeration tree where each node corresponds to a graph. If the enumeration is done carefully (i.e., compatible with the order on the DFS codes) minimal DFS codes are encountered first. Therefore, whenever a DFS code is encountered that is not minimal its corresponding subtree can be pruned. Please consult [1] for further details.
[1] https://sites.cs.ucsb.edu/~xyan/papers/gSpan.pdf
[2] http://www.nowozin.net/sebastian/gboost/