用于计算有向图上非循环路径数量的快速算法
简而言之,我需要一个快速算法来计算简单有向图中有多少条非循环路径。
我所说的“简单”图是指没有自循环或多个边的图。 路径可以从任何节点开始,并且必须以没有传出边的节点结束。如果一条路径中没有边出现两次,则该路径是非循环。
我的图(经验数据集)只有 20-160 个节点,但是,其中一些节点有很多循环,因此会有大量路径,而我的天真的方法对于某些图来说根本不够快我有。
我当前正在做的是使用递归函数沿着所有可能的边缘“下降”,同时跟踪我已经访问过的节点(并避免它们)。迄今为止最快的解决方案是用 C++ 编写的,并在递归函数中使用 std::bitset 参数来跟踪已访问的节点(已访问的节点由位 1 标记)。该程序在示例数据集上运行 1-2 分钟(取决于计算机速度)。对于其他数据集,运行需要一天多的时间,或者显然更长。
示例数据集:http://pastie.org/1763781 (每行都是一个边对)
示例数据集的解决方案(第一个数字是我开始的节点,第二个数字是从该节点开始的路径计数,最后一个数字是总路径计数): http://pastie.org/1763790
如果您对更好的算法有任何想法,请告诉我复杂。我也对近似解决方案感兴趣(使用某种蒙特卡罗方法估计路径数量)。最终我还想测量平均路径长度。
编辑:也在 MathOverflow 上以相同的标题发布,因为它可能在那里更相关。希望这不违反规则。无法链接,因为网站不允许超过 2 个链接...
In short, I need a fast algorithm to count how many acyclic paths are there in a simple directed graph.
By simple graph I mean one without self loops or multiple edges.
A path can start from any node and must end on a node that has no outgoing edges. A path is acyclic if no edge occurs twice in it.
My graphs (empirical datasets) have only between 20-160 nodes, however, some of them have many cycles in them, therefore there will be a very large number of paths, and my naive approach is simply not fast enough for some of the graph I have.
What I'm doing currently is "descending" along all possible edges using a recursive function, while keeping track of which nodes I have already visited (and avoiding them). The fastest solution I have so far was written in C++, and uses std::bitset argument in the recursive function to keep track of which nodes were already visited (visited nodes are marked by bit 1). This program runs on the sample dataset in 1-2 minutes (depending on computer speed). With other datasets it takes more than a day to run, or apparently much longer.
The sample dataset: http://pastie.org/1763781
(each line is an edge-pair)
Solution for the sample dataset (first number is the node I'm starting from, second number is the path-count starting from that node, last number is the total path count):
http://pastie.org/1763790
Please let me know if you have ideas about algorithms with a better complexity. I'm also interested in approximate solutions (estimating the number of paths with some Monte Carlo approach). Eventually I'll also want to measure the average path length.
Edit: also posted on MathOverflow under same title, as it might be more relevant there. Hope this is not against the rules. Can't link as site won't allow more than 2 links ...
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
看起来这就是#P-完整的。 (参考http://www.maths.uq.edu.au /~kroese/ps/robkro_rev.pdf)。该链接有一个近似值
如果您可以放宽简单路径要求,您也可以使用 Floyd-Warshall 的修改版本或图形指数有效地计算路径数。请参阅所有对图表上的所有路径
This is #P-complete, it seems. (ref http://www.maths.uq.edu.au/~kroese/ps/robkro_rev.pdf). The link has an approximation
If you can relax the simple path requirement, you can efficiently count the number of paths using a modified version of Floyd-Warshall or graph exponentiation as well. See All pairs all paths on a graph
正如 spin_plate 所提到的,这个问题是 #P-complete 所以开始寻找你的近似值:)。我真的很喜欢这个问题的#P-完整性证明,所以我认为分享它会很好:
设 N 为图中的路径数(从 s 开始)和 p_k是长度k的路径数。我们有:
现在通过将每条边更改为一对平行边来构建第二个图。对于长度为 k 的每条路径,现在将有 k^2 条路径,因此:
重复此过程,但使用 i 条边而不是 2, up n,将为我们提供一个线性系统(带有范德蒙矩阵),使我们能够找到 p_1, ..., p_n。
因此,查找图中的路径数量与查找特定长度的路径数量一样困难。特别是,p_n 是哈密顿路径的数量(从 s 开始),这是一个真正的 #P 完全问题。
我还没有做过数学计算,我也猜测类似的过程应该能够证明仅计算平均长度也很困难。
注意:大多数时候讨论这个问题的路径都是从单边开始并在任何地方停止。这与您的问题相反,但您应该通过反转所有边缘来等效。
As mentioned by spinning_plate, this problem is #P-complete so start looking for your aproximations :). I really like the #P-completeness proof for this problem, so I'd think it would be nice to share it:
Let N be the number of paths (starting at s) in the graph and p_k be the number of paths of length k. We have:
Now build a second graph by changing every edge to a pair of paralel edges.For each path of length k there will now be k^2 paths so:
Repeating this process, but with i edges instead of 2, up n, would give us a linear system (with a Vandermonde matrix) allowing us to find p_1, ..., p_n.
Therefore, finding the number of paths in the graph is just as hard as finding the number of paths of a certain length. In particular, p_n is the number of Hamiltonian Paths (starting at s), a bona-fide #P-complete problem.
I havent done the math I'd also guess that a similar process should be able to prove that just calculating average length is also hard.
Note: most times this problem is discussed the paths start from a single edge and stop wherever. This is the opposite from your problem, but you they should be equivalent by just reversing all the edges.
问题陈述的重要性
目前尚不清楚正在计算什么。
定义你的问题,以免出现歧义。
估计
当为随机构建的有向图设计时,估计可能会出现几个数量级的偏差,并且该图在其构造上在统计上非常倾斜或系统化。这是所有估计过程的典型特征,但在图表中尤其明显,因为它们具有指数模式复杂性潜力。
两个优化点
对于大多数处理器架构,std::bitset 模型将比 bool 值慢,因为指令集机制会在特定位偏移处测试位。当内存占用而不是速度是关键因素时,位集更有用。
消除案件或减少通过扣除很重要。例如,如果存在只有一个出边的节点,则可以计算没有该出边的路径数,并将从其指向的节点开始的路径数添加到子图中的路径数。
诉诸集群
问题可以按照起始节点分布在集群上执行。有些问题只需要超级计算即可解决。如果您有 1,000,000 个起始节点和 10 个处理器,则可以在每个处理器上放置 100,000 个起始节点案例。上述案例的消除和减少应在分发案例之前完成。
典型的深度优先递归以及如何优化它
这是一个小程序,提供基本的深度优先、从任何节点到任何节点的非循环遍历,可以更改、放置在循环中或分布式。如果最大数据集大小已知,则可以使用以大小作为参数的模板将该列表放入静态本机数组中,从而减少迭代和索引时间。
Importance of Problem Statement
It is unclear what is being counted.
Define your problem so that there are no ambiguities.
Estimation
Estimations can be off by orders of magnitude when designed for randomly constructed directed graphs and the graph is very statistically skewed or systematic in its construction. This is typical of all estimation processes, but particularly pronounced in graphs because of their exponential pattern complexity potential.
Two Optimizing Points
The std::bitset model will be slower than bool values for most processor architectures because of the instruction set mechanics of testing a bit at a particular bit offset. The bitset is more useful when memory footprint, not speed is the critical factor.
Eliminating cases or reducing via deductions is important. For instance, if there are nodes for which there is only one outgoing edge, one can calculate the number of paths without it and add to the number of paths in the sub-graph the number of paths from the node from which it points.
Resorting to Clusters
The problem can be executed on a cluster by distributing according to starting node. Some problems simply require super-computing. If you have 1,000,000 starting nodes and 10 processors, you can place 100,000 starting node cases on each processor. The above case eliminations and reductions should be done prior to distributing cases.
A Typical Depth First Recursion and How to Optimize It
Here is a small program that provides a basic depth first, acyclic traversal from any node to any node, which can be altered, placed in a loop, or distributed. The list can be placed into a static native array by using a template with a size as one parameter if the maximum data set size is known, which reduces iteration and indexing times.