我目前正在处理图形实现的性能问题。
使用的技术
它是用 C++ 编程的。目前,该图是通过 BGL 实现的。
关于图
托管图是动态且无向的。它有两种结构:大量的完整子图和少量的单边。唯一需要的信息是顶点的直接邻域。
问题陈述
一开始,完整的子图很小(大约10个顶点)并且数量很多(大约13k)。 BGL 的邻接表实现是完美的。但现在,它被要求管理 5000 个顶点的少数子图。这意味着 5000x5000 边。那么现在在时间和空间上的表现都很差。
被拒绝的解决方案
我的第一个想法是使用 BGL 提供的邻接矩阵实现。但它不允许动态图。为了解决这个约束,有两种解决方案:为动态图提供邻接矩阵的新实现或在静态图中使用可用顶点池。经过反思,我认为这不是一个好主意:空间复杂度仍然是VxV/2。
最终解决方案和问题
所以,这里是我的最终解决方案:不使用 BGL,实现顶点袋来表示完整的子图(不需要边)并直接连接少数单边的顶点。通过这样做,最大子图的空间复杂度下降到它的顶点数,大约 5000。
- 您认为最后一个解决方案是好的吗?
- 如果没有,我可以使用哪种实现?为什么?
更新 1
有关该图的更多信息:
该图有约 100k 个顶点,约 3 个顶点的约 13k 个完整子图,以及大小范围 [10-5000] 的约 100 个完整子图。每条边都有捆绑的属性。
更新 2
感谢 Salim Jouilli,我最近了解到节点包是一种坦率的方法超图,其中超边包含在节点的子集中。
更新3
我已经完成了解决方案的实施。我有效地提高了内存消耗和运行时间:从 6GB 到 24MB,从 50 分钟到 2 分钟 30。
I'm currently dealing with a performance issue of graph implementation.
Used technologies
It is programmed in C++. For the moment, the graph is implemented thanks to the BGL.
About the graph
The managed graph is dynamic and undirected. It has two kinds of structures: a lot of complete subgraphs and few of single edges. The only needed information is the direct neighborhood of a vertex.
Problem statement
At the beginning, the complete subgraphs were small (about 10 vertices) and numerous (about 13k). An adjacency list implementation, the BGL's one, was perfect. But now, it is asked to manage few subgraph of 5000 vertices. That means 5000x5000 edges. The performance in time and space are then very poor now.
Rejected solutions
My first thought was to use the adjacency matrix implementation provided by BGL. But it doesn't allow dynamic graph. To resolve this constraint, two solutions: provide a new implementation of adjacency matrix for dynamic graph or use a pool of available vertices in a static graph. After reflection, I think it's not a good idea: the space complexity is still VxV/2.
Final Solution and question
So, here my final solution: don't use the BGL, implement bags of vertices to represent complete subgraphs (no need of edges) and directly connect vertices for the few single edges. By doing so, the space complexity of the biggest subgraph falls to its number of vertices, about 5000.
- Do you think this last solution is the good one?
- If not, which implementation could I use? And why?
Update 1
More information about the graph:
The graph has ~100k vertices, ~13k complete subgraphs of about 3 vertices, and ~100 complete subgraphs of size range [10-5000]. And each edge has bundled properties.
Update 2
I've recently learned thanks to Salim Jouilli that the bag of nodes is a candid approach of hypergraph where a hyperedge consists in a subset of nodes.
Update 3
I've finished to implement the solution. I've effectively gain in memory consumption and runtime: from 6GB to 24MB and from 50min to 2min30.
发布评论
评论(3)
一般来说,拥有 25M 条边并不是什么大问题。但我只在具有大量节点的相当稀疏的图(街道网络)上使用它。
如果内存使用和访问时间对您的需求变得至关重要,请尝试使用 boost 压缩稀疏图 http://www.boost.org/doc/libs/1_46_1/libs/graph/doc/compressed_sparse_row.html
有点使用起来很烦人,因为它需要以有序的方式插入边缘,但可能没有办法显着提高效率(最多提高几个百分点)。
Having 25M edges is no big deal in general. But I only used it on rather sparse graphs with lots of nodes also (a street network).
If the memory use and access time are getting critical for your need, try using boost compressed sparse graph http://www.boost.org/doc/libs/1_46_1/libs/graph/doc/compressed_sparse_row.html
It's a bit annoying to use as it requires the edges to be inserted in an ordered way, but there is probably no way to be significantly more effective (by a very few percent at most).
你的最终解决方案是我自己会采用的解决方案。听起来非常高效。
Your final solution is the one I would have gone for myself. It sounds as efficient as it can be.
如果您的问题将来会(再次)扩大规模,也许值得使用图形数据库< /a>.通过这种方式,您可以将存储和业务逻辑解耦,并将它们视为单独的问题。
If your problem will scale up (again) in the future, maybe it is worth the effort of using a graph database. This way you can decouple storage and business logic and treat them as separate problems.