Networkx: how to create an incidence matrix from a huge graph

Posted on 2025-01-10 00:20:03


I am reading a very big graph, specifically wiki-topcats (http://snap.stanford.edu/data/wiki-topcats.html), and while I can read and create the graph in acceptable time with:

graph = nx.read_edgelist("C:/.../wiki-topcats.txt", nodetype=int)

Then I need to extract the incidence matrix in order to create the line graph (that's my goal), but when I run:

C = nx.incidence_matrix(graph)

I am getting memory errors, and I cannot find any efficient way of dealing with this, or maybe a way to work around it by creating the incidence matrix from the adjacency matrix. I must work with scipy matrices due to the magnitude of the graph; for example, this adjacency matrix works:

A = nx.to_scipy_sparse_matrix(graph)

Any help would be greatly appreciated.


心病无药医 2025-01-17 00:20:03


NetworkX is clearly not designed to compute big graphs. It is not very performant either (in both memory consumption and speed). Graph databases are meant to solve this problem by storing graphs on a storage device; Neo4j is an example of such a database usable from Python. That being said, the graph you want to compute is not that big, and you could try a more efficient alternative Python library such as graph-tool.


Assuming you really want/need to use NetworkX, be aware that the graph takes about 12 GiB of RAM, and many NetworkX functions will return a full graph, resulting in several dozen GiB of RAM being allocated, which is a complete waste of resources. Not to mention that most machines do not have that much memory. The graph file takes only about 400 MiB, and a compact representation should be able to store it in RAM in less than 100 MiB. Thus, using 12 GiB for it requires >120 times more resources than expected. Not to mention that NetworkX takes a very long time to fill such a memory space.

Loading the file directly into a sparse matrix is much more efficient, assuming the sparse matrix type is carefully chosen. Indeed, a CSR matrix can store the adjacency matrix pretty efficiently, while a DOK matrix is very inefficient (it takes 6 GiB of RAM and is as slow as NetworkX).

Note that np.genfromtxt (which is intended to load this kind of file) goes crazy, using 45-50 GiB of RAM on my machine. This is insane, as it is actually ~200 times bigger than what is strictly needed! Fortunately, df = pd.read_table('wiki-topcats.txt', dtype=np.int32, header=None, sep=' ') from Pandas is able to load the file using less than 400 MiB. Based on that, you can easily convert the dataframe to Numpy with edges = df.to_numpy() and then build a relatively fast CSR matrix.
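
A minimal sketch of that loading path (assuming, as above, that wiki-topcats.txt sits in the working directory as a space-separated two-column edge list with 0-based integer node IDs; the names n_nodes and adjacency are only illustrative):

import numpy as np
import pandas as pd
import scipy.sparse as sp

# Load the edge list as two int32 columns (path and separator assumed as above).
df = pd.read_table('wiki-topcats.txt', dtype=np.int32, header=None, sep=' ')
edges = df.to_numpy()                      # shape (n_edges, 2)

n_nodes = int(edges.max()) + 1             # assumes 0-based consecutive node IDs
data = np.ones(len(edges), dtype=np.int8)  # one entry per edge

# Build a COO matrix from (row, col, value) triplets and convert it to CSR;
# duplicate entries, if any, are summed during the conversion.
adjacency = sp.coo_matrix((data, (edges[:, 0], edges[:, 1])),
                          shape=(n_nodes, n_nodes)).tocsr()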

Note that you can build the (sparse CSR) incidence matrix directly with Numpy. One solution is to use idx = np.arange(len(df), dtype=np.int32) to assign an ID to each edge. Then, you directly know the positions of all the 1s in the sparse matrix (m[edges[:,0], idx] and m[edges[:,1], idx]), which should be enough to build it (be careful about possible duplicates caused by a node connected to itself).
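
A possible sketch of that construction, reusing edges and n_nodes from the previous snippet; the clipping step at the end is one way to handle the self-loop duplicates mentioned above:

import numpy as np
import scipy.sparse as sp

idx = np.arange(len(edges), dtype=np.int32)   # one column (ID) per edge

# Every edge contributes a 1 at (source, edge_id) and a 1 at (target, edge_id).
rows = np.concatenate([edges[:, 0], edges[:, 1]])
cols = np.concatenate([idx, idx])
data = np.ones(len(rows), dtype=np.int8)

# COO -> CSR sums duplicate (row, col) pairs, so a self-loop would produce a 2;
# clipping the stored values back to 1 keeps this a plain 0/1 incidence matrix.
m = sp.coo_matrix((data, (rows, cols)), shape=(n_nodes, len(edges))).tocsr()
m.data = np.minimum(m.data, 1)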
