NetworkX 与哪些可扩展性问题相关?

发布于 2024-12-13 01:30:14 字数 1259 浏览 3 评论 0原文

我对具有数百万个节点和数千万条边的大型网络的网络分析感兴趣。我希望能够解析多种格式的网络、查找连接的组件、检测社区以及运行 PageRank 等中心性度量。

我被 NetworkX 所吸引,因为它有一个很好的 API、良好的文档,并且多年来一直在积极开发。另外,因为它是用 python 编写的,所以开发起来应该很快。

在最近的演示文稿中(幻灯片可在 github 此处 上找到),据称:

与许多其他工具不同,NX 旨在大规模处理数据 与现代问题相关...NX 中的大多数核心算法都依赖于极快的遗留代码。

该演示文稿还指出 NetworkX 的基本算法是用 C/Fortran 实现的。

然而,从源代码来看,NetworkX 似乎主要是用 python 编写的。我对源代码不太熟悉,但我知道 NetworkX 使用 numpy 来完成繁重的工作(进而使用 C/Fortran 来完成线性代数)的几个示例。例如,文件 networkx/networkx/algorithms/centrality/eigenvector.py 使用 numpy 来计算特征向量。

有谁知道这种调用像 numpy 这样的优化库的策略是否真的在整个 NetworkX 中普遍存在,或者是否只有少数算法可以做到?还有人可以描述与 NetworkX 相关的其他可扩展性问题吗?

来自 NetworkX 首席程序员的回复 我在 NetworkX 邮件列表上提出了这个问题,Aric Hagberg 回复道:

NetworkX 中使用的数据结构适合扩展到 大问题(例如数据结构是邻接表)。这 算法具有各种缩放属性,但其中一些您可以 提及是可用的(例如PageRank,连接的组件,是线性的 边数的复杂性)。

此时 NetworkX 是纯 Python 代码。邻接结构 使用Python字典进行编码,提供了极大的灵活性 以牺牲内存和计算速度为代价。大图将 占用大量内存,最终会耗尽。

NetworkX 确实使用 NumPy 和 SciPy 来实现主要是 基于线性代数。在这种情况下,图形表示为 使用 NumPy 矩阵或 SciPy (复制)作为邻接矩阵 稀疏矩阵。这些算法可以受益于遗留的 C 和 在 NumPy 和 SciPY 中使用的 FORTRAN 代码。

I'm interested in network analysis on large networks with millions of nodes and tens of millions of edges. I want to be able to do things like parse networks from many formats, find connected components, detect communities, and run centrality measures like PageRank.

I am attracted to NetworkX because it has a nice api, good documentation, and has been under active development for years. Plus because it is in python, it should be quick to develop with.

In a recent presentation (the slides are available on github here), it was claimed that:

Unlike many other tools, NX is designed to handle data on a scale
relevant to modern problems...Most of the core algorithms in NX rely on extremely fast legacy code.

The presentation also states that the base algorithms of NetworkX are implemented in C/Fortran.

However, looking at the source code, it looks like NetworkX is mostly written in python. I am not too familiar with the source code, but I am aware of a couple of examples where NetworkX uses numpy to do heavy lifting (which in turn uses C/Fortran to do linear algebra). For example, the file networkx/networkx/algorithms/centrality/eigenvector.py uses numpy to calculate eigenvectors.

Does anyone know if this strategy of calling an optimized library like numpy is really prevalent throughout NetworkX, or if just a few algorithms do it? Also can anyone describe other scalability issues associated with NetworkX?

Reply from NetworkX Lead Programmer
I posed this question on the NetworkX mailing list, and Aric Hagberg replied:

The data structures used in NetworkX are appropriate for scaling to
large problems (e.g. the data structure is an adjacency list). The
algorithms have various scaling properties but some of the ones you
mention are usable (e.g. PageRank, connected components, are linear
complexity in the number of edges).

At this point NetworkX is pure Python code. The adjacency structure
is encoded with Python dictionaries which provides great flexibility
at the expense of memory and computational speed. Large graphs will
take a lot of memory and you will eventually run out.

NetworkX does use NumPy and SciPy for algorithms that are primarily
based on linear algebra. In that case the graph is represented
(copied) as an adjacency matrix using either NumPy matrices or SciPy
sparse matrices. Those algorithms can benefit from the legacy C and
FORTRAN code that is used under the hood in NumPy and SciPY.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

飘落散花 2024-12-20 01:30:15

你的大问题将是记忆。如果不跳过类实现中的障碍,Python 根本无法处理数千万个对象。许多对象的内存开销太高,达到 2GB,32 位代码就无法工作。有很多方法可以解决这个问题 - 使用槽、数组或 NumPy。它应该没问题,因为networkx是为了性能而编写的,但是如果有一些东西不起作用,我会检查你的内存使用情况。

至于缩放,算法基本上是图唯一重要的事情。如果图算法做错了,那么它们往往会出现非常的缩放,而在 Python 中,它们与任何其他语言一样有可能正确完成。

Your big issue will be memory. Python simply cannot handle tens of millions of objects without jumping through hoops in your class implementation. The memory overhead of many objects is too high, you hit 2GB, and 32-bit code won't work. There are ways around it - using slots, arrays, or NumPy. It should be OK because networkx was written for performance, but if there are a few things that don't work, I will check your memory usage.

As for scaling, algorithms are basically the only thing that matters with graphs. Graph algorithms tend to have really ugly scaling if they are done wrong, and they are just as likely to be done right in Python as any other language.

仅此而已 2024-12-20 01:30:15

我认为 GraphScope 可以很好地解决 NetworkX 的可扩展性问题,通过提供与 NetworkX 兼容的 Python 接口,用 C++ 实现的高效分布式运行时。也就是说,用户只需要通过几行代码修改他们的NetworkX应用程序,同时实现几个数量级的性能提升。简而言之,在 GraphScope 上运行 NetworkX 应用程序具有以下优点:

用户工作量最少:要使 NetworkX 应用程序在 GraphScope 上运行,用户只需替换导入网络使用import graphscope.nx as networkx,因为GraphScope的图形操作和数据加载接口与NetworkX完全兼容。例如,要在 NetworkX 中运行聚类算法,用户可以编写:

import os
import network

g = networkx.read_edgelist(
     os.path.expandvars('PATH_TO_DATASET')
)

result = networkx.clustering(g)

要在 GraphScope 上运行聚类算法,用户只需要修改一行特定代码。

import os
import graphscope.nx as networkx

g = networkx.read_edgelist(
     os.path.expandvars('PATH_TO_DATASET')
)

result = networkx.clustering(g)

高性能:GraphScope的运行时采用C++实现,效率很高。以上述聚类算法为例,在 GraphScope 上运行它比在 NetworkX 上运行快 29 倍以上。此外,GraphScope 允许以分布式方式运行 NetworkX 应用程序,从而实现高可扩展性。要在 K8s 集群上以分布式方式运行 NeteorkX 应用程序,用户需要将 import graphscope.nx as nwtworkx 替换为。

import graphscope
networkx = graphscope.session(num_workers=$NUM_WORKERS).nx()

有关如何在 GraphScope 上运行 NetworkX 应用程序的更多信息,请查看 分析使用 NetworkX 风格的 GraphScope 绘制图形

免责声明:我是 GraphScope 的作者。

I think GraphScope can address the scalability problem of NetworkX very well, by offering Python interfaces that are compatible with NetworkX and an efficient distributed runtime implemented in C++. That is, users only need to modify their NetworkX applications with a few lines of code, while achieving several orders of magnitude performance improvement. In a nutshell, running NetworkX applications on GraphScope has the following advantages:

Minimal user effort: To make NetworkX applications run on GraphScope, users only need to replace import network with import graphscope.nx as networkx, since the graph manipulation and data loading interfaces of GraphScope are fully compatible with NetworkX. For example, to run the clustering algorithm in NetworkX, user can write

import os
import network

g = networkx.read_edgelist(
     os.path.expandvars('PATH_TO_DATASET')
)

result = networkx.clustering(g)

To run the clustering algorithm on GraphScope, users only need to modify one specific line of code.

import os
import graphscope.nx as networkx

g = networkx.read_edgelist(
     os.path.expandvars('PATH_TO_DATASET')
)

result = networkx.clustering(g)

High performance: The runtime of GraphScope is implemented with C++ for high efficiency. Taking the above clustering algorithm as an example, running it on GraphScope is over 29x faster than running over NetworkX. Furthermore, GraphScope enables running NetworkX applications in a distributed manner, allowing for high scalability. To run NeteorkX applications in a distributed fashion on a K8s cluster, users need to replace import graphscope.nx as nwtworkx with

import graphscope
networkx = graphscope.session(num_workers=$NUM_WORKERS).nx()

For more information about how to run NetworkX applications on GraphScope, please check out Analyzing graph with GraphScope in the Style of NetworkX.

Disclaimer: I'm an author of GraphScope.

七分※倦醒 2024-12-20 01:30:15

事实上,networkX 大部分是用 python 编写的,但这并不意味着它不可扩展,也不意味着它是完美的。总是有一个权衡。如果你在“机器”上投入更多的钱,你将拥有你想要的可扩展性以及使用 pythonic 图形库的好处。

如果没有,还有其他解决方案(此处这里 ),这可能会消耗更少的内存(基准测试并看看,我认为 igraph 完全支持 C,所以它会),但你可能会错过 pythonic NX的感觉。

The fact that networkX is mostly written in python does not mean that it is not scalable, nor claims perfection. There is always a trade-off. If you throw more money on your "machines", you'll have as much scalability as you want plus the benefits of using a pythonic graph library.

If not, there are other solutions, ( here and here ), which may consume less memory ( benchmark and see, I think igraph is fully C backed so it will ), but you may miss the pythonic feel of NX.

卖梦商人 2024-12-20 01:30:14

这是一个老问题,但我认为值得一提的是 graph-tool 具有与NetworkX,但它是使用模板在 C++ 中实现的(使用 Boost Graph Library),因此速度更快(up两个数量级)并使用内存少得多。

免责声明:我是图表工具的作者。

This is an old question, but I think it is worth mentioning that graph-tool has a very similar functionality to NetworkX, but it is implemented in C++ with templates (using the Boost Graph Library), and hence is much faster (up to two orders of magnitude) and uses much less memory.

Disclaimer: I'm the author of graph-tool.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文