如何在 Python 中对图进行聚类？

发布于 2024-07-15 02:57:02 字数 1586 浏览 6 评论 0原文

设 G 是一个图。所以G是一组节点和一组链接。我需要找到一种快速的方法来划分图。我现在正在处理的图表只有 120*160 个节点，但我可能很快就会在另一个环境（不是医学，而是网站开发）中处理具有数百万个节点的等效问题。

所以，我所做的就是将所有链接存储到图矩阵中：

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))

现在，如果节点 s 连接到节点 t，则 M 在位置 s,t 处保留 1。我确保 M 是对称的 M[s,t]=M[t,s] 并且每个节点链接到自身 M[s,s]=1。

如果我没记错的话，如果我将 M 乘以 M，结果是一个矩阵，表示连接通过两个步骤到达的顶点的图。

所以我继续将M与其自身相乘，直到矩阵中零的数量不再减少。现在我有了连接组件的列表。现在我需要对这个矩阵进行聚类。

到目前为止我对这个算法非常满意。我认为它简单、优雅并且相当快。我在这部分遇到麻烦。

本质上，我需要将该图拆分为其连接的组件。

我可以遍历所有节点，看看它们连接到什么。

但是如何对矩阵进行排序并重新排序行呢？但我不知道是否可以做到。

以下是到目前为止的代码：

def findzeros(M):
    nZeros=0
    for t in M.flat:
        if not t:
            nZeros+=1
    return nZeros

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))    
for s in data.keys():
    MatrixCells[s,s]=1
    for t in data.keys():
        if t<s:
            if (scipy.corrcoef(data[t],data[s])[0,1])>threashold:
                M[s,t]=1
                M[t,s]=1

nZeros=findzeros(M)
M2=M*M
nZeros2=findzeros(M2)

while (nZeros-nZeros2):
    nZeros=nZeros2
    M=M2
    M2=M*M
    nZeros2=findzeros(M2)

编辑：

有人建议我使用 SVD 分解。这是 5x5 图上问题的一个简单示例。我们将使用它，因为使用 19200x19200 方阵不太容易看到簇。

import numpy
import scipy

M=numpy.mat(numpy.zeros((5,5)))

M[1,3]=1
M[3,1]=1
M[1,1]=1
M[2,2]=1
M[3,3]=1
M[4,4]=1
M[0,0]=1

print M

u,s,vh = numpy.linalg.linalg.svd(M)
print u
print s
print vh

这里本质上有 4 个簇：(0),(1,3),(2),(4) 但我仍然不明白 svn 在这种情况下有何帮助。

原文

Let G be a graph. So G is a set of nodes and set of links. I need to find a fast way to partition the graph. The graph I am now working has only 120*160 nodes, but I might soon be working on an equivalent problem, in another context (not medicine, but website development), with millions of nodes.

So, what I did was to store all the links into a graph matrix:

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))

Now M holds a 1 in position s,t, if node s is connected to node t. I make sure M is symmetrical M[s,t]=M[t,s] and each node links to itself M[s,s]=1.

If I remember well if I multiply M with M, the results is a matrix that represents the graph that connects vertexes that are reached on through two steps.

So I keep on multplying M with itself, until the number of zeros in the matrix do not decrease any longer. Now I have the list of the connected components.
And now I need to cluster this matrix.

Up to now I am pretty satisfied with the algorithm. I think it is easy, elegant, and reasonably fast. I am having trouble with this part.

Essentially I need to split this graph into its connected components.

I can go through all the nodes, and see what are they connected to.

But what about sorting the matrix reordering the lines. But I don't know if it is possible to do it.

What follows is the code so far:

def findzeros(M):
    nZeros=0
    for t in M.flat:
        if not t:
            nZeros+=1
    return nZeros

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))    
for s in data.keys():
    MatrixCells[s,s]=1
    for t in data.keys():
        if t<s:
            if (scipy.corrcoef(data[t],data[s])[0,1])>threashold:
                M[s,t]=1
                M[t,s]=1

nZeros=findzeros(M)
M2=M*M
nZeros2=findzeros(M2)

while (nZeros-nZeros2):
    nZeros=nZeros2
    M=M2
    M2=M*M
    nZeros2=findzeros(M2)

Edit:

It has been suggested that I use SVD decomposition. Here is a simple example of the problem on a 5x5 graph. We shall use this since with the 19200x19200 square matrix is not that easy to see the clusters.

import numpy
import scipy

M=numpy.mat(numpy.zeros((5,5)))

M[1,3]=1
M[3,1]=1
M[1,1]=1
M[2,2]=1
M[3,3]=1
M[4,4]=1
M[0,0]=1

print M

u,s,vh = numpy.linalg.linalg.svd(M)
print u
print s
print vh

Essentially there are 4 clusters here: (0),(1,3),(2),(4)
But I still don't see how the svn can help in this context.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

鹿! 2024-07-22 02:57:02

为什么不使用真正的图形库，例如 Python-Graph？它有一个函数来确定连接的组件（尽管没有提供示例）。我想一个专用的库将会比你编写的任何临时图形代码更快。

编辑： NetworkX 似乎它可能是比 python-graph 更好的选择；它的文档（此处用于连接组件功能）当然是。

回复收藏 0 原文

唱一曲作罢 2024-07-22 02:57:02

在 SciPy 中，您可以使用稀疏矩阵。另请注意，还有更有效的矩阵乘法方法。不管怎样，你想要做的事情可以通过 SVD 分解来完成。

带有有用链接的介绍。

回复收藏 0 原文

甲如呢乙后呢 2024-07-22 02:57:02

还有 graph_tool 和 networkit 具有用于连接组件的高效例程，并且都有效地存储网络。如果您要使用数百万个节点，networkx 可能不够（据我所知，它是纯 python）。这两个工具都是用 C++ 编写的，因此可以以合理的运行时间处理大型图的分析。

正如 Phil 指出的那样，您的方法对于大型图的计算时间将非常长（我们谈论的是几天、几周、几个月......），并且您对一百万个节点的图的表示将需要大约一百万 GB 的内存！

回复收藏 0 原文

破晓 2024-07-22 02:57:02

寻找最佳图划分是一个 NP 难题，因此无论采用什么算法，它都将是近似法或启发式算法。毫不奇怪，不同的聚类算法会产生（非常）不同的结果。

纽曼模块化算法的Python实现：
模块化

另外：MCL, MCODE, CFinder, NeMo、clusterONE

回复收藏 0 原文

白况 2024-07-22 02:57:02

这是一些简单的实现，它使用我前段时间写的深度优先搜索找到连接的组件。虽然它非常简单，但它可以很好地扩展到数以万计的顶点和边......


import sys
from operator import gt, lt

class Graph(object):
    def __init__(self):
        self.nodes = set()
        self.edges = {}
        self.cluster_lookup = {}
        self.no_link = {}

    def add_edge(self, n1, n2, w):
        self.nodes.add(n1)
        self.nodes.add(n2)
        self.edges.setdefault(n1, {}).update({n2: w})
        self.edges.setdefault(n2, {}).update({n1: w})

    def connected_components(self, threshold=0.9, op=lt):
        nodes = set(self.nodes)
        components, visited = [], set()
        while len(nodes) > 0:
            connected, visited = self.dfs(nodes.pop(), visited, threshold, op)
            connected = set(connected)
            for node in connected:
                if node in nodes:
                    nodes.remove(node)

            subgraph = Graph()
            subgraph.nodes = connected
            subgraph.no_link = self.no_link
            for s in subgraph.nodes:
                for k, v in self.edges.get(s, {}).iteritems():
                    if k in subgraph.nodes:
                        subgraph.edges.setdefault(s, {}).update({k: v})
                if s in self.cluster_lookup:
                    subgraph.cluster_lookup[s] = self.cluster_lookup[s]

            components.append(subgraph)
        return components

    def dfs(self, v, visited, threshold, op=lt, first=None):
        aux = [v]
        visited.add(v)
        if first is None:
            first = v
        for i in (n for n, w in self.edges.get(v, {}).iteritems()
                  if op(w, threshold) and n not in visited):
            x, y = self.dfs(i, visited, threshold, op, first)
            aux.extend(x)
            visited = visited.union(y)
        return aux, visited

def main(args):
    graph = Graph()
    # first component
    graph.add_edge(0, 1, 1.0)
    graph.add_edge(1, 2, 1.0)
    graph.add_edge(2, 0, 1.0)

    # second component
    graph.add_edge(3, 4, 1.0)
    graph.add_edge(4, 5, 1.0)
    graph.add_edge(5, 3, 1.0)

    first, second = graph.connected_components(op=gt)
    print first.nodes
    print second.nodes

if __name__ == '__main__':
    main(sys.argv)

Here's some naive implementation, which finds the connected components using depth first search, i wrote some time ago. Although it's very simple, it scales well to ten thousands of vertices and edges...


import sys
from operator import gt, lt

class Graph(object):
    def __init__(self):
        self.nodes = set()
        self.edges = {}
        self.cluster_lookup = {}
        self.no_link = {}

    def add_edge(self, n1, n2, w):
        self.nodes.add(n1)
        self.nodes.add(n2)
        self.edges.setdefault(n1, {}).update({n2: w})
        self.edges.setdefault(n2, {}).update({n1: w})

    def connected_components(self, threshold=0.9, op=lt):
        nodes = set(self.nodes)
        components, visited = [], set()
        while len(nodes) > 0:
            connected, visited = self.dfs(nodes.pop(), visited, threshold, op)
            connected = set(connected)
            for node in connected:
                if node in nodes:
                    nodes.remove(node)

            subgraph = Graph()
            subgraph.nodes = connected
            subgraph.no_link = self.no_link
            for s in subgraph.nodes:
                for k, v in self.edges.get(s, {}).iteritems():
                    if k in subgraph.nodes:
                        subgraph.edges.setdefault(s, {}).update({k: v})
                if s in self.cluster_lookup:
                    subgraph.cluster_lookup[s] = self.cluster_lookup[s]

            components.append(subgraph)
        return components

    def dfs(self, v, visited, threshold, op=lt, first=None):
        aux = [v]
        visited.add(v)
        if first is None:
            first = v
        for i in (n for n, w in self.edges.get(v, {}).iteritems()
                  if op(w, threshold) and n not in visited):
            x, y = self.dfs(i, visited, threshold, op, first)
            aux.extend(x)
            visited = visited.union(y)
        return aux, visited

def main(args):
    graph = Graph()
    # first component
    graph.add_edge(0, 1, 1.0)
    graph.add_edge(1, 2, 1.0)
    graph.add_edge(2, 0, 1.0)

    # second component
    graph.add_edge(3, 4, 1.0)
    graph.add_edge(4, 5, 1.0)
    graph.add_edge(5, 3, 1.0)

    first, second = graph.connected_components(op=gt)
    print first.nodes
    print second.nodes

if __name__ == '__main__':
    main(sys.argv)

回复收藏 0 原文