如何在 Python 中对图进行聚类?

发布于 2024-07-15 02:57:02 字数 1586 浏览 6 评论 0原文

设 G 是一个图。 所以G是一组节点和一组链接。 我需要找到一种快速的方法来划分图。 我现在正在处理的图表只有 120*160 个节点,但我可能很快就会在另一个环境(不是医学,而是网站开发)中处理具有数百万个节点的等效问题。

所以,我所做的就是将所有链接存储到图矩阵中:

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))

现在,如果节点 s 连接到节点 t,则 M 在位置 s,t 处保留 1。 我确保 M 是对称的 M[s,t]=M[t,s] 并且每个节点链接到自身 M[s,s]=1。

如果我没记错的话,如果我将 M 乘以 M,结果是一个矩阵,表示连接通过两个步骤到达的顶点的图。

所以我继续将M与其自身相乘,直到矩阵中零的数量不再减少。 现在我有了连接组件的列表。 现在我需要对这个矩阵进行聚类。

到目前为止我对这个算法非常满意。 我认为它简单、优雅并且相当快。 我在这部分遇到麻烦。

本质上,我需要将该图拆分为其连接的组件。

我可以遍历所有节点,看看它们连接到什么。

但是如何对矩阵进行排序并重新排序行呢? 但我不知道是否可以做到。

以下是到目前为止的代码:

def findzeros(M):
    nZeros=0
    for t in M.flat:
        if not t:
            nZeros+=1
    return nZeros

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))    
for s in data.keys():
    MatrixCells[s,s]=1
    for t in data.keys():
        if t<s:
            if (scipy.corrcoef(data[t],data[s])[0,1])>threashold:
                M[s,t]=1
                M[t,s]=1

nZeros=findzeros(M)
M2=M*M
nZeros2=findzeros(M2)

while (nZeros-nZeros2):
    nZeros=nZeros2
    M=M2
    M2=M*M
    nZeros2=findzeros(M2)

编辑:

有人建议我使用 SVD 分解。 这是 5x5 图上问题的一个简单示例。 我们将使用它,因为使用 19200x19200 方阵不太容易看到簇。

import numpy
import scipy

M=numpy.mat(numpy.zeros((5,5)))

M[1,3]=1
M[3,1]=1
M[1,1]=1
M[2,2]=1
M[3,3]=1
M[4,4]=1
M[0,0]=1

print M

u,s,vh = numpy.linalg.linalg.svd(M)
print u
print s
print vh

这里本质上有 4 个簇:(0),(1,3),(2),(4) 但我仍然不明白 svn 在这种情况下有何帮助。

Let G be a graph. So G is a set of nodes and set of links. I need to find a fast way to partition the graph. The graph I am now working has only 120*160 nodes, but I might soon be working on an equivalent problem, in another context (not medicine, but website development), with millions of nodes.

So, what I did was to store all the links into a graph matrix:

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))

Now M holds a 1 in position s,t, if node s is connected to node t. I make sure M is symmetrical M[s,t]=M[t,s] and each node links to itself M[s,s]=1.

If I remember well if I multiply M with M, the results is a matrix that represents the graph that connects vertexes that are reached on through two steps.

So I keep on multplying M with itself, until the number of zeros in the matrix do not decrease any longer. Now I have the list of the connected components.
And now I need to cluster this matrix.

Up to now I am pretty satisfied with the algorithm. I think it is easy, elegant, and reasonably fast. I am having trouble with this part.

Essentially I need to split this graph into its connected components.

I can go through all the nodes, and see what are they connected to.

But what about sorting the matrix reordering the lines. But I don't know if it is possible to do it.

What follows is the code so far:

def findzeros(M):
    nZeros=0
    for t in M.flat:
        if not t:
            nZeros+=1
    return nZeros

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))    
for s in data.keys():
    MatrixCells[s,s]=1
    for t in data.keys():
        if t<s:
            if (scipy.corrcoef(data[t],data[s])[0,1])>threashold:
                M[s,t]=1
                M[t,s]=1

nZeros=findzeros(M)
M2=M*M
nZeros2=findzeros(M2)

while (nZeros-nZeros2):
    nZeros=nZeros2
    M=M2
    M2=M*M
    nZeros2=findzeros(M2)

Edit:

It has been suggested that I use SVD decomposition. Here is a simple example of the problem on a 5x5 graph. We shall use this since with the 19200x19200 square matrix is not that easy to see the clusters.

import numpy
import scipy

M=numpy.mat(numpy.zeros((5,5)))

M[1,3]=1
M[3,1]=1
M[1,1]=1
M[2,2]=1
M[3,3]=1
M[4,4]=1
M[0,0]=1

print M

u,s,vh = numpy.linalg.linalg.svd(M)
print u
print s
print vh

Essentially there are 4 clusters here: (0),(1,3),(2),(4)
But I still don't see how the svn can help in this context.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

鹿! 2024-07-22 02:57:02

为什么不使用真正的图形库,例如 Python-Graph? 它有一个 函数来确定连接的组件(尽管没有提供示例)。 我想一个专用的库将会比你编写的任何临时图形代码更快。

编辑: NetworkX 似乎它可能是比 python-graph 更好的选择; 它的 文档(此处用于连接组件功能) 当然是。

Why not use a real graph library, like Python-Graph? It has a function to determine connected components (though no example is provided). I'd imagine a dedicated library is going to be faster than whatever ad-hoc graph code you've cooked up.

EDIT: NetworkX seems like it might be a better choice than python-graph; its documentation (here for the connected components function) certainly is.

唱一曲作罢 2024-07-22 02:57:02

在 SciPy 中,您可以使用稀疏矩阵。 另请注意,还有更有效的矩阵乘法方法。 不管怎样,你想要做的事情可以通过 SVD 分解来完成。

带有有用链接的介绍

In SciPy you can use sparse matrices. Also note, that there are more efficient ways of multiplying matrix by itself. Anyway, what you're trying to do can by done by SVD decomposition.

Introduction with useful links.

甲如呢乙后呢 2024-07-22 02:57:02

还有 graph_toolnetworkit 具有用于连接组件的高效例程,并且都有效地存储网络。 如果您要使用数百万个节点,networkx 可能不够(据我所知,它是纯 python)。 这两个工具都是用 C++ 编写的,因此可以以合理的运行时间处理大型图的分析。

正如 Phil 指出的那样,您的方法对于大型图的计算时间将非常长(我们谈论的是几天、几周、几个月......),并且您对一百万个节点的图的表示将需要大约一百万 GB 的内存!

There's also graph_tool and networkit that have efficient routines for connected components, and both store the network efficiently. If you're going to work with millions of nodes, networkx will likely not be sufficient (it's pure python afaik). Both those tools are written in C++ so can handle analysis of large graphs with reasonable run times.

As Phil points out, your method will have horribly long compute times for large graphs (we're talking days, weeks, months...), and your representation for a graph of a million nodes will need something like a million gigabytes of memory!

破晓 2024-07-22 02:57:02

寻找最佳图划分是一个 NP 难题,因此无论采用什么算法,它都将是近似法或启发式算法。 毫不奇怪,不同的聚类算法会产生(非常)不同的结果。

纽曼模块化算法的Python实现:
模块化

另外:MCL, MCODE, CFinder, NeMoclusterONE

Finding an optimal graph partition is an NP-hard problem, so whatever the algorithm, it is going to be an approximation or a heuristic. Not surprisingly, different clustering algorithms produce (wildly) different results.

Python implementation of Newman's modularity algorithm:
modularity

Also: MCL, MCODE, CFinder, NeMo, clusterONE

白况 2024-07-22 02:57:02

这是一些简单的实现,它使用我前段时间写的深度优先搜索找到连接的组件。 虽然它非常简单,但它可以很好地扩展到数以万计的顶点和边......


import sys
from operator import gt, lt

class Graph(object):
    def __init__(self):
        self.nodes = set()
        self.edges = {}
        self.cluster_lookup = {}
        self.no_link = {}

    def add_edge(self, n1, n2, w):
        self.nodes.add(n1)
        self.nodes.add(n2)
        self.edges.setdefault(n1, {}).update({n2: w})
        self.edges.setdefault(n2, {}).update({n1: w})

    def connected_components(self, threshold=0.9, op=lt):
        nodes = set(self.nodes)
        components, visited = [], set()
        while len(nodes) > 0:
            connected, visited = self.dfs(nodes.pop(), visited, threshold, op)
            connected = set(connected)
            for node in connected:
                if node in nodes:
                    nodes.remove(node)

            subgraph = Graph()
            subgraph.nodes = connected
            subgraph.no_link = self.no_link
            for s in subgraph.nodes:
                for k, v in self.edges.get(s, {}).iteritems():
                    if k in subgraph.nodes:
                        subgraph.edges.setdefault(s, {}).update({k: v})
                if s in self.cluster_lookup:
                    subgraph.cluster_lookup[s] = self.cluster_lookup[s]

            components.append(subgraph)
        return components

    def dfs(self, v, visited, threshold, op=lt, first=None):
        aux = [v]
        visited.add(v)
        if first is None:
            first = v
        for i in (n for n, w in self.edges.get(v, {}).iteritems()
                  if op(w, threshold) and n not in visited):
            x, y = self.dfs(i, visited, threshold, op, first)
            aux.extend(x)
            visited = visited.union(y)
        return aux, visited

def main(args):
    graph = Graph()
    # first component
    graph.add_edge(0, 1, 1.0)
    graph.add_edge(1, 2, 1.0)
    graph.add_edge(2, 0, 1.0)

    # second component
    graph.add_edge(3, 4, 1.0)
    graph.add_edge(4, 5, 1.0)
    graph.add_edge(5, 3, 1.0)

    first, second = graph.connected_components(op=gt)
    print first.nodes
    print second.nodes

if __name__ == '__main__':
    main(sys.argv)

Here's some naive implementation, which finds the connected components using depth first search, i wrote some time ago. Although it's very simple, it scales well to ten thousands of vertices and edges...


import sys
from operator import gt, lt

class Graph(object):
    def __init__(self):
        self.nodes = set()
        self.edges = {}
        self.cluster_lookup = {}
        self.no_link = {}

    def add_edge(self, n1, n2, w):
        self.nodes.add(n1)
        self.nodes.add(n2)
        self.edges.setdefault(n1, {}).update({n2: w})
        self.edges.setdefault(n2, {}).update({n1: w})

    def connected_components(self, threshold=0.9, op=lt):
        nodes = set(self.nodes)
        components, visited = [], set()
        while len(nodes) > 0:
            connected, visited = self.dfs(nodes.pop(), visited, threshold, op)
            connected = set(connected)
            for node in connected:
                if node in nodes:
                    nodes.remove(node)

            subgraph = Graph()
            subgraph.nodes = connected
            subgraph.no_link = self.no_link
            for s in subgraph.nodes:
                for k, v in self.edges.get(s, {}).iteritems():
                    if k in subgraph.nodes:
                        subgraph.edges.setdefault(s, {}).update({k: v})
                if s in self.cluster_lookup:
                    subgraph.cluster_lookup[s] = self.cluster_lookup[s]

            components.append(subgraph)
        return components

    def dfs(self, v, visited, threshold, op=lt, first=None):
        aux = [v]
        visited.add(v)
        if first is None:
            first = v
        for i in (n for n, w in self.edges.get(v, {}).iteritems()
                  if op(w, threshold) and n not in visited):
            x, y = self.dfs(i, visited, threshold, op, first)
            aux.extend(x)
            visited = visited.union(y)
        return aux, visited

def main(args):
    graph = Graph()
    # first component
    graph.add_edge(0, 1, 1.0)
    graph.add_edge(1, 2, 1.0)
    graph.add_edge(2, 0, 1.0)

    # second component
    graph.add_edge(3, 4, 1.0)
    graph.add_edge(4, 5, 1.0)
    graph.add_edge(5, 3, 1.0)

    first, second = graph.connected_components(op=gt)
    print first.nodes
    print second.nodes

if __name__ == '__main__':
    main(sys.argv)
一口甜 2024-07-22 02:57:02

正如其他人指出的那样,无需重新发明轮子。 人们对最优聚类技术投入了大量的思考。 这里是一个著名的聚类程序。

As others have pointed out, no need to reinvent the wheel. A lot of thought has been put into optimal clustering techniques. Here is one well-known clustering program.

未蓝澄海的烟 2024-07-22 02:57:02

看起来有一个库PyMetis,它会为你分区你的图表,给出一个列表链接。 通过将链接节点的原始列表(而不是矩阵乘法派生的)传递给它,从图中提取链接列表应该相当容易。

对于 M 阶数较大的情况,重复执行 M' = MM 效率不高。对于 N 阶矩阵的完整矩阵乘法将花费每个元素 N 次乘法和 N-1 次加法,其中有 N2,即 O(N3) 次操作。 如果您将其扩展到“数百万个节点”,则每个矩阵-矩阵乘法将执行 O(1018) 次运算,您需要执行其中多次。

简而言之,你不想这样做。 Vartec 的 SVD 建议 将是唯一合适的选择。 您最好的选择是只使用 PyMetis,而不是尝试重新发明图形分区。

Looks like there is a library PyMetis, which will partition your graph for you, given a list of links. It should be fairly easy to extract the list of links from your graph by passing it your original list of linked nodes (not the matrix-multiply-derived one).

Repeatedly performing M' = MM will not be efficient for large orders of M. A full matrix-multiply for matrices of order N will cost N multiplications and N-1 additions per element, of which there are N2, that is O(N3) operations. If you are scaling that to "millions of nodes", that would be O(1018) operations per matrix-matrix multiplication, of which you want to do several.

In short, you don't want to do it this way. The SVD suggestion from Vartec would be the only appropriate choice there. Your best option is just to use PyMetis, and not try to reinvent graph-partitioning.

つ可否回来 2024-07-22 02:57:02

SVD 算法在这里不适用,但在其他方面 Phil H 是正确的。

The SVD algorithm is not applicable here, but otherwise Phil H is correct.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文