Python, SciPy: Building triplets using a large adjacency matrix
I am using an adjacency matrix to represent a network of friends, which can be visually interpreted as

        Mary  Joe  Bob  Susan
Mary       0    1    1      1
Joe        1    0    1      1
Bob        1    1    0      1
Susan      1    1    1      0
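For readers who want to reproduce the example, here is a minimal sketch of the same table as a SciPy sparse matrix. The variable names (names, dense, G, friends_of_mary) are illustrative and not part of the original post.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative names; index 0 = Mary, 1 = Joe, 2 = Bob, 3 = Susan.
names = ["Mary", "Joe", "Bob", "Susan"]

# Dense 0/1 adjacency table from the example (row = user, column = friend).
dense = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
])

# CSR stores only the non-zero entries, which is what makes huge,
# mostly empty friendship matrices fit in memory.
G = csr_matrix(dense)

# In CSR form, the friends of user i are the column indices
# G.indices[G.indptr[i]:G.indptr[i + 1]].
friends_of_mary = [names[j] for j in G.indices[G.indptr[0]:G.indptr[1]]]
print(friends_of_mary)
```

Reading a row through indptr and indices avoids building a new sparse object per row, which tends to matter once the matrix has millions of rows.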
Using this matrix, I want to compile a list of all possible friendship triangles with the condition that user 1 is friends with user 2, and user 2 is friends with user 3. For my list, it is not required that user 1 is friends with user 3.
(joe, mary, bob)
(joe, mary, susan)
(bob, mary, susan)
(bob, joe, susan)
I have a bit of code that works well with small triangles, but I need it to scale for very large sparse matrices.
import time

def buildTriangles(G):
    # G is a sparse adjacency matrix (e.g. a scipy.sparse csr_matrix)
    start = time.time()
    ctr = 0
    G = G + G.T  # I do this to make sure it is symmetric
    triples = []
    for i in range(G.shape[0] - 1):  # for each row but the last one
        _, J = G[i, :].nonzero()  # J: primary friends of user i
        # I discard the first array because I do not care about the row values
        J = J[J < i]  # only compute the lower triangle to avoid repetition
        for j in J:
            K, _ = G[:, j].nonzero()  # K: secondary friends of user i
            K = K[K > i]  # only compute below i to avoid repetition
            for k in K:
                ctr = ctr + 1
                triples.append((i, j, k))
    print("total number of triples: %d" % ctr)
    print("run time is %.2f" % (time.time() - start))
    return triples
I was able to run the code on a csr_matrix in approximately 21 minutes. The matrix was 1032570 x 1032570 and contained 88910 stored elements. There were a total of 2178893 triplets generated.
I need to be able to do something similar with a 1968654 x 1968654 sparse matrix with 9428596 stored elements.
I'm very new to Python (a little less than a month of experience) and not the greatest at linear algebra, which is why my code does not take advantage of matrix operations.
Can anyone make any suggestions for improvement or let me know if my objective is even realistic?
Comments (2)
I think you can find the triangles by looking at one row (or column) at a time. For example:
Susan 1 1 1 0
This means Mary, Joe, and Bob are all friends of Susan, so use combinations to choose two people from [Mary, Joe, Bob]; combining each pair with Susan gives one triangle. itertools.combinations() does this quickly.
Here is the code:
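The answer's own code is not preserved on this page. What follows is only a minimal sketch of the idea it describes, assuming G is a symmetric 0/1 scipy.sparse matrix. The function name triples_by_center is hypothetical. This version lists every pair of friends around each centre, so on the four-person example it yields 12 triples, not the 4 produced by the question's ordering rule.

```python
from itertools import combinations

import numpy as np
from scipy.sparse import csr_matrix


def triples_by_center(G):
    """Yield (a, center, b) with a < b, where a and b are both friends of center.

    Hypothetical helper: assumes G is a symmetric 0/1 adjacency matrix.
    """
    G = csr_matrix(G, copy=True)  # copy so the caller's matrix is untouched
    G.eliminate_zeros()           # explicitly stored zeros are not friendships
    G.sort_indices()              # ascending neighbour list in every row
    for center in range(G.shape[0]):
        friends = G.indices[G.indptr[center]:G.indptr[center + 1]]
        friends = friends[friends != center]  # ignore self-loops, if any
        # Every unordered pair of friends forms one triple around this centre.
        for a, b in combinations(friends.tolist(), 2):
            yield (a, center, b)


example = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
])
all_triples = list(triples_by_center(example))
print(len(all_triples))
```

Because the function is a generator, the triples can be written to disk or counted as they are produced, without holding the full list in memory.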

Here are some suggestions for optimization:
Don't increment in a loop; it's terribly slow. Just ctr += K.shape[0] will do. Then, eliminate the most deeply nested loop altogether by replacing the append with a single call that adds every triple for K at once.
Now, if you want real performance on this task, you will have to get into some linear algebra. "I want to compile a list of all possible friendship triangles" means that you want to square the adjacency matrix, which you can do with a simple **2.
Then realize that 1,968,654² means a very big matrix, and even though it's very sparse, its square will be much less so and will take a lot of memory. (I once tackled a similar problem where I considered links between Wikipedia articles at distance two, which took 20 minutes to solve, on a supercomputer cluster node, in C++. This is not a trivial problem. The Wikipedia adjacency matrix was a few orders of magnitude denser, though.)
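The suggestions above can be sketched roughly as follows. This is not the answerer's code: the function names are hypothetical, and the input is assumed to be a 0/1 scipy.sparse adjacency matrix with no self-loops. The sketch uses the @ operator for the square, because ** is a matrix power only for the scipy.sparse matrix classes such as csr_matrix; NumPy arrays and SciPy's sparse array classes treat ** element-wise.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_triples_bulk(G):
    """Question's ordering rule (j < i < k), with one extend per (i, j) pair."""
    G = csr_matrix(G)
    G = csr_matrix(G + G.T)  # symmetrise, as the question does
    indptr, indices = G.indptr, G.indices
    triples, ctr = [], 0
    for i in range(G.shape[0] - 1):
        row_i = indices[indptr[i]:indptr[i + 1]]      # friends of i
        for j in row_i[row_i < i]:
            row_j = indices[indptr[j]:indptr[j + 1]]  # friends of j
            K = row_j[row_j > i]
            ctr += K.shape[0]                         # no per-item increment
            triples.extend((i, int(j), int(k)) for k in K)
    return triples, ctr


def two_step_counts(G):
    """P[i, k] = number of users j who are friends with both i and k."""
    A = csr_matrix(G)
    return A @ A  # sparse matrix product, i.e. the "square" of the matrix


def count_center_pairs(G):
    """Number of (a, center, b) triples with a < b, without building them.

    Sums C(degree, 2) over all users. Assumes a symmetric 0/1 matrix with no
    self-loops and no explicitly stored zeros.
    """
    deg = np.diff(csr_matrix(G).indptr).astype(np.int64)
    return int((deg * (deg - 1) // 2).sum())


example = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
])
triples, ctr = build_triples_bulk(example)
P = two_step_counts(example).toarray()
n_pairs = count_center_pairs(example)
```

The squared matrix only gives counts of shared friends per pair, not the triples themselves, so it is most useful for sizing the problem. The degree-based count gives the size of the full output cheaply, before deciding whether materialising the list is realistic at all.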