优化邻接矩阵计算

发布于 2024-12-25 23:42:07 字数 549 浏览 1 评论 0原文

X 是一个文本文件，其中包含100000 相等大小（500 个元素）的位向量（即每行都是 500 个元素的向量）。我使用下面的代码生成一个邻接矩阵（100000 X 100000），但它没有优化并且非常耗时。我该如何改善这一点。

import numpy as np
import scipy.spatial.distance


 readFrom = "vector.txt"
 fout = open("adjacencymatrix.txt","a")

 X = np.genfromtxt(readFrom, dtype=None) 

 for outer in range(0,100000):
    for inner in range(0,100000):
        dis = scipy.spatial.distance.euclidean(X[outer],X[inner])
        tmp += str(dis)+" "
    tmp += "\n"        
    fout.write(tmp)
 fout.close()

谢谢。

原文

X is a text file that contains 100000 equal size (500 elements) bit vector (i.e. each row is a vector of 500 elements). I am generating an adjacency matrix (100000 X 100000) using the code below, but its not optimized and very time consuming. How can I improve that.

import numpy as np
import scipy.spatial.distance


 readFrom = "vector.txt"
 fout = open("adjacencymatrix.txt","a")

 X = np.genfromtxt(readFrom, dtype=None) 

 for outer in range(0,100000):
    for inner in range(0,100000):
        dis = scipy.spatial.distance.euclidean(X[outer],X[inner])
        tmp += str(dis)+" "
    tmp += "\n"        
    fout.write(tmp)
 fout.close()

Thank you.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

淤浪 2025-01-01 23:42:08

对代码进行一些小的优化（我假设您使用的是 Python 2.x）：

import numpy as np
import scipy.spatial.distance

X = np.genfromtxt("vector.txt", dtype=None) 
fout = open("adjacencymatrix.txt", "a")

for outer in xrange(0, 100000):
  fout.write(" ".join(str(scipy.spatial.distance.euclidean(X[outer], X[inner])) for inner in xrange(0, 100000)) + "\n")

fout.close()

我不建议在编写整个矩阵之前预先计算整个矩阵 - 尽管这样做可以让我们利用问题的模拟性并仅迭代一半的元素，但会消耗大量内存。我坚持你所拥有的 - 每行都是在计算后立即写入的。

这里真正的问题是输入数据巨大，距离计算将执行 100,000 x 100,000 = 10,000'000,000 次，并且任何微观优化都不会改变这一点。您确定您必须计算整个矩阵吗？

Some little optimizations over your code (and I'm assuming that you're using Python 2.x):

import numpy as np
import scipy.spatial.distance

X = np.genfromtxt("vector.txt", dtype=None) 
fout = open("adjacencymatrix.txt", "a")

for outer in xrange(0, 100000):
  fout.write(" ".join(str(scipy.spatial.distance.euclidean(X[outer], X[inner])) for inner in xrange(0, 100000)) + "\n")

fout.close()

I wouldn't recommend precomputing the whole matrix before writing it - although doing so would allow us to exploit the simmetry of the problem and iterate over only half of the elements, but it would consume a lot of memory. I'm sticking with what you had - each line is written as soon as is calculated.

The real problem here is that the input data is huge, the distance calculation will be executed 100,000 x 100,000 = 10,000'000,000 times, and no amount of micro-optimizations will change that. Are you sure that you have to calculate the whole matrix?

回复收藏 0 原文

把回忆走一遍 2025-01-01 23:42:08

编辑：更好地理解问题后完成重写。考虑到数据的大小等，这一点很棘手。到目前为止，我通过以下方式获得了最佳的加速结果：

import time
import numpy as np
from scipy import spatial
import multiprocessing as mp

pool = mp.Pool(4)

test_data = np.random.random(100000*500).reshape([100000,500])

outfile = open('/tmp/test.out','w')

def split(data,size):
    for i in xrange(0, len(data), size):
        yield data[i:i+size]

def distance(vecs):
    return spatial.distance.cdist(vecs,test_data)

chunks = list(split(test_data,100))
for chunk in chunks:
    t0 = time.time()
    distances = spatial.distance.cdist(chunk,test_data)
    outfile.write(' '.join([str(x) for x in distances]))
    print 'estimated: %.2f secs'%((time.time()-t0)*len(chunks))

因此，我尝试平衡数据集每个块的大小与内存开销。这让我估计需要 6,600 秒（约 110 分钟）才能完成。您可以看到我也开始考虑是否可以使用多处理池进行并行化。我的策略是异步处理每个块并将它们保存到不同的文本文件中，然后连接文件，但我必须回去工作。

Edit: Complete rewrote after understanding the question better. Given the size of the data, etc. this one is tricky. I got my best results at speedup with the following so far:

import time
import numpy as np
from scipy import spatial
import multiprocessing as mp

pool = mp.Pool(4)

test_data = np.random.random(100000*500).reshape([100000,500])

outfile = open('/tmp/test.out','w')

def split(data,size):
    for i in xrange(0, len(data), size):
        yield data[i:i+size]

def distance(vecs):
    return spatial.distance.cdist(vecs,test_data)

chunks = list(split(test_data,100))
for chunk in chunks:
    t0 = time.time()
    distances = spatial.distance.cdist(chunk,test_data)
    outfile.write(' '.join([str(x) for x in distances]))
    print 'estimated: %.2f secs'%((time.time()-t0)*len(chunks))

So I tried balancing the size of each chunk of the dataset vs. the memory overhead. This got me down to an estimated 6,600 secs to finish, or ~110 mins. You can see I also started seeing if I could parallelize using the multiprocessing pool. My strategy would have been to asynchronously process each chunk and save them to a different text file, then concatenate the files aftwerwards, but I got to get back to work.

回复收藏 0 原文

浊酒尽余欢 2025-01-01 23:42:08

（如果您使用的是 Python 2.x，请使用 xrange 而不是 range。）

要进行计算，您可以使用：

diff_matrix = numpy.subtract.outer(X, X)
result = numpy.sqrt(numpy.abs(diff_matrix))
# output the result.

请注意，要存储 < 的 100,000 × 100,000 矩阵code>double 您将需要 74.5 GB 内存，并且文本输出的文件大小可能需要两倍的内存。你真的需要整个矩阵吗？（您也可以并行计算，但这需要的不仅仅是 numpy。）

(If you're using Python 2.x, use xrange instead of range.)

To compute, you could use:

diff_matrix = numpy.subtract.outer(X, X)
result = numpy.sqrt(numpy.abs(diff_matrix))
# output the result.

Note that to store a 100,000 × 100,000 matrix of double you'll need 74.5 GB of memory, and maybe double that for the filesize of the text output. Do you really need the whole matrix? (You may also parallelize the computation, but that would need more than numpy.)

回复收藏 0 原文