Optimizing adjacency matrix computation
X is a text file that contains 100,000 equal-size bit vectors (i.e. each row is a vector of 500 elements). I am generating an adjacency matrix (100,000 x 100,000) using the code below, but it's not optimized and very time consuming. How can I improve it?
import numpy as np
import scipy.spatial.distance

readFrom = "vector.txt"
fout = open("adjacencymatrix.txt", "a")
X = np.genfromtxt(readFrom, dtype=None)

for outer in range(0, 100000):
    tmp = ""  # reset the row buffer (without this the += below raises a NameError)
    for inner in range(0, 100000):
        dis = scipy.spatial.distance.euclidean(X[outer], X[inner])
        tmp += str(dis) + " "
    tmp += "\n"
    fout.write(tmp)
fout.close()
Thank you.
Comments (4)
Some small optimizations over your code (and I'm assuming that you're using Python 2.x):
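The answer's original snippet didn't survive the page extraction; a minimal sketch of the kind of micro-optimizations meant here (hoisting the attribute lookup, building each row with join, xrange instead of range), assuming Python 2.x, might look like this:

import numpy as np
import scipy.spatial.distance

X = np.genfromtxt("vector.txt", dtype=float)
n = len(X)
euclidean = scipy.spatial.distance.euclidean  # hoist the attribute lookup out of the loop

fout = open("adjacencymatrix.txt", "w")
for outer in xrange(n):  # xrange avoids allocating a 100,000-element list per pass
    row = X[outer]
    # " ".join is much faster than repeated string concatenation
    fout.write(" ".join(str(euclidean(row, X[inner])) for inner in xrange(n)))
    fout.write("\n")
fout.close()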
I wouldn't recommend precomputing the whole matrix before writing it: although doing so would let us exploit the symmetry of the problem and iterate over only half of the elements, it would consume a lot of memory. I'm sticking with what you had: each line is written as soon as it is calculated.
The real problem here is that the input data is huge and the distance calculation will be executed 100,000 x 100,000 = 10,000,000,000 times; no amount of micro-optimization will change that. Are you sure you have to calculate the whole matrix?
Edit: completely rewrote this after understanding the question better. Given the size of the data, this one is tricky. My best speedup results so far came from the following:
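The code this edit refers to was lost in extraction; a sketch of the chunked approach described below (the chunk size and file handling are assumptions) could look like:

import numpy as np
from scipy.spatial.distance import cdist

X = np.genfromtxt("vector.txt", dtype=float)  # 100,000 rows x 500 columns
chunk = 100  # rows per block; tune this against memory overhead

fout = open("adjacencymatrix.txt", "w")
for start in range(0, len(X), chunk):
    # distances from `chunk` rows to all rows, in one vectorized call
    block = cdist(X[start:start + chunk], X)  # shape (chunk, 100000)
    np.savetxt(fout, block, fmt="%g")
fout.close()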
So I tried balancing the size of each chunk of the dataset against the memory overhead. That got me down to an estimated 6,600 seconds to finish, or ~110 minutes. I also started looking at whether I could parallelize this with a multiprocessing pool. My strategy would have been to process each chunk asynchronously and save it to a separate text file, then concatenate the files afterwards, but I had to get back to work; a rough sketch of that idea follows.
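This parallel version is not the original author's code; it is a hedged sketch of the strategy just described, with the worker layout, chunk size, and part-file names all being assumptions:

import multiprocessing as mp
import numpy as np
from scipy.spatial.distance import cdist

X = None  # loaded once per worker process by the initializer

def init(path):
    global X
    X = np.genfromtxt(path, dtype=float)

def work(args):
    # each worker computes one block of rows and writes its own part file
    start, stop, outname = args
    np.savetxt(outname, cdist(X[start:stop], X), fmt="%g")
    return outname

if __name__ == "__main__":
    n, chunk = 100000, 1000
    jobs = [(s, min(s + chunk, n), "part_%06d.txt" % s)
            for s in range(0, n, chunk)]
    pool = mp.Pool(initializer=init, initargs=("vector.txt",))
    parts = pool.map(work, jobs)  # concatenate the part files in order afterwards
    pool.close()
    pool.join()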
(If you're using Python 2.x, use xrange instead of range.) To compute, you could use:
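The snippet referenced here is also missing; one plausible reconstruction, computing each row of distances with a single vectorized NumPy expression instead of an inner Python loop, is:

import numpy as np

X = np.genfromtxt("vector.txt", dtype=float)

fout = open("adjacencymatrix.txt", "w")
for row in X:
    # vectorized Euclidean distances from `row` to every row of X
    dists = np.sqrt(((X - row) ** 2).sum(axis=1))
    fout.write(" ".join("%g" % d for d in dists) + "\n")
fout.close()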
Note that to store a 100,000 × 100,000 matrix of double you'll need 74.5 GB of memory, and maybe double that for the file size of the text output. Do you really need the whole matrix? (You may also parallelize the computation, but that would need more than numpy.)
I have a hunch that the distance matrix might be calculated without explicit Python loops by using matrix operations.
The matrix product of X with its transpose seems promising: it computes the inner product of each pair of vectors and leaves the result in each cell of the resulting 100,000 x 100,000 matrix, and the inner product is closely related to the Euclidean distance (or its square). So I guess it's a matter of tweaking that to get the Euclidean distance between two vectors rather than their inner product. My instinct tells me that complex numbers might be useful here.
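As a sketch of that tweak (using the standard identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b rather than complex numbers; this is my reconstruction, not code from the thread), keeping in mind that the full 100,000 x 100,000 result would still need the ~74.5 GB noted above:

import numpy as np

X = np.genfromtxt("vector.txt", dtype=float)
sq = (X ** 2).sum(axis=1)  # squared norm of every row
G = X.dot(X.T)  # matrix of all pairwise inner products
# squared distances via ||a||^2 + ||b||^2 - 2*a.b, clamping round-off negatives
D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0))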
Maybe some brighter mind can shed some light here.