numpy: computing x.T * x for a large matrix
In numpy, what's the most efficient way to compute x.T * x, where x is a large (200,000 x 1000) dense float32 matrix and .T is the transpose operator?

For the avoidance of doubt, the result is 1000 x 1000.

edit: In my original question I stated that np.dot(x.T, x) was taking hours. It turned out that I had some NaNs sneak into the matrix, and for some reason that was completely killing the performance of np.dot (any insights as to why?). This is now resolved, but the original question stands.
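A minimal sketch of the computation being asked about (scaled down from 200,000 x 1000 so it runs quickly; the dimensions here are illustrative, not the questioner's actual data):

```python
import numpy as np

# Scaled-down stand-in for the 200,000 x 1000 float32 matrix in the question.
rows, cols = 20_000, 100
x = np.random.rand(rows, cols).astype(np.float32)

# x.T @ x dispatches to BLAS (sgemm for float32); the result is cols x cols.
gram = x.T @ x
print(gram.shape)   # (100, 100)
print(gram.dtype)   # float32

# Worth checking before the big multiply: any NaNs in the input will
# poison the result (and, per the edit above, can also tank performance).
assert not np.isnan(x).any()
```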
This may not be the answer you're looking for, but one way to speed it up considerably is to use a GPU instead of your CPU. If you have a decently powerful graphics card around, it'll outperform your CPU any day, even if your system is very well tuned.

For nice integration with numpy, you could use theano (if your graphics card is made by nvidia). The calculation in the following code runs for me in a couple of seconds (although I have a very powerful graphics card):

I was going to wait to find out how long

>>> numpy.dot(x.T, x)

took by way of comparison, but I got bored... You can also try PyCuda or PyOpenCL (if you don't have an nvidia graphics card), although I don't know if their numpy support is as straightforward.
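The answer's code snippet did not survive extraction. A minimal sketch of what a Theano version might look like (an assumption on my part, using Theano's classic symbolic API; it requires Theano installed and, for the speedup, a GPU configured via `device=gpu` and `floatX=float32` in `.theanorc`):

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic matrix and the x.T * x expression.
X = T.matrix('X')
gram = T.dot(X.T, X)

# Compile to a callable; Theano moves the computation to the GPU
# when one is configured.
f = theano.function([X], gram)

x = np.random.rand(200_000, 1000).astype(np.float32)
result = f(x)   # 1000 x 1000 float32 array
```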
First, make sure you use an optimized BLAS/LAPACK; this can make a tremendous difference (up to one order of magnitude). If you use a threaded ATLAS, for example, it will use all your cores relatively efficiently (you need to use a recent ATLAS, though, and compiling ATLAS is a PITA).

As for why NaN slows everything down: that's pretty much unavoidable. NaN handling is much slower than "normal" floats at the CPU level: http://www.cygnus-software.com/papers/x86andinfinity.html. It depends on the CPU model, what kind of instruction set you are using, and of course the algorithm/implementation you are using.
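The slowdown itself is hardware- and BLAS-dependent, but the way a single NaN poisons the result is easy to demonstrate (a small sketch with illustrative dimensions):

```python
import numpy as np

x = np.random.rand(2_000, 100).astype(np.float32)
x[123, 45] = np.nan          # a single NaN sneaking into column 45

gram = x.T @ x               # 100 x 100 Gram matrix

# NaN propagates through the sums: every output element that reads
# column 45 of x is poisoned, i.e. all of row 45 and column 45 of
# the result (100 + 100 - 1 entries).
print(np.isnan(gram).sum())  # 199
```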
hmm, x is about 800 MB; assuming it needs the same for the result, are you sure you have enough physical memory and that it's not swapping?

Other than that, numpy should use a BLAS function, and even though the default library that numpy uses may be relatively slow, it should work OK for this size.
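The 800 MB figure is a quick back-of-the-envelope calculation (float32 is 4 bytes per element); note that the 1000 x 1000 result itself is actually tiny. `np.show_config()` shows which BLAS/LAPACK this numpy build is linked against:

```python
import numpy as np

rows, cols = 200_000, 1_000
itemsize = np.dtype(np.float32).itemsize   # 4 bytes

x_mb = rows * cols * itemsize / 1e6
result_mb = cols * cols * itemsize / 1e6

print(x_mb)       # 800.0 -- MB for x itself
print(result_mb)  # 4.0   -- the 1000 x 1000 result is small

# Prints the BLAS/LAPACK libraries numpy was built against.
np.show_config()
```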