numpy: computing x.T * x for a large matrix
In numpy, what's the most efficient way to compute x.T * x, where x is a large (200,000 x 1000) dense float32 matrix and .T is the transpose operator?

For the avoidance of doubt, the result is 1000 x 1000.

edit: In my original question I stated that np.dot(x.T, x) was taking hours. It turned out that I had some NaNs sneak into the matrix, and for some reason that was completely killing the performance of np.dot (any insights as to why?). This is now resolved, but the original question stands.
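A minimal sketch of the computation being asked about (scaled down from 200,000 x 1000 so it runs quickly; the dimensions here are illustrative, not the questioner's actual data):

```python
import numpy as np

# Scaled-down stand-in for the 200,000 x 1000 float32 matrix in the question.
rows, cols = 20_000, 100
x = np.random.rand(rows, cols).astype(np.float32)

# x.T @ x dispatches to BLAS (sgemm for float32); the result is cols x cols.
gram = x.T @ x
print(gram.shape)   # (100, 100)
print(gram.dtype)   # float32

# Worth checking before the big multiply: any NaNs in the input will
# poison the result (and, per the edit above, can also tank performance).
assert not np.isnan(x).any()
```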
This may not be the answer you're looking for, but one way to speed it up considerably is to use a GPU instead of your CPU. If you have a decently powerful graphics card around, it'll outperform your CPU any day, even if your system is very well tuned.

For nice integration with numpy, you could use theano (if your graphics card is made by nvidia). The calculation in the following code runs for me in a couple of seconds (although I have a very powerful graphics card):

I was going to wait to find out how long

>>> numpy.dot(x.T, x)

took by way of comparison, but I got bored... You can also try PyCuda or PyOpenCL (if you don't have an nvidia graphics card), although I don't know if their numpy support is as straightforward.
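The answer's code snippet did not survive extraction. A minimal sketch of what a Theano version might look like (an assumption on my part, using Theano's classic symbolic API; it requires Theano installed and, for the speedup, a GPU configured via `device=gpu` and `floatX=float32` in `.theanorc`):

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic matrix and the x.T * x expression.
X = T.matrix('X')
gram = T.dot(X.T, X)

# Compile to a callable; Theano moves the computation to the GPU
# when one is configured.
f = theano.function([X], gram)

x = np.random.rand(200_000, 1000).astype(np.float32)
result = f(x)   # 1000 x 1000 float32 array
```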
First, make sure you use an optimized BLAS/LAPACK; this can make a tremendous difference (up to one order of magnitude). If you use a threaded ATLAS, for example, it will use all your cores relatively efficiently (you need to use a recent ATLAS, though, and compiling ATLAS is a PITA).

As for why NaN slows everything down: that's pretty much unavoidable. NaN handling is much slower than "normal" floats at the CPU level: http://www.cygnus-software.com/papers/x86andinfinity.html. It depends on the CPU model, what kind of instruction set you are using, and of course the algorithm/implementation you are using.
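The slowdown itself is hardware- and BLAS-dependent, but the way a single NaN poisons the result is easy to demonstrate (a small sketch with illustrative dimensions):

```python
import numpy as np

x = np.random.rand(2_000, 100).astype(np.float32)
x[123, 45] = np.nan          # a single NaN sneaking into column 45

gram = x.T @ x               # 100 x 100 Gram matrix

# NaN propagates through the sums: every output element that reads
# column 45 of x is poisoned, i.e. all of row 45 and column 45 of
# the result (100 + 100 - 1 entries).
print(np.isnan(gram).sum())  # 199
```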
hmm, x is about 800 MB; assuming it needs the same for the result, are you sure you have enough physical memory and that it's not swapping?

Other than that, numpy should use a BLAS function, and even though the default library that numpy uses may be relatively slow, it should work OK for this size.
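The 800 MB figure is a quick back-of-the-envelope calculation (float32 is 4 bytes per element); note that the 1000 x 1000 result itself is actually tiny. `np.show_config()` shows which BLAS/LAPACK this numpy build is linked against:

```python
import numpy as np

rows, cols = 200_000, 1_000
itemsize = np.dtype(np.float32).itemsize   # 4 bytes

x_mb = rows * cols * itemsize / 1e6
result_mb = cols * cols * itemsize / 1e6

print(x_mb)       # 800.0 -- MB for x itself
print(result_mb)  # 4.0   -- the 1000 x 1000 result is small

# Prints the BLAS/LAPACK libraries numpy was built against.
np.show_config()
```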