The code below best illustrates my problem:
The output to the console (NB it takes ~8 minutes to run even the first test) shows the 512x512x512x16-bit array allocations consuming no more than expected (256MByte for each one), and looking at "top" the process generally remains sub-600MByte as expected.
However, while the vectorized version of the function is being called, the process expands to an enormous size (over 7GByte!). The most obvious explanation I can think of, that vectorize is converting the inputs and outputs to float64 internally, could only account for a couple of gigabytes, and in any case the vectorized function returns an int16 and the returned array is certainly int16. Is there some way to avoid this happening? Am I using/understanding vectorize's otypes argument wrong?
import numpy as np
import subprocess

def logmem():
    # Print the system-wide free memory (Linux only)
    subprocess.call('cat /proc/meminfo | grep MemFree', shell=True)

def fn(x):
    return np.int16(x*x)

def test_plain(v):
    print "Explicit looping:"
    logmem()
    r = np.zeros(v.shape, dtype=np.int16)
    for z in xrange(v.shape[0]):
        for y in xrange(v.shape[1]):
            for x in xrange(v.shape[2]):
                # Apply fn to the array element, not to the loop index
                r[z,y,x] = fn(v[z,y,x])
    print type(r[0,0,0])
    logmem()
    return r

vecfn = np.vectorize(fn, otypes=[np.int16])

def test_vectorize(v):
    print "Vectorize:"
    logmem()
    r = vecfn(v)
    print type(r[0,0,0])
    logmem()
    return r

logmem()
s = (512,512,512)
v = np.ones(s, dtype=np.int16)
logmem()
test_plain(v)
test_vectorize(v)
v = None
logmem()
I'm using whichever versions of Python/numpy are current on an amd64 Debian Squeeze system (Python 2.6.6, numpy 1.4.1).
Comments (2)
It is a basic problem of vectorisation that all intermediate values are also vectors. While this is a convenient way to get a decent speed enhancement, it can be very inefficient with memory usage, and will constantly thrash your CPU cache. To overcome this problem, you need to use an approach which has explicit loops running at compiled speed, not at Python speed. The best ways to do this are to use Cython, Fortran code wrapped with f2py, or numexpr. You can find a comparison of these approaches here, although this focuses more on speed than memory usage.
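As a small illustration of the intermediate-array point, the sketch below squares an int16 array two ways with plain NumPy: once letting the expression allocate its own result, and once writing straight into a preallocated output through the `out=` argument of `np.multiply`. The array shape and the helper names `square_with_temporary` and `square_in_place` are made up for this example, and it is written for a current Python 3 and NumPy rather than the versions in the question. It does not use Cython, f2py or numexpr, which are the compiled-loop options named above.

```python
import numpy as np

def square_with_temporary(v):
    # v * v allocates a new full-size array to hold the product.
    # copy=False avoids a second copy when the dtype is already int16.
    return (v * v).astype(np.int16, copy=False)

def square_in_place(v, out):
    # The product is written directly into `out`; no extra array is allocated.
    np.multiply(v, v, out=out)
    return out

v = np.full((64, 64, 64), 3, dtype=np.int16)   # small stand-in for the 512**3 array
r = np.empty_like(v)                           # preallocated result buffer

a = square_with_temporary(v)
b = square_in_place(v, r)
print(a.dtype, b.dtype, b is r)
```

For a single multiplication the saving is one array; it grows with the number of operations chained in an expression, since each step would otherwise produce its own full-size temporary.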
You can read the source code of vectorize(). It converts the array's dtype to object and calls np.frompyfunc() to create a ufunc from your Python function; the ufunc returns an object array, and finally vectorize() converts the object array to an int16 array.
An array uses a lot of memory when its dtype is object.
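A rough way to see the cost, assuming a 64-bit CPython build where each slot of an object array holds an 8-byte pointer: the sketch below runs a small function through `np.frompyfunc` and compares buffer sizes. The array is deliberately tiny, and `nbytes` counts only the pointer slots, not the Python objects they point to, so the real overhead is larger still. It is written for a current Python 3 and NumPy.

```python
import numpy as np

def fn(x):
    return np.int16(x * x)

# Wrap the Python function as a ufunc with 1 input and 1 output.
ufn = np.frompyfunc(fn, 1, 1)

v = np.ones((8, 8, 8), dtype=np.int16)
obj = ufn(v)                  # result array has dtype=object
res = obj.astype(np.int16)    # final conversion back to int16

print(obj.dtype, res.dtype)
print(v.nbytes, obj.nbytes, res.nbytes)

# Scale the per-element sizes up to the 512x512x512 array from the question.
n = 512 ** 3
print(n * v.itemsize // 2**20, "MiB as int16")
print(n * obj.itemsize // 2**20, "MiB of pointers alone as object")
```

With 8-byte pointers, each full-size object array needs 1024 MiB for the pointer slots alone, against 256 MiB for the int16 data, before counting the objects themselves.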
Using a Python function for element-wise calculation is slow, even when it has been converted to a ufunc by frompyfunc().
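One possible workaround, which is a suggestion added here and not part of the answer above, is to keep `np.vectorize` but feed it one slab at a time so that any temporary object arrays stay small. The sketch assumes the function is purely elementwise; `apply_in_slabs` is a hypothetical helper name, and the code is written for a current Python 3 and NumPy. It does nothing for speed, since the Python function is still called once per element.

```python
import numpy as np

def fn(x):
    return np.int16(x * x)

vecfn = np.vectorize(fn, otypes=[np.int16])

def apply_in_slabs(func, v, out_dtype):
    # Preallocate the final result, then process one slice along axis 0
    # at a time so temporaries are only as large as a single slab.
    out = np.empty(v.shape, dtype=out_dtype)
    for z in range(v.shape[0]):
        out[z] = func(v[z])
    return out

v = np.full((4, 16, 16), 2, dtype=np.int16)   # small stand-in for the real array
r = apply_in_slabs(vecfn, v, np.int16)
print(r.dtype, r.shape)
```

For a function as simple as squaring, native array arithmetic avoids the per-element Python call entirely; slab-wise application is mainly of interest when the function cannot be expressed that way.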