Memory overflow when using numpy load in a loop



Loading npz files in a loop causes a memory overflow (the memory used grows with the length of the file list).

None of the following seems to help:

  1. Deleting the variable which stores the data in the file.

  2. Using mmap.

  3. Calling gc.collect() (explicit garbage collection).

The following code should reproduce the phenomenon:

import numpy as np

# generate a file for the demo
X = np.random.randn(1000, 1000)
np.savez('tmp.npz', X=X)

# here comes the overflow (xrange is Python 2; use range on Python 3):
for i in xrange(1000000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error

In my real application the loop runs over a list of files, and the overflow exceeds 24 GB of RAM!
Please note that this was tried on Ubuntu 11.10, with both numpy 1.5.1 and 1.6.0.
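
To watch the growth directly, one can print the process's peak resident set size every few thousand iterations. A minimal sketch using the standard resource module (Unix-only; on Linux, ru_maxrss is reported in kilobytes):

import resource
import numpy as np

for i in range(10000):
    data = np.load('tmp.npz')
    data.close()
    if i % 1000 == 0:
        # peak resident set size so far (kilobytes on Linux)
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)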

I have filed a report as numpy ticket 2048, but this may be of wider interest, so I am posting it here as well (moreover, I am not sure this is a bug; it may be the result of my bad programming).

SOLUTION (by HYRY):

The command

del data.f

should precede the command

data.close()

For more information, and a method for tracking down this kind of leak, please read HYRY's answer below.
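
In other words, the reproduction loop above becomes (a minimal sketch of the fix; xrange is Python 2):

for i in xrange(1000000):
    data = np.load('tmp.npz')
    del data.f        # break the NpzFile <-> BagObj reference cycle
    data.close()      # avoid the "too many files are open" error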


Answer 1 (by HYRY):


I think this is a bug, and maybe I found the solution: call "del data.f".

for i in xrange(10000000):
    data = np.load('tmp.npz')
    del data.f        # break the reference cycle before closing
    data.close()      # avoid the "too many files are open" error

To find this kind of memory leak, you can use the following code:

import numpy as np
import gc

# run the leaking loop first:
for i in xrange(10000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error

# count the live objects tracked by the garbage collector, by type name
d = dict()
for o in gc.get_objects():
    name = type(o).__name__
    if name not in d:
        d[name] = 1
    else:
        d[name] += 1

# print the type counts in ascending order (Python 2 syntax)
items = d.items()
items.sort(key=lambda x: x[1])
for key, value in items:
    print key, value

After running the test program, I created a dict and counted the objects in gc.get_objects(). Here is the output:

...
wrapper_descriptor 1382
function 2330
tuple 9117
BagObj 10000
NpzFile 10000
list 20288
dict 21001

From the result we can see that something is wrong with BagObj and NpzFile: 10000 instances of each are still alive. Looking at the numpy source:

class NpzFile(object):
    def __init__(self, fid, own_fid=False):
        ...
        self.zip = _zip
        self.f = BagObj(self)
        if own_fid:
            self.fid = fid
        else:
            self.fid = None

    def close(self):
        """
        Close the file.

        """
        if self.zip is not None:
            self.zip.close()
            self.zip = None
        if self.fid is not None:
            self.fid.close()
            self.fid = None

    def __del__(self):
        self.close()

class BagObj(object):
    def __init__(self, obj):
        self._obj = obj
    def __getattribute__(self, key):
        try:
            return object.__getattribute__(self, '_obj')[key]
        except KeyError:
            raise AttributeError, key

NpzFile has __del__(), NpzFile.f is a BagObj, and BagObj._obj refers back to the NpzFile: this is a reference cycle. Because NpzFile defines __del__(), Python 2's cycle collector treats both the NpzFile and the BagObj as uncollectable. There is an explanation in the Python documentation: http://docs.python.org/library/gc.html#gc.garbage
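
The mechanism is easy to reproduce in isolation. Below is a minimal sketch (the Leaky class is made up for illustration): on Python 2, a cycle whose objects define __del__() is never freed, and gc.collect() moves it into gc.garbage instead; Python 3.4+ (PEP 442) can collect such cycles.

import gc

class Leaky(object):       # hypothetical stand-in for NpzFile/BagObj
    def __init__(self):
        self.partner = None
    def __del__(self):     # a finalizer, like NpzFile.__del__
        pass

a = Leaky()
b = Leaky()
a.partner = b              # a -> b
b.partner = a              # b -> a: a reference cycle with finalizers
del a, b                   # only the cycle now keeps the pair alive

gc.collect()
# Python 2: the pair is uncollectable and lands in gc.garbage
# Python 3.4+ (PEP 442): the cycle is collected, gc.garbage stays empty
print(gc.garbage)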

So, to break the reference cycle, you need to call "del data.f" before calling data.close().

Answer 2:


The solution I found (Python 3.8 and numpy 1.18.5):

import gc          # garbage collector interface
import numpy as np

for i in range(1000):
    data = np.load('tmp.npy')

    # process data

    del data       # drop the reference to the loaded array
    gc.collect()   # then force a collection pass
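
For .npz archives, recent NumPy versions also make this bookkeeping unnecessary: the NpzFile object returned by np.load supports the context manager protocol, so the archive is closed automatically. A sketch (note this does not apply to plain .npy files, where np.load returns an ndarray directly):

import numpy as np

for i in range(1000):
    with np.load('tmp.npz') as data:
        X = data['X']   # work with the array while the archive is open
    # the archive is closed automatically when the with block exits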
