numpy loadtxt function seems to consume too much memory

Posted 2024-12-12 06:29:02

When I load an array using numpy.loadtxt, it seems to take too much memory. E.g.

a = numpy.zeros(int(1e6))

causes an increase of about 8 MB in memory (as seen in htop, which matches 8 bytes × 1 million ≈ 8 MB). On the other hand, if I save and then load this array

numpy.savetxt('a.csv', a)
b = numpy.loadtxt('a.csv')

my memory usage increases by about 100 MB! I observed this with htop, both in the IPython shell and while stepping through code using Pdb++.

Any idea what's going on here?
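
For reference, a rough way to reproduce the measurement without watching htop is sketched below; it uses the standard-library tracemalloc module, which only sees allocations made through Python's allocators (recent numpy versions do report array buffers there), so treat the numbers as approximate:

import tracemalloc
import numpy

numpy.savetxt('a.csv', numpy.zeros(int(1e6)))

tracemalloc.start()
b = numpy.loadtxt('a.csv')
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# 'current' is roughly the final array; 'peak' also includes the temporary
# Python lists and float objects built up while parsing the text.
print('current: %.1f MB, peak: %.1f MB' % (current / 1e6, peak / 1e6))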

After reading jozzas's answer, I realized that if I know the array size ahead of time (say 'a' is an m x n array), there is a much more memory-efficient way to do this:

import csv
import numpy

b = numpy.zeros((m, n))
with open('a.csv', 'r') as f:
    # note: numpy.savetxt writes space-delimited text by default, so the
    # reader's delimiter may need to match how the file was written
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        b[i, :] = numpy.array(row, dtype=float)
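
If m and n are not known in advance, one option is a cheap first pass over the file just to count rows and columns before allocating. A minimal sketch, assuming every row has the same number of whitespace-separated fields (which is what numpy.savetxt produces by default):

with open('a.csv', 'r') as f:
    first_line = f.readline()
    n = len(first_line.split())   # columns in the first row
    m = 1 + sum(1 for _ in f)     # remaining rows plus the first

# b = numpy.zeros((m, n)) and the loop above can then be used as before.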

Comments (2)

复古式 2024-12-19 06:29:03

Here is what I ended up doing to solve this problem. It works even if you don't know the shape ahead of time. It converts each chunk to floats first and then combines the arrays (as opposed to @JohnLyon's answer, which combines the arrays of strings and then converts to float). This used an order of magnitude less memory for me, although it was perhaps a bit slower. However, I literally did not have enough memory to use np.loadtxt, so if you are short on memory, this will be better:

import numpy as np

def numpy_loadtxt_memory_friendly(the_file, max_bytes=1000000, **loadtxt_kwargs):
    numpy_arrs = []
    with open(the_file, 'rb') as f:
        i = 0
        while True:
            print(i)  # progress: number of lines read so far
            # Read roughly max_bytes worth of complete lines per chunk.
            some_lines = f.readlines(max_bytes)
            if len(some_lines) == 0:
                break
            # Parse just this chunk, so only one chunk of text is in memory at a time.
            vec = np.loadtxt(some_lines, **loadtxt_kwargs)
            if len(vec.shape) < 2:
                # A single-line chunk comes back 1-D; treat it as one row
                # (assumes the data has more than one column).
                vec = vec.reshape(1, -1)
            numpy_arrs.append(vec)
            i += len(some_lines)
    return np.concatenate(numpy_arrs, axis=0)
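
A quick usage sketch (the file names here are hypothetical; any regular np.loadtxt keyword argument can be forwarded through **loadtxt_kwargs):

# Hypothetical large, whitespace-delimited numeric file.
data = numpy_loadtxt_memory_friendly('big_table.txt', max_bytes=5000000)
print(data.shape)

# e.g. comma-separated input, selected columns:
# data = numpy_loadtxt_memory_friendly('big_table.csv', delimiter=',', usecols=(0, 2))
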
狠疯拽 2024-12-19 06:29:02

Saving this array of floats to a text file creates a 24 MB text file. When you re-load this, numpy goes through the file line by line, parsing the text and recreating the objects.

I would expect memory usage to spike during this time, as numpy doesn't know how big the resulting array needs to be until it gets to the end of the file, so I'd expect at least 24 MB (the text) + 8 MB (the final array) + other temporary memory to be in use.

Here's the relevant bit of the numpy code, from /lib/npyio.py:

    # Parse each line, including the first
    for i, line in enumerate(itertools.chain([first_line], fh)):
        vals = split_line(line)
        if len(vals) == 0:
            continue
        if usecols:
            vals = [vals[i] for i in usecols]
        # Convert each value according to its column and store
        items = [conv(val) for (conv, val) in zip(converters, vals)]
        # Then pack it according to the dtype's nesting
        items = pack_items(items, packing)
        X.append(items)

    #...A bit further on
    X = np.array(X, dtype)

This additional memory usage shouldn't be a concern, as this is just the way Python works: while your Python process appears to be using 100 MB of memory, internally it keeps track of which items are no longer used and will reuse that memory. For example, if you were to re-run this save-load procedure in the same program (save, load, save, load), your memory usage would not grow to 200 MB.
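
A minimal sketch of how that could be checked, assuming Linux, where resource.getrusage reports peak RSS in kilobytes (the cycle count is arbitrary):

import resource
import numpy

def peak_rss_mb():
    # Peak resident set size of this process so far (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

a = numpy.zeros(int(1e6))
for cycle in range(4):
    numpy.savetxt('a.csv', a)
    b = numpy.loadtxt('a.csv')
    # The peak jumps on the first cycle and then stays roughly flat,
    # because the memory freed after each load is reused.
    print(cycle, round(peak_rss_mb(), 1))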
