Pythonic way to create a numpy array from a list of numpy arrays

I generate a list of one-dimensional numpy arrays in a loop and later convert this list to a 2-d numpy array. I would have preallocated a 2-d numpy array if I knew the number of items ahead of time, but I don't, so I put everything in a list.

The mock-up is below:

>>> from numpy import ones, array
>>> list_of_arrays = list(map(lambda x: x*ones(2), range(5)))
>>> list_of_arrays
[array([ 0.,  0.]), array([ 1.,  1.]), array([ 2.,  2.]), array([ 3.,  3.]), array([ 4.,  4.])]
>>> arr = array(list_of_arrays)
>>> arr
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])

My question is the following:

Is there a better way (performance-wise) to go about the task of collecting sequential numerical data (in my case numpy arrays) than putting them in a list and then making a numpy.array out of it (which creates a new object and copies the data)? Is there an "expandable" matrix data structure available in a well-tested module?

A typical size of my 2-d matrix would be between 100x10 and 5000x10 floats.

EDIT: In this example I'm using map, but in my actual application I have a for loop.
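
For reference, a minimal sketch of that for-loop pattern (the loop body here is illustrative, standing in for whatever actually produces each row):

import numpy as np

rows = []                       # plain Python list; appends are O(1) amortized
for x in range(5):              # in the real application the count is unknown
    rows.append(x * np.ones(2))
arr = np.array(rows)            # one copy into a contiguous (len(rows), 2) array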

6 Answers

要走干脆点 2024-08-25 18:40:33

Convenient way, using numpy.concatenate. I believe it's also faster than @unutbu's answer:

In [32]: import numpy as np 

In [33]: list_of_arrays = list(map(lambda x: x * np.ones(2), range(5)))

In [34]: list_of_arrays
Out[34]: 
[array([ 0.,  0.]),
 array([ 1.,  1.]),
 array([ 2.,  2.]),
 array([ 3.,  3.]),
 array([ 4.,  4.])]

In [37]: shape = list(list_of_arrays[0].shape)

In [38]: shape
Out[38]: [2]

In [39]: shape[:0] = [len(list_of_arrays)]

In [40]: shape
Out[40]: [5, 2]

In [41]: arr = np.concatenate(list_of_arrays).reshape(shape)

In [42]: arr
Out[42]: 
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])
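
The shape bookkeeping above can also be collapsed into one line; a minimal equivalent, assuming every array in the list has the same length:

arr = np.concatenate(list_of_arrays).reshape(len(list_of_arrays), -1)
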
故事和酒 2024-08-25 18:40:33

Suppose you know that the final array arr will never be larger than 5000x10.
Then you could pre-allocate an array of maximum size, populate it with data as
you go through the loop, and then use arr.resize to cut it down to the
discovered size after exiting the loop.

The tests below suggest doing so will be slightly faster than constructing intermediate
Python lists, no matter what the ultimate size of the array is.

Also, arr.resize de-allocates the unused memory, so the final (though maybe not the intermediate) memory footprint is smaller than what is used by python_lists_to_array.

This shows numpy_all_the_way is faster:

% python -mtimeit -s"import test" "test.numpy_all_the_way(100)"
100 loops, best of 3: 1.78 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(1000)"
100 loops, best of 3: 18.1 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(5000)"
10 loops, best of 3: 90.4 msec per loop

% python -mtimeit -s"import test" "test.python_lists_to_array(100)"
1000 loops, best of 3: 1.97 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(1000)"
10 loops, best of 3: 20.3 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(5000)"
10 loops, best of 3: 101 msec per loop

This shows numpy_all_the_way uses less memory:

% test.py
Initial memory usage: 19788
After python_lists_to_array: 20976
After numpy_all_the_way: 20348

test.py:

import numpy as np
import os


def memory_usage():
    # Linux-only: report this process's VmSize (in kB) from /proc
    pid = os.getpid()
    return next(line for line in open('/proc/%s/status' % pid).read().splitlines()
                if line.startswith('VmSize')).split()[-2]

N, M = 5000, 10


def python_lists_to_array(k):
    list_of_arrays = list(map(lambda x: x * np.ones(M), range(k)))
    arr = np.array(list_of_arrays)
    return arr


def numpy_all_the_way(k):
    arr = np.empty((N, M))  # preallocate at the known maximum size
    for x in range(k):
        arr[x] = x * np.ones(M)
    arr.resize((k, M))      # shrink to the size actually used
    return arr

if __name__ == '__main__':
    print('Initial memory usage: %s' % memory_usage())
    arr = python_lists_to_array(5000)
    print('After python_lists_to_array: %s' % memory_usage())
    arr = numpy_all_the_way(5000)
    print('After numpy_all_the_way: %s' % memory_usage())
终遇你 2024-08-25 18:40:33

Even simpler than @Gill Bates' answer, here is a one-liner:

np.stack(list_of_arrays, axis=0)
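
A quick check of what this gives, assuming the same list_of_arrays as in the question:

import numpy as np

list_of_arrays = [x * np.ones(2) for x in range(5)]
arr = np.stack(list_of_arrays, axis=0)  # joins the 1-d arrays along a new first axis
print(arr.shape)                        # (5, 2)
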
迷爱 2024-08-25 18:40:33

What you are doing is the standard way. A property of numpy arrays is that they need contiguous memory. The only possibility of "holes" that I can think of comes from the strides member of PyArrayObject, but that doesn't affect the discussion here. Since numpy arrays have contiguous memory and are "preallocated", adding a new row/column means allocating new memory, copying data, and then freeing the old memory. If you do that a lot, it is not very efficient.

One case where someone might not want to create a list and then convert it to a numpy array in the end is when the list contains a lot of numbers: a numpy array of numbers takes much less space than a native Python list of numbers (since the native Python list stores Python objects). For your typical array sizes, I don't think that is an issue.

When you create your final array from a list of arrays, you are copying all the data to a new location for the new (2-d in your example) array. This is still much more efficient than having a numpy array and doing next = numpy.vstack((next, new_row)) every time you get new data. vstack() will copy all the data for every "row".
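
To make that cost concrete, here is a sketch contrasting the repeated-vstack anti-pattern with collecting rows in a list first (the numbers here are illustrative, not from the original answer):

import numpy as np

M = 10

# Anti-pattern: each vstack reallocates and copies everything accumulated so
# far, so building k rows costs O(k^2) element copies in total.
acc = np.empty((0, M))
for x in range(1000):
    acc = np.vstack((acc, x * np.ones(M)))

# The standard way: append to a list (amortized O(1)), copy once at the end.
rows = [x * np.ones(M) for x in range(1000)]
arr = np.array(rows)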

There was a thread on the numpy-discussion mailing list some time ago that discussed the possibility of adding a new numpy array type allowing efficient extending/appending. There seemed to be significant interest in it at the time, although I don't know whether anything came of it. You might want to look at that thread.

I would say that what you're doing is very Pythonic, and efficient, so unless you really need something else (more space efficiency, maybe?), you should be okay. That is how I create my numpy arrays when I don't know the number of elements in the array in the beginning.

不交电费瞎发啥光 2024-08-25 18:40:33

I'll add my own version of ~unutbu's answer. It is similar to numpy_all_the_way, but it resizes dynamically when an index error occurs. I thought it would be a little faster for small data sets, but it's a little slower; the bounds checking slows things down too much.

initial_guess = 1000

def my_numpy_all_the_way(k):
    # Assumes np and M as defined in test.py above; make_test_data(k) is the
    # answerer's (unshown) helper yielding k rows of length M.
    arr = np.empty((initial_guess, M))
    for x, row in enumerate(make_test_data(k)):
        try:
            arr[x] = row
        except IndexError:
            # Out of room: double the row capacity and retry.
            arr.resize((arr.shape[0] * 2, arr.shape[1]))
            arr[x] = row
    arr.resize((k, M))  # trim down to the rows actually filled
    return arr
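
A note on the design choice: doubling the row capacity on each IndexError keeps the total copying linear in k (amortized O(1) per appended row), the same geometric-growth idea Python lists use internally, so the approach should scale to large unknown k even though the per-row try/except overhead dominates at the sizes tested here.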
梦开始←不甜 2024-08-25 18:40:33

Even simpler than @fnjn's answer:

np.vstack(list_of_arrays)