Slicing of a NumPy 2d array, or how do I extract an mxm submatrix from an nxn array (n>m)?
I want to slice a NumPy nxn array. I want to extract an arbitrary selection of m rows and columns of that array (i.e. without any pattern in the numbers of rows/columns), making it a new, mxm array. For this example let us say the array is 4x4 and I want to extract a 2x2 array from it.
Here is our array:
import numpy as np
x = np.arange(16).reshape(4, 4)
print(x)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
The rows and columns to remove are the same. The easiest case is when I want to extract a 2x2 submatrix that is at the beginning or at the end, i.e.:
In [33]: x[0:2,0:2]
Out[33]:
array([[0, 1],
       [4, 5]])
In [34]: x[2:,2:]
Out[34]:
array([[10, 11],
       [14, 15]])
But what if I need to remove another mixture of rows/columns? What if I need to remove the first and third rows and columns, thus extracting the submatrix [[5,7],[13,15]]? Any combination of rows/columns is possible. I read somewhere that I just need to index my array using arrays/lists of indices for both rows and columns, but that doesn't seem to work:
In [35]: x[[1,3],[1,3]]
Out[35]: array([ 5, 15])
I found one way, which is:
In [61]: x[[1,3]][:,[1,3]]
Out[61]:
array([[ 5,  7],
       [13, 15]])
The first issue with this is that it is hard to read, although I can live with that. If someone has a better solution, I'd certainly like to hear it.
The other thing is that I read on a forum that indexing arrays with arrays forces NumPy to make a copy of the selected data, so this could become a problem when working with large arrays. Why is that so, and how does this mechanism work?
Comments (7)
To answer this question, we have to look at how indexing a multidimensional array works in NumPy. Let's first say you have the array x from your question. The buffer assigned to x will contain 16 ascending integers from 0 to 15. If you access one element, say x[i,j], NumPy has to figure out the memory location of this element relative to the beginning of the buffer. This is done by calculating, in effect, i*x.shape[1]+j (and multiplying by the size of an int to get an actual memory offset).
If you extract a subarray by basic slicing like y = x[0:2,0:2], the resulting object will share the underlying buffer with x. But what happens if you access y[i,j]? NumPy can't use i*y.shape[1]+j to calculate the offset into the array, because the data belonging to y is not consecutive in memory.
NumPy solves this problem by introducing strides. When calculating the memory offset for accessing x[i,j], what is actually calculated is i*x.strides[0]+j*x.strides[1] (and this already includes the factor for the size of an int).
When y is extracted like above, NumPy does not create a new buffer, but it does create a new array object referencing the same buffer (otherwise y would just be equal to x). The new array object will have a different shape than x and maybe a different starting offset into the buffer, but will share the strides with x (in this case at least). This way, computing the memory offset for y[i,j] will yield the correct result.
But what should NumPy do for something like z=x[[1,3]]? The strides mechanism won't allow correct indexing if the original buffer is used for z. NumPy theoretically could add some more sophisticated mechanism than the strides, but this would make element access relatively expensive, somewhat defeating the whole idea of an array. In addition, a view wouldn't be a really lightweight object anymore.
This is covered in depth in the NumPy documentation on indexing.
Oh, and nearly forgot about your actual question: Here is how to make the indexing with multiple lists work as expected:
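A minimal sketch of one such indexing expression, assuming the 4x4 array x from the question (the names rows and cols are only illustrative): the row indices are written as a column, so that each row index is paired with every column index.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

rows = np.array([[1], [3]])   # shape (2, 1): one row index per output row
cols = np.array([1, 3])       # shape (2,): one column index per output column

sub = x[rows, cols]           # same as x[[[1], [3]], [1, 3]]
print(sub)

# For comparison: two flat lists are paired element by element,
# which selects only x[1, 1] and x[3, 3]
pairs = x[[1, 3], [1, 3]]
print(pairs)
```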
This works because the index arrays are broadcast to a common shape.
Of course, for this particular example, you can also make do with basic slicing:
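As a sketch, again assuming the array x from the question, a step of 2 starting at index 1 on both axes picks exactly rows and columns 1 and 3:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

# Start at index 1 and take every second row and every second column
sub = x[1::2, 1::2]
print(sub)

# Same values as the index-list version from the question
same = np.array_equal(sub, x[[1, 3]][:, [1, 3]])
print(same)
```

This shortcut only exists because the wanted indices happen to be evenly spaced.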
As Sven mentioned, x[[[0],[2]],[1,3]] will give back rows 0 and 2 matched with columns 1 and 3, while x[[0,2],[1,3]] will return the values x[0,1] and x[2,3] in an array.
There is a helpful function for doing the first example I gave, numpy.ix_. You can do the same thing as my first example with x[numpy.ix_([0,2],[1,3])]. This can save you from having to type all of those extra brackets.
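To see what numpy.ix_ does with rows 0 and 2 and columns 1 and 3, here is a small sketch assuming the 4x4 array x from the question. It builds index arrays whose shapes broadcast against each other, which is the bracket nesting done for you.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

rows, cols = np.ix_([0, 2], [1, 3])
print(rows.shape, cols.shape)    # a column of row indices, a row of column indices

block = x[np.ix_([0, 2], [1, 3])]
print(block)                     # rows 0 and 2 crossed with columns 1 and 3

paired = x[[0, 2], [1, 3]]
print(paired)                    # only x[0, 1] and x[2, 3]
```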
I don't think that x[[1,3]][:,[1,3]] is hard to read. If you want to be more clear about your intent, you can write the indexing out more explicitly.
I am not an expert in slicing, but typically, if you slice into an array and the selected values are contiguous, you get back a view where the stride value is changed.
E.g. in your inputs 33 and 34, although you get a 2x2 array, the row stride is still 4 elements. Thus, when you index the next row, the pointer moves to the correct position in memory.
Clearly, this mechanism doesn't carry over well to the case of an array of indices. Hence, NumPy has to make a copy. After all, many other matrix math functions rely on size, strides and contiguous memory allocation.
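The remark about the stride can be checked roughly as follows, assuming the array x from the question. The spelled-out form with an explicit colon for the column axis is my assumption of what "more explicit" could look like, not necessarily what was meant.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

top_left = x[0:2, 0:2]       # input 33
bottom_right = x[2:, 2:]     # input 34

for view in (top_left, bottom_right):
    # Row stride in elements: still the row length of the original 4x4 array
    print(view.shape, view.strides[0] // view.itemsize)

# Assumed explicit spelling: the colon says "all columns" in the first step
explicit = x[[1, 3], :][:, [1, 3]]
print(explicit)
```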
If you want to skip every other row and every other column, then you can do it with basic slicing. This returns a view, not a copy of your array, while z=x[(1,3),:][:,(1,3)] uses advanced indexing and thus returns a copy. Note that x is left unchanged when you modify z.
If you wish to select arbitrary rows and columns, then you can't use basic slicing. You'll have to use advanced indexing, using something like x[rows,:][:,columns], where rows and columns are sequences. This of course is going to give you a copy, not a view, of your original array. This is as one should expect, since a NumPy array uses contiguous memory (with constant strides), and there would be no way to generate a view with arbitrary rows and columns (since that would require non-constant strides).
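The difference between the view and the copy shows up as soon as you write to the result. A short sketch, assuming the array x from the question and arbitrary test values 100 and 0:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

y = x[1:4:2, 1:4:2]           # basic slicing: a view
y[0, 0] = 100                 # writes through to x[1, 1]
print(x[1, 1])

z = x[(1, 3), :][:, (1, 3)]   # advanced indexing: a copy
z[0, 0] = 0                   # only changes z
print(x[1, 1])                # x is unchanged by the write to z
```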
With NumPy, you can pass a slice for each component of the index, so your x[0:2,0:2] example above works.
If you just want to evenly skip columns or rows, you can pass slices with three components (i.e. start, stop, step).
Again, for your example above, this is x[1:4:2,1:4:2].
This is basically: slice in the first dimension, starting at index 1, stopping when the index is equal to or greater than 4, and adding 2 to the index in each pass. The same goes for the second dimension. Again: this only works for constant steps.
The syntax you came up with does something quite different internally: what x[[1,3]][:,[1,3]] actually does is create a new array including only rows 1 and 3 from the original array (done with the x[[1,3]] part), and then re-slice that, creating a third array, including only columns 1 and 3 of the previous array.
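Both routes can be laid side by side in a small sketch, assuming the array x from the question; the variable names are only illustrative.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

# start=1, stop=4, step=2, written as an explicit slice object
every_other = slice(1, 4, 2)
sliced = x[every_other, every_other]   # same as x[1:4:2, 1:4:2]

# The chained version goes through an intermediate array
rows_only = x[[1, 3]]                  # new array holding rows 1 and 3, all 4 columns
chained = rows_only[:, [1, 3]]         # another new array with columns 1 and 3

print(rows_only.shape)
print(np.array_equal(sliced, chained))
```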
I have a similar question here: "Writting in sub-ndarray of a ndarray in the most pythonian way. Python 2".
Following the solution from that post, the solution for your case uses ix_.
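A sketch of how the ix_ approach could look for this case, assuming the array x from the question. The names rows_to_keep and cols_to_keep are hypothetical, and the value -1 is just a marker. The same index also works on the left-hand side of an assignment, which is what the linked question is about.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

rows_to_keep = [1, 3]   # hypothetical names, chosen for illustration
cols_to_keep = [1, 3]
block = np.ix_(rows_to_keep, cols_to_keep)

sub = x[block]          # reading gives a new 2x2 array
print(sub)

x[block] = -1           # writing modifies x in place at those four positions
print(x)
```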
I'm not sure how efficient this is, but you can use range() to slice along both axes.
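One way this could look, as a sketch assuming the array x from the question. The range arguments (1, 4, 2) are my assumption, chosen to pick rows and columns 1 and 3, and the range is turned into a list before indexing to keep the behaviour obvious.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

idx = list(range(1, 4, 2))   # [1, 3]; assumed arguments for this example
sub = x[idx, :][:, idx]      # select the rows first, then the columns
print(sub)
```

Since this indexes with a list, it goes through advanced indexing and makes copies, unlike a start:stop:step slice.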