Slicing a sparse (scipy) matrix

Posted on 2024-12-07 03:12:18

I would appreciate any help understanding the following behavior when slicing a lil_matrix (A) from the scipy.sparse package.

I would like to extract a submatrix based on arbitrary index lists for both rows and columns.

When I used these two lines of code:

x1 = A[list1, :]
x2 = x1[:, list2]

everything was fine and I could extract the right submatrix.

When I tried to do this in one line, it failed (the returned matrix was empty):

x = A[list1, list2]

Why is this so? I have used a similar command in MATLAB, and there it works.
So why not just use the first version, since it works? Because it seems to be quite time-consuming. Since I have to go through a large number of entries, I would like to speed it up using a single command. Maybe I am using the wrong sparse matrix type... Any ideas?

芸娘子的小脾气 2024-12-14 03:12:19

For me, the solution from unutbu works well but is slow.

I found this to be a fast alternative:

A = B.tocsr()[np.array(list1),:].tocsc()[:,np.array(list2)]

The rows and columns are cut separately, but each time the matrix is first converted to the sparse format that is fastest for that kind of indexing: CSR for rows, CSC for columns.

In my test environment this code is 1000 times faster than the other one.

I hope I have not said anything wrong or made a mistake.
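
As a rough illustration of the two-step CSR/CSC selection above, here is a minimal sketch on a tiny matrix. The 6 x 6 dense reference and the values of list1 and list2 are made up for the example and are not part of the original answer.

```python
import numpy as np
from scipy import sparse

# Small dense reference so the selected block can be checked directly.
# (Hypothetical example data, not from the original post.)
dense = np.arange(36).reshape(6, 6)
B = sparse.lil_matrix(dense)

list1 = [1, 4, 5]  # rows to keep (example values)
list2 = [0, 2]     # columns to keep (example values)

# Select rows while in CSR format, then columns while in CSC format.
A = B.tocsr()[np.array(list1), :].tocsc()[:, np.array(list2)]

# The same block taken from the dense array, for comparison.
expected = dense[np.ix_(list1, list2)]
assert (A.toarray() == expected).all()
```

Whether this is actually faster than other approaches depends on the matrix size, its density and the SciPy version, so it is worth timing on your own data.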

雪化雨蝶 2024-12-14 03:12:19

Simultaneous indexing as in B[arr1, arr2] does work, and it is faster than listener's solution on my machine. See In [5] in the Jupyter example below. To compare it with that answer, refer to In [6]. Furthermore, my solution doesn't need the .tocsc() conversion, which makes it more readable IMO.

Please note that for B[arr1, arr2] to work, arr1 and arr2 must be broadcastable numpy arrays.
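
To make "broadcastable" concrete, here is a small sketch, assuming a made-up 5 x 5 matrix and example index lists. It shows the two index arrays, the common shape they broadcast to, and the resulting submatrix.

```python
import numpy as np
from scipy import sparse

# Hypothetical example data, chosen only so the result is easy to verify.
dense = np.arange(1, 26).reshape(5, 5)
B = sparse.lil_matrix(dense)
list1 = [1, 4, 3]
list2 = [2, 4]

arr1 = np.array(list1)[:, None]  # shape (3, 1): one row index per output row
arr2 = np.array(list2)[None, :]  # shape (1, 2): one column index per output column

# The shape both index arrays broadcast to is the shape of the result.
common_shape = np.broadcast(arr1, arr2).shape

A = B.tocsr()[arr1, arr2]
print(common_shape, A.shape)
```

Two plain 1-D arrays of different lengths, such as np.array(list1) and np.array(list2) here, do not broadcast against each other, which is why the extra axis is needed.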

A much faster solution, however, is to use B[list1][:, list2], as pointed out by unutbu. See In [7] below.

In [1]: from scipy import sparse
      : import numpy as np

In [2]: B = sparse.rand(1000, 1000, .1, format='lil')
      : list1 = [1, 4, 6, 8]
      : list2 = [2, 4]

In [3]: arr1 = np.array(list1)[:, None]  # make arr1 an (n x 1) array
      : arr1
Out[3]:
array([[1],
       [4],
       [6],
       [8]])

In [4]: arr2 = np.array(list2)[None, :]  # make arr2 a (1 x m) array
      : arr2
Out[4]: array([[2, 4]])

In [5]: %timeit A = B.tocsr()[arr1, arr2]
100 loops, best of 3: 13.1 ms per loop

In [6]: %timeit A = B.tocsr()[np.array(list1),:].tocsc()[:,np.array(list2)]
100 loops, best of 3: 14.6 ms per loop

In [7]: %timeit B[list1][:, list2]
1000 loops, best of 3: 205 µs per loop

茶花眉 2024-12-14 03:12:19

Slicing uses this syntax:

a[1:4]

For a = array([1,2,3,4,5,6,7,8,9]), the result is

array([2, 3, 4])

The first number in the slice is the index of the first value to keep, and the second is the index of the first value not to keep.
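
The half-open slice rule can be checked with a few lines of NumPy. This is only a sketch of the example above, with the start and stop values given names for clarity.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

start, stop = 1, 4       # keep indices 1, 2, 3; index 4 is excluded
sliced = a[start:stop]
print(sliced)

# A slice start:stop always has stop - start elements (when both are in range).
assert len(sliced) == stop - start
```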

If you use lists on both sides, it means that your array has as many dimensions as the length of the lists.

So, with your syntax, you will probably need something like this:

x = A[list1,:,list2]

depending on the shape of A.

I hope this helps.

怀念你的温柔 2024-12-14 03:12:18

The method you are already using,

A[list1, :][:, list2]

seems to be the fastest way to select the desired values from a sparse matrix. See below for a benchmark.

However, to answer your question about how to select values from arbitrary rows and columns of A with a single index,
you would need to use so-called "advanced indexing":

A[np.array(list1)[:,np.newaxis], np.array(list2)]

With advanced indexing, if arr1 and arr2 are NumPy ndarrays (broadcast to a common shape), the (i,j) component of A[arr1, arr2] equals

A[arr1[i,j], arr2[i,j]]

Thus you want arr1[i,j] to equal list1[i] for all j, and
arr2[i,j] to equal list2[j] for all i.

That can be arranged with the help of broadcasting by setting
arr1 = np.array(list1)[:,np.newaxis] and arr2 = np.array(list2).

The shape of arr1 is (len(list1), 1), while the shape of arr2 is
(len(list2),), which broadcasts to (1, len(list2)) since new axes are added
on the left automatically when needed.

Each array can then be broadcast further to shape (len(list1), len(list2)).
This is exactly what we need for
A[arr1[i,j], arr2[i,j]] to make sense, since we want (i,j) to run over all possible indices of a result array of shape (len(list1), len(list2)).
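
The indexing rule described above can be verified element by element on a small dense NumPy array. This is a sketch with made-up data; np.broadcast_arrays is used only to make the broadcast index arrays explicit.

```python
import numpy as np

# Hypothetical 4 x 5 example array and index lists.
A = np.arange(20).reshape(4, 5)
list1 = [0, 2, 3]
list2 = [1, 4]

arr1 = np.array(list1)[:, np.newaxis]  # shape (3, 1)
arr2 = np.array(list2)                 # shape (2,), broadcasts as (1, 2)

result = A[arr1, arr2]                 # shape (3, 2)

# Materialise the broadcast index arrays, both of shape (3, 2).
b1, b2 = np.broadcast_arrays(arr1, arr2)

for i in range(len(list1)):
    for j in range(len(list2)):
        # Each output element is A[arr1[i,j], arr2[i,j]] == A[list1[i], list2[j]].
        assert result[i, j] == A[b1[i, j], b2[i, j]] == A[list1[i], list2[j]]
```

In NumPy, passing two plain lists of equal length instead, as in A[[0, 2], [1, 4]], pairs the indices up and returns only the elements A[0, 1] and A[2, 4] rather than a 2 x 2 block. This is the main difference from MATLAB-style indexing.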

Here is a microbenchmark for one test case which suggests that A[list1, :][:, list2] is the fastest option:

In [32]: %timeit orig(A, list1, list2)
10 loops, best of 3: 110 ms per loop

In [34]: %timeit using_listener(A, list1, list2)
1 loop, best of 3: 1.29 s per loop

In [33]: %timeit using_advanced_indexing(A, list1, list2)
1 loop, best of 3: 1.8 s per loop

Here is the setup I used for the benchmark:

import numpy as np
import scipy.sparse as sparse
np.random.seed(1)  # seed NumPy's global RNG, which np.random.choice and sparse.rand use

def setup(N):
    A = sparse.rand(N, N, .1, format='lil')
    list1 = np.random.choice(N, size=N//10, replace=False).tolist()
    list2 = np.random.choice(N, size=N//20, replace=False).tolist()
    return A, list1, list2

def orig(A, list1, list2):
    return A[list1, :][:, list2]

def using_advanced_indexing(A, list1, list2):
    B = A.tocsc()  # or `.tocsr()`
    B = B[np.array(list1)[:, np.newaxis], np.array(list2)]
    return B

def using_listener(A, list1, list2):
    """https://stackoverflow.com/a/26592783/190597 (listener)"""
    B = A.tocsr()[list1, :].tocsc()[:, list2]
    return B

N = 10000
A, list1, list2 = setup(N)
B = orig(A, list1, list2)
C = using_advanced_indexing(A, list1, list2)
D = using_listener(A, list1, list2)
assert np.allclose(B.toarray(), C.toarray())
assert np.allclose(B.toarray(), D.toarray())
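
Outside IPython, the same comparison can be run with the standard timeit module. The following is a scaled-down sketch and not the original benchmark: the matrix size N = 400, the way the random matrix is built, and the repeat counts are all assumptions chosen so that it finishes quickly.

```python
import timeit

import numpy as np
import scipy.sparse as sparse

rng = np.random.default_rng(1)
N = 400  # much smaller than the N = 10000 used above (assumption for a quick run)

# Roughly 10% non-zero entries, built from a dense array for reproducibility.
mask = rng.random((N, N)) < 0.1
dense = np.where(mask, rng.random((N, N)), 0.0)
A = sparse.lil_matrix(dense)
list1 = rng.choice(N, size=N // 10, replace=False).tolist()
list2 = rng.choice(N, size=N // 20, replace=False).tolist()

def orig(A, list1, list2):
    return A[list1, :][:, list2]

def using_advanced_indexing(A, list1, list2):
    return A.tocsc()[np.array(list1)[:, np.newaxis], np.array(list2)]

def using_listener(A, list1, list2):
    return A.tocsr()[list1, :].tocsc()[:, list2]

reference = dense[np.ix_(list1, list2)]
calls = 5
timings = {}
for func in (orig, using_advanced_indexing, using_listener):
    # Check correctness before timing.
    assert np.allclose(func(A, list1, list2).toarray(), reference)
    best = min(timeit.repeat(lambda: func(A, list1, list2), number=calls, repeat=3))
    timings[func.__name__] = best / calls

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

The ranking may differ from the numbers quoted above, since it depends on the matrix size, the density and the SciPy version in use.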
