Iterating through a scipy.sparse vector (or matrix)

Posted on 2024-10-04 21:45:22

I'm wondering what the best way is to iterate nonzero entries of sparse matrices with scipy.sparse. For example, if I do the following:

from scipy.sparse import lil_matrix

x = lil_matrix( (20,1) )
x[13,0] = 1
x[15,0] = 2

c = 0
for i in x:
  print c, i
  c = c+1

the output is

0 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13   (0, 0) 1.0
14 
15   (0, 0) 2.0
16 
17 
18 
19  

so it appears the iterator is touching every element, not just the nonzero entries. I've had a look at the API

http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html

and searched around a bit, but I can't seem to find a solution that works.

Comments (6)

败给现实 2024-10-11 21:45:25

Try filter(lambda x:x, x) instead of x.
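
A minimal Python 3 sketch of this suggestion, assuming the 20x1 lil_matrix from the question. Iterating the matrix yields one 1x1 sparse row per matrix row, and scipy only defines a truth value for 1x1 matrices, so this is an illustration for single-column input rather than a general solution. Because filter alone drops the row number, the sketch pairs each row with its index via enumerate (the names `pairs` and `result` are only illustrative).

```python
from scipy.sparse import lil_matrix

x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

# Each item produced by enumerate(x) is (row_index, 1x1 sparse row).
# bool(row) is True only when that 1x1 row stores a nonzero entry.
pairs = filter(lambda pair: pair[1], enumerate(x))

# Pull the scalar out of each surviving 1x1 row.
result = [(i, float(row[0, 0])) for i, row in pairs]
print(result)  # [(13, 1.0), (15, 2.0)]
```

Note that this still visits every row of the matrix; it only hides the empty ones from the loop body.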

亚希 2024-10-11 21:45:24

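Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion of using nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. The current fastest is using_tocoo_izip (the code below is Python 2: itertools.izip and xrange do not exist in Python 3, where the built-in zip is already lazy):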

import scipy.sparse
import random
import itertools

def using_nonzero(x):
    rows,cols = x.nonzero()
    for row,col in zip(rows,cols):
        ((row,col), x[row,col])

def using_coo(x):
    cx = scipy.sparse.coo_matrix(x)    
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)

def using_tocoo(x):
    cx = x.tocoo()    
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)

def using_tocoo_izip(x):
    cx = x.tocoo()    
    for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
        (i,j,v)

N=200
x = scipy.sparse.lil_matrix( (N,N) )
for _ in xrange(N):
    x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)

yields these timeit results:

% python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
1000 loops, best of 3: 670 usec per loop
% python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
1000 loops, best of 3: 706 usec per loop
% python -mtimeit -s'import test' 'test.using_coo(test.x)'
1000 loops, best of 3: 802 usec per loop
% python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
100 loops, best of 3: 5.25 msec per loop

冰雪之触 2024-10-11 21:45:24

The fastest way should be by converting to a coo_matrix:

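import scipy.sparse

cx = scipy.sparse.coo_matrix(x)  # x is the sparse matrix to iterate over

# row, col and data are parallel arrays with one entry per stored element
for i, j, v in zip(cx.row, cx.col, cx.data):
    print("(%d, %d), %s" % (i, j, v))

白首有我共你 2024-10-11 21:45:24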

To loop over the various sparse matrix formats in scipy.sparse, I would use this small wrapper function (note that for Python 2 you are encouraged to use xrange and izip for better performance on large matrices):

from scipy.sparse import (isspmatrix_coo, isspmatrix_csc,
                          isspmatrix_csr, isspmatrix_lil)

def iter_spmatrix(matrix):
    """ Iterator over the stored elements of a ``scipy.sparse.*_matrix``

    This will always yield tuples of the form:
    (row, column, matrix-element)

    Currently this can iterate `coo`, `csc`, `lil` and `csr`; others may easily be added.

    Parameters
    ----------
    matrix : ``scipy.sparse.spmatrix``
      the sparse matrix whose non-zero elements are iterated
    """
    if isspmatrix_coo(matrix):
        for r, c, m in zip(matrix.row, matrix.col, matrix.data):
            yield r, c, m

    elif isspmatrix_csc(matrix):
        # indptr[c]:indptr[c+1] spans the stored entries of column c
        for c in range(matrix.shape[1]):
            for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                yield matrix.indices[ind], c, matrix.data[ind]

    elif isspmatrix_csr(matrix):
        # indptr[r]:indptr[r+1] spans the stored entries of row r
        for r in range(matrix.shape[0]):
            for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                yield r, matrix.indices[ind], matrix.data[ind]

    elif isspmatrix_lil(matrix):
        # rows[r] holds the column indices of row r, data[r] the matching values
        for r in range(matrix.shape[0]):
            for c, d in zip(matrix.rows[r], matrix.data[r]):
                yield r, c, d

    else:
        raise NotImplementedError("The iterator for this sparse matrix has not been implemented")

女皇必胜 2024-10-11 21:45:24

tocoo() materializes the entire matrix into a different structure, which is not the preferred modus operandi in Python 3. You can also consider this iterator, which is especially useful for large matrices.

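from itertools import chain, repeat

def iter_csr(matrix):
  # Consecutive differences of indptr give the number of stored entries per row.
  counts = matrix.indptr[1:] - matrix.indptr[:-1]
  # Lazily repeat each row index once per stored entry in that row.
  rows = chain.from_iterable(repeat(i, r) for (i, r) in enumerate(counts))
  for (row, col, val) in zip(rows, matrix.indices, matrix.data):
    yield (row, col, val)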

I have to admit that I'm using a lot of Python constructs which possibly should be replaced by NumPy constructs (especially enumerate).
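
As a rough sketch of what such a NumPy-based variant could look like (the name `iter_csr_numpy` is hypothetical, not part of scipy): np.diff on indptr gives the number of stored entries per row, and np.repeat expands that into one row index per stored entry. The trade-off is an assumption worth checking for your own data: this allocates one extra integer array with as many elements as there are stored entries, so it is less memory-frugal than the lazy chain/repeat version above.

```python
import numpy as np
from scipy.sparse import csr_matrix

def iter_csr_numpy(matrix):
    """Return an iterator of (row, col, value) over the stored entries of a CSR matrix."""
    # Number of stored entries in each row.
    counts = np.diff(matrix.indptr)
    # One row index per stored entry, in the same order as indices/data.
    rows = np.repeat(np.arange(matrix.shape[0]), counts)
    return zip(rows, matrix.indices, matrix.data)

m = csr_matrix(np.array([[0, 3, 0],
                         [0, 0, 0],
                         [4, 0, 5]]))
triples = [(int(r), int(c), int(v)) for r, c, v in iter_csr_numpy(m)]
print(triples)  # [(0, 1, 3), (2, 0, 4), (2, 2, 5)]
```

The empty middle row contributes a count of zero, so it simply produces no triples.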

NB:

In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
52.48686504364014
In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
70.19013023376465
In [45]: rather_dense_sparse_matrix
<99829x99829 sparse matrix of type '<class 'numpy.float16'>'
with 757622819 stored elements in Compressed Sparse Row format>

So yes, enumerate is somewhat slow(ish)

For the iterator:

In [47]: it = iter_csr(rather_dense_sparse_matrix)
In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
113.something something

So you decide whether this overhead is acceptable; in my case, tocoo caused memory overflows.

IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)

╰沐子 2024-10-11 21:45:24

I had the same problem and actually, if your only concern is speed, the fastest way (more than one order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()) and iterate over the nonzero elements of the dense matrix. (Though, of course, this approach requires a lot more memory.)
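
A minimal sketch of the dense route, assuming the small matrix from the question. It swaps in x.toarray(), which returns a plain ndarray, for the x.todense() mentioned above (todense() returns a numpy.matrix), and uses np.nonzero to locate the entries. No timing is claimed here; whether it is faster depends on the matrix size and density.

```python
import numpy as np
from scipy.sparse import lil_matrix

x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

# Dense copy: holds shape[0] * shape[1] values in memory.
dense = x.toarray()

# Index arrays of the nonzero positions, in row-major order.
rows, cols = np.nonzero(dense)

entries = [(int(i), int(j), float(dense[i, j])) for i, j in zip(rows, cols)]
print(entries)  # [(13, 0, 1.0), (15, 0, 2.0)]
```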
