I'm trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:

- Compute the vector of inner products between doc idx and all other documents
- Sort in descending order
- Return the "scores" and indices from the second one to the end (i.e. not itself)
The code I have at the moment is:
import h5py
import numpy as np

def get_related(tfidf, idx):
    ''' return the top documents '''
    # calculate inner product
    v = np.inner(tfidf, tfidf[idx].transpose())
    # sort
    vs = np.sort(v.toarray(), axis=0)[::-1]
    scores = vs[1:,]
    # sort indices
    vi = np.argsort(v.toarray(), axis=0)[::-1]
    idxs = vi[1:,]
    return (scores, idxs)
where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.
This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results then have to be reversed.
- Can this be done more efficiently?
- Can this be done without converting the sparse matrix using toarray()?
Comments (1)
I don't think there's any real need to skip the toarray. The v array will be only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. Also, it will be quite dense, since any term shared by two documents will give them a non-zero similarity. Sparse matrix representations only pay off when the matrix you're storing is very sparse (I've seen >80% figures for Matlab and assume that Scipy will be similar, though I don't have an exact figure).

The double sort can be skipped by doing the following.
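A minimal sketch of that idea, assuming v has already been flattened to a dense 1-D array of scores (e.g. v = v.toarray().ravel()):

order = np.argsort(v)[::-1]   # argsort once; reverse for descending order
idxs = order[1:]              # drop the first entry (the document itself)
scores = v[idxs]              # reuse the same order for the scores

Indexing v with the argsort result gives the sorted scores directly, so the separate np.sort call is never needed.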
Btw., your use of np.inner on sparse matrices is not going to work with the latest versions of NumPy; the safe way of taking an inner product of two sparse matrices is to use the sparse matrices' own multiplication.
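For example, a sketch assuming tfidf is a scipy.sparse matrix (such as csr_matrix), for which * means matrix multiplication:

# (n_docs x n_terms) * (n_terms x 1) -> (n_docs x 1), still sparse
v = tfidf * tfidf[idx].T
# or, equivalently:
v = tfidf.dot(tfidf[idx].T)
# densify only the small per-document score vector
v = v.toarray().ravel()

Putting the two parts together, get_related could look roughly like this (again a sketch, with numpy imported as np):

def get_related(tfidf, idx):
    ''' return (scores, idxs) of documents most similar to document idx '''
    # sparse product instead of np.inner; densify only the length-n_docs result
    v = (tfidf * tfidf[idx].T).toarray().ravel()
    order = np.argsort(v)[::-1]   # one argsort, reversed for descending order
    idxs = order[1:]              # drop the document itself
    return v[idxs], idxs

This keeps the big tf-idf matrix sparse throughout and only converts the length-n_docs score vector, which, as noted above, is cheap.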