Reverse sort and argsort in Python

Posted 2024-12-20 12:37:09 · 817 words · 5 views · 0 comments

I'm trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:

  • Compute vector of inner products between doc idx and all other documents
  • Sort in descending order
  • Return the "scores" and indices from the second one to the end (i.e. not itself)

The code I have at the moment is:

import h5py
import numpy as np

def get_related(tfidf, idx) :
    ''' return the top documents '''

    # calculate inner product   
    v = np.inner(tfidf, tfidf[idx].transpose())

    # sort
    vs = np.sort(v.toarray(), axis=0)[::-1]
    scores = vs[1:,]

    # sort indices
    vi = np.argsort(v.toarray(), axis=0)[::-1]
    idxs = vi[1:,] 

    return (scores, idxs)

where tfidf is a sparse matrix of type '<type 'numpy.float64'>' (that is, its elements are numpy.float64).

This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results have to then be reversed.

  • Can this be done more efficiently?
  • Can this be done without converting the sparse matrix using toarray()?

Comments (1)

木格 2024-12-27 12:37:09

I don't think there's any real need to skip the toarray() call. The v array will be only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. Also, it will be quite dense, since any term shared by two documents gives them a non-zero similarity. Sparse matrix representations only pay off when the matrix you're storing is very sparse (I've seen >80% figures for MATLAB and assume that SciPy will be similar, though I don't have an exact figure).
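To make the density argument concrete, here is a minimal sketch on a made-up 4-document, 6-term matrix (the values and the variable names tfidf_density and v_density are mine, purely for illustration). It assumes tfidf is a scipy.sparse CSR matrix with one row per document.

```python
import numpy as np
from scipy import sparse

# Made-up 4-document, 6-term matrix (illustrative values only).
tfidf = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]))

# Similarity of document 0 to every document: an (n_docs x 1) sparse column.
v = tfidf.dot(tfidf[0, :].T)

# Fraction of stored non-zero entries in each matrix.
tfidf_density = tfidf.nnz / (tfidf.shape[0] * tfidf.shape[1])
v_density = v.nnz / (v.shape[0] * v.shape[1])

print(tfidf_density, v_density)
```

In this toy case every document shares at least one term with document 0, so v has no zeros at all even though tfidf itself is mostly zeros. Real corpora will not be this extreme, but the trend is the same.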

The double sort can be skipped by doing

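v = v.toarray().ravel()      # flatten the (n_docs, 1) column to 1-D
vi = np.argsort(v)[::-1]     # indices in descending order of score
vs = v[vi]                   # scores in the same order, no second sort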
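As a small illustration of the single-sort idea on a plain 1-D NumPy array (the sample scores are made up):

```python
import numpy as np

# Made-up similarity scores for five documents (illustrative values only).
v = np.array([0.2, 0.9, 0.0, 0.5, 0.7])

# One argsort gives the ascending ordering; reversing it gives descending order.
vi = np.argsort(v)[::-1]

# Indexing with that ordering yields the sorted scores without a second sort.
vs = v[vi]

print(vi)
print(vs)
```

Reversing with [::-1] only creates a view, so the cost is essentially that of the one argsort.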

By the way, your use of np.inner on sparse matrices is not going to work with the latest versions of NumPy; the safe way of taking an inner product of two sparse matrices is

v = tfidf * tfidf[idx, :].transpose()
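Putting the pieces together, a possible version of get_related might look like the sketch below. It is not a definitive implementation and rests on a few assumptions:

- tfidf is a scipy.sparse CSR matrix with one row per document.
- The row vector is transposed before the product so that the shapes line up as (n_docs × n_terms) times (n_terms × 1).
- I use .dot() and .T rather than the * operator.
- Instead of dropping the first sorted entry, this sketch removes the query document by its index. That is my own choice, because with unnormalized rows the document itself is not guaranteed to come first.

```python
import numpy as np
from scipy import sparse


def get_related(tfidf, idx):
    """Return (scores, indices) of the other documents, most similar first."""
    # Sparse product: (n_docs x n_terms) . (n_terms x 1) -> (n_docs x 1)
    v = tfidf.dot(tfidf[idx, :].T)
    # The result has only n_docs entries, so densify and flatten it.
    v = v.toarray().ravel()
    # Single argsort, reversed for descending order.
    order = np.argsort(v)[::-1]
    # Drop the query document itself by index rather than by position.
    order = order[order != idx]
    return v[order], order


# Made-up, unnormalized 4-document example (illustrative values only).
tfidf = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
]))
scores, idxs = get_related(tfidf, 0)
print(scores, idxs)
```

In this example document 3 scores higher against document 0 than document 0 does against itself, which is why the sketch filters on the index. If your rows are L2-normalized, dropping the first entry as in the question should normally give the same result.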