I'm trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:

- Compute the vector of inner products between doc idx and all other documents
- Sort in descending order
- Return the "scores" and indices from the second one to the end (i.e. not itself)
The code I have at the moment is:
import h5py
import numpy as np

def get_related(tfidf, idx):
    ''' return the top documents '''
    # calculate inner product
    v = np.inner(tfidf, tfidf[idx].transpose())
    # sort
    vs = np.sort(v.toarray(), axis=0)[::-1]
    scores = vs[1:,]
    # sort indices
    vi = np.argsort(v.toarray(), axis=0)[::-1]
    idxs = vi[1:,]
    return (scores, idxs)
where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.
This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results then have to be reversed.
- Can this be done more efficiently?
- Can this be done without converting the sparse matrix using toarray()?
Comments (1)
I don't think there's any real need to skip the toarray. The v array will be only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. Also, it will be quite dense, since any term shared by two documents will give them a non-zero similarity. Sparse matrix representations only pay off when the matrix you're storing is very sparse (I've seen >80% figures for Matlab and assume that Scipy will be similar, though I don't have an exact figure).

The double sort can be skipped by doing the following.
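A minimal sketch of that idea, assuming v has already been flattened to a dense 1-D array of scores (e.g. v = v.toarray().ravel()):

order = np.argsort(v)[::-1]   # argsort once; reverse for descending order
idxs = order[1:]              # drop the first entry (the document itself)
scores = v[idxs]              # reuse the same order for the scores

Indexing v with the argsort result gives the sorted scores directly, so the separate np.sort call is never needed.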
Btw., your use of np.inner on sparse matrices is not going to work with the latest versions of NumPy; the safe way of taking an inner product of two sparse matrices is to use the sparse matrices' own multiplication.
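For example, a sketch assuming tfidf is a scipy.sparse matrix (such as csr_matrix), for which * means matrix multiplication:

# (n_docs x n_terms) * (n_terms x 1) -> (n_docs x 1), still sparse
v = tfidf * tfidf[idx].T
# or, equivalently:
v = tfidf.dot(tfidf[idx].T)
# densify only the small per-document score vector
v = v.toarray().ravel()

Putting the two parts together, get_related could look roughly like this (again a sketch, with numpy imported as np):

def get_related(tfidf, idx):
    ''' return (scores, idxs) of documents most similar to document idx '''
    # sparse product instead of np.inner; densify only the length-n_docs result
    v = (tfidf * tfidf[idx].T).toarray().ravel()
    order = np.argsort(v)[::-1]   # one argsort, reversed for descending order
    idxs = order[1:]              # drop the document itself
    return v[idxs], idxs

This keeps the big tf-idf matrix sparse throughout and only converts the length-n_docs score vector, which, as noted above, is cheap.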