Getting the vector space model (tf-idf) from a Lucene index query
I need to get the vector space model (with tf-idf weighting) from the results of a Lucene query, and can't figure out how to do it. It seems like it should be simple, and at this stage maybe one of you can point me in the right direction.
I have been trying to figure this out for a good while, and either I haven't yet grasped that what I have read is what I need (more than likely), or a solution to my particular problem hasn't been posted. I even tried computing the VSM myself directly from the query results, but my solution has hideous complexity.
Edit: For anyone else who stumbles upon this, there is a solution at the much clearer question here. What I need can be obtained with the IndexReader.getTermFreqVector(int docNumber, String field) method.
Unfortunately this doesn't work for me, as the index I am working from hasn't stored the term frequency vectors, so I guess I'm still looking for more help on this!
Comments (3)
To answer this question: you can compute a TF-IDF weighted vector space model for a set of Lucene results using the IndexReader.getTermFreqVector() and Searcher.docFreq() methods. There is no way of directly getting the VSM for a set of results in Lucene.
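As a rough illustration of the computation described above, here is a minimal sketch in plain Java with no Lucene dependency. It assumes you have already collected the per-document term frequencies (for example via IndexReader.getTermFreqVector()) and the document frequencies (for example via Searcher.docFreq()) into ordinary maps. The class name `TfIdfSketch`, the method `tfIdfVector`, and the textbook weighting tf × ln(N / df) are assumptions made for this example; Lucene's own scoring may weight terms differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: builds a TF-IDF weighted vector for one document. */
final class TfIdfSketch {

    /**
     * @param termFreqs term -> raw frequency of the term in this document
     * @param docFreqs  term -> number of documents in the index containing the term
     * @param numDocs   total number of documents in the index
     * @return term -> tf * ln(numDocs / df), the textbook TF-IDF weight
     */
    static Map<String, Double> tfIdfVector(Map<String, Integer> termFreqs,
                                           Map<String, Integer> docFreqs,
                                           int numDocs) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // no document frequency known for this term: skip it
            }
            double idf = Math.log((double) numDocs / df);
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Made-up counts, standing in for values read from an index.
        Map<String, Integer> tf = new HashMap<>();
        tf.put("lucene", 3);
        tf.put("index", 1);

        Map<String, Integer> df = new HashMap<>();
        df.put("lucene", 10);
        df.put("index", 100);

        System.out.println(tfIdfVector(tf, df, 100));
    }
}
```

Repeating this for every document in the result set gives one sparse vector per hit; a term that occurs in every document gets weight 0 under this formula.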
Maybe I'm misunderstanding what you're trying to do, but Lucene's scoring already uses the vector space model. If you want more detail on how the score is calculated for a given document and query, use Searcher.explain(Query query, int doc).
If I understand correctly from your comment, you want to compute VSM cosine similarity between documents rather than between a query and a document. I don't know exactly how to do this, but I'd point you to the Lucene API page for the Similarity class. You'd probably have to derive and use a custom subclass of Similarity that changes the coord and queryNorm methods, and find a way to turn documents into query objects. (No guarantees; I'm just trying to figure out this scoring myself.)
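For reference, the document-to-document cosine similarity mentioned above can also be computed outside Lucene's scoring once each document is available as a sparse term-weight vector. The following is a minimal sketch in plain Java, not the Similarity-subclass approach suggested in this answer; the class name `CosineSketch` and its methods are hypothetical, and the input maps are assumed to hold term weights (for instance TF-IDF values) that you have already computed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: cosine similarity between two sparse term-weight vectors. */
final class CosineSketch {

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Returns dot(a, b) / (|a| * |b|), or 0 if either vector has zero length. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        // Only terms present in both vectors contribute to the dot product.
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    public static void main(String[] args) {
        // Made-up weights for two documents.
        Map<String, Double> doc1 = new HashMap<>();
        doc1.put("lucene", 2.0);
        doc1.put("vector", 1.0);

        Map<String, Double> doc2 = new HashMap<>();
        doc2.put("lucene", 1.0);
        doc2.put("query", 3.0);

        System.out.println(cosine(doc1, doc2));
    }
}
```

Keeping the vectors as maps means only the terms shared by both documents are visited when computing the dot product, which keeps the cost proportional to the smaller of the two vectors' sizes plus the norm calculations.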