Lucene: producing elaborated output data by adding IR information
I need to process a database in order to add meta-information, such as tf-idf weights, to the documents' terms.
After that, I need to create document pairs with similarity measures such as tf-idf cosine similarity, etc.
I'm planning to use Apache Lucene for this task. I'm not actually interested in retrieval or in running queries, but in indexing the data and processing it to generate an output file with the above-mentioned document pairs and their similarity scores. The next step would be to pass these results to a Weka classifier.
Can I do this easily with Lucene?
Thanks
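To make the target output concrete, here is a minimal plain-Java sketch of the computation described above: tf-idf weights per term, then cosine similarity for every document pair, emitted as one line per pair. It does not use Lucene at all. The class name `TfIdfPairs`, the naive tokenizer, and the textbook weighting tf * ln(N / df) are assumptions made for illustration, and Lucene's own scoring formulas differ from this one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, not part of Lucene or Weka.
public class TfIdfPairs {

    /** Builds one sparse tf-idf vector (term -> weight) per document. */
    public static List<Map<String, Double>> tfIdfVectors(List<String> docs) {
        List<Map<String, Integer>> termCounts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> counts = new HashMap<>();
            // Naive tokenizer (assumption): lowercase, split on non-alphanumerics.
            for (String tok : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!tok.isEmpty()) {
                    counts.merge(tok, 1, Integer::sum);
                }
            }
            termCounts.add(counts);
            // Each distinct term counts once per document for document frequency.
            for (String term : counts.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> counts : termCounts) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                // Textbook weighting (assumption): raw tf * ln(N / df).
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    /** Cosine similarity of two sparse vectors; 0.0 if either has zero norm. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns one "i,j,similarity" line for every unordered document pair. */
    public static List<String> pairLines(List<String> docs) {
        List<Map<String, Double>> v = tfIdfVectors(docs);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < v.size(); i++) {
            for (int j = i + 1; j < v.size(); j++) {
                lines.add(i + "," + j + "," + cosine(v.get(i), v.get(j)));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("apple banana", "apple banana cherry", "cherry date");
        pairLines(docs).forEach(System.out::println);
    }
}
```

The all-pairs loop is quadratic in the number of documents, so for a large database some form of candidate filtering would probably be needed before scoring.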
Comments (1)
Try Integrating Apache Mahout with Apache Lucene and Solr, and read "Weka" wherever it says "Mahout". Good luck.
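For the hand-off to Weka, one option is to write the pair scores in ARFF, the text format Weka reads natively. The following is only a sketch: the class name `PairArffWriter`, the relation name, and the attribute names are hypothetical, and a real classification task would also need a class attribute, which is omitted here because the question does not say what the labels are.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper class; it only emits text and does not call Weka.
public class PairArffWriter {

    /**
     * Writes one ARFF data row per document pair.
     * pairs[i] holds the two document ids, scores[i] their similarity.
     */
    public static void write(Writer out, int[][] pairs, double[] scores) {
        PrintWriter w = new PrintWriter(out);
        // ARFF header: relation name, then one @attribute line per column.
        w.print("@relation doc_pairs\n");
        w.print("@attribute doc_a numeric\n");
        w.print("@attribute doc_b numeric\n");
        w.print("@attribute cosine numeric\n");
        // Data section: comma-separated values in attribute order.
        w.print("@data\n");
        for (int i = 0; i < pairs.length; i++) {
            w.print(pairs[i][0] + "," + pairs[i][1] + "," + scores[i] + "\n");
        }
        w.flush();
    }

    public static void main(String[] args) {
        StringWriter sw = new StringWriter();
        write(sw, new int[][] {{0, 1}, {0, 2}}, new double[] {0.5, 0.0});
        System.out.print(sw);
    }
}
```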