Jaccard similarity in Lucene
I need to calculate the similarity between a query and a document in Lucene using Jaccard similarity over n-grams. Since Jaccard similarity is a very common measure in IR, I expected to find a Lucene implementation of it, but I couldn't.
Is anyone aware of such an implementation?
Comments (2)
The only implementation I'm aware of that can be easily integrated with Lucene is the one from LingPipe (please note that it is free only for non-commercial/research use). Here is a blog post showing how to use it in LingPipe. A detailed explanation of how to connect the two libraries is available on the LingPipe website and in this book.
However, I haven't evaluated whether it would be easier (also from a licensing point of view) to integrate some other implementation on your own; this is just a solution that worked for me.
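As a rough illustration of what rolling your own implementation could involve, here is a minimal sketch in plain Java of Jaccard similarity over character n-grams. It does not use Lucene or LingPipe. The class and method names (`NGramJaccard`, `charNGrams`, `jaccard`) are made up for this example, and the choice of character trigrams and of returning 1.0 for two empty sets are assumptions, not something the libraries above prescribe.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration; not part of Lucene or LingPipe.
public class NGramJaccard {

    // Collect the set of distinct character n-grams of the given text.
    public static Set<String> charNGrams(String text, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard similarity: |A intersect B| / |A union B|.
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumed convention: two empty sets count as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<String> query = charNGrams("lucene", 3);    // luc, uce, cen, ene
        Set<String> document = charNGrams("lucent", 3); // luc, uce, cen, ent
        System.out.println(jaccard(query, document));   // 3 shared / 5 total
    }
}
```

To use something like this with Lucene, the n-gram sets would presumably have to be built from the query text and from the stored or re-analyzed document text, which is the integration work the libraries mentioned above are meant to save.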
Try this library: http://sourceforge.net/projects/simmetrics/. It offers many more similarity functions. However, I would recommend SoftTFIDF from http://secondstring.sourceforge.net/, which has the best precision/recall according to "A Comparison of String Distance Metrics for Name-Matching Tasks" by William W. Cohen et al.