Measuring similarity among sets of documents
For illustration purposes, let's assume this is a forum service. I need to calculate the "similarity" among each user's posts, so that the result would be something like:
among posts by user A, similarity 60%
among posts by user B, similarity 20%
...
I'm dealing with multibyte strings, so I guess I'm stuck with search engines here. We already use Solr and already have moreLikeThis implemented, but I'm not quite sure how to construct the query. Any help is appreciated!
Comments (3)
Carrot2 might interest you (and this blog related to it).
This is a strange question in two ways: 1. Why do you have to use Solr? 2. The kind of similarity depends on the target problem. Your question sounds too generic to me. There is ongoing research in the area of semantic similarity. There are also edit-distance algorithms, which are probably not what you want.
So define your question more precisely and you will get better answers.
There are several measures of similarity; a simple and effective one is cosine similarity.
There are more sophisticated ones, such as Smith-Waterman.
Have a look at http://sourceforge.net/projects/simmetrics/
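
To make the cosine similarity idea concrete, here is a minimal sketch in plain Python; it does not use Solr or SimMetrics. It rests on a few assumptions of mine rather than anything stated in the question: posts are compared as character bigram counts (a common fallback for multibyte text that has no whitespace between words), and "similarity among posts by a user" is read as the mean cosine similarity over all pairs of that user's posts. The function names and the sample posts are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character n-grams; works on any Unicode string."""
    text = "".join(text.split())  # drop whitespace so spacing does not matter
    if len(text) < n:
        # Too short for a full n-gram: treat the whole string as one token.
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty post is similar to nothing
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(posts, n=2):
    """Average cosine similarity over every pair of posts (assumed metric)."""
    vectors = [char_ngrams(post, n) for post in posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two posts: nothing to compare
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical sample data, only to show the output shape from the question.
posts_by_user = {
    "A": ["今日は天気がいい", "今日は天気が悪い", "明日は天気がいい"],
    "B": ["Solr の設定方法", "猫の写真を撮った"],
}

for user, posts in posts_by_user.items():
    score = mean_pairwise_similarity(posts)
    print(f"among posts by user {user}, similarity {score:.0%}")
```

Comparing every pair is quadratic in the number of posts per user, so for prolific users it may be worth sampling pairs or comparing each post against a per-user centroid vector instead.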