Collaborative filtering: how to get a Pearson score when there isn't enough data
I'm building a recommendation engine using collaborative filtering. For similarity scores, I use a Pearson correlation. This works well most of the time, but sometimes I have users who share only 1 or 2 fields. For example:
User 1{
a: 4
b: 2
}
User 2{
a: 4
b: 3
}
Since this is only 2 data points, a Pearson correlation would always be 1 (a straight line or perfect correlation). This obviously isn't what I want, so what value should I use instead? I could just throw away all instances like this (give a correlation of 0), but my data is really sparse right now and I don't want to lose anything. Is there any similarity score I could use that would fit in with the rest of my similarity scores (all Pearson)?
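To make the problem concrete, here is a minimal sketch that computes the Pearson correlation over the two shared fields from the example. The `pearson` helper is written for illustration only and is not taken from any library.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None  # a single shared rating carries no correlation information
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a user who gave identical ratings has zero variance
    return cov / sqrt(var_x * var_y)

user1 = {"a": 4, "b": 2}
user2 = {"a": 4, "b": 3}

# Only the items both users rated take part in the correlation.
shared = sorted(user1.keys() & user2.keys())
score = pearson([user1[k] for k in shared], [user2[k] for k in shared])
print(score)
```

With only two shared points the magnitude is always 1 whenever the score is defined; only the sign depends on the data.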
You might want to consider using cosine similarity rather than Pearson correlation. It does not suffer from this problem, and is widely used in the recommender systems literature.
The canonical solution to this, described by Herlocker et al. in "Empirical Analysis of Design Choices in Neighborhood-based Collaborative Filtering Algorithms", is to "damp" the Pearson correlation to correct for excessively high correlation between users with small co-rating sets. Basically, you multiply the Pearson correlation by the lesser of 1 and cc/50, where cc is the number of items both users have rated. The effect is that, if they have at least 50 items in common, the similarity is the raw Pearson correlation; otherwise, it is scaled linearly with the number of rated items they have in common. For the two co-rated items in your example, it turns that spurious correlation of 1 into a similarity of 2/50 = 0.04.
50 may need to be adapted based on your domain and system.
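A minimal sketch of that damping step, assuming the raw Pearson score and the co-rating count are already available. The function name `damped_similarity` and the `threshold` parameter are illustrative names, not from the paper.

```python
def damped_similarity(raw_pearson, co_rated, threshold=50):
    """Scale a raw Pearson score by min(1, co_rated / threshold).

    `threshold` is the tunable cutoff (50 in the description above):
    at or above it the raw score is returned unchanged.
    """
    return raw_pearson * min(1.0, co_rated / threshold)

# Two users with a raw correlation of 1.0 and different amounts of overlap.
for cc in (2, 10, 25, 50, 200):
    print(cc, damped_similarity(1.0, cc))
```

Making the cutoff a parameter keeps it easy to tune per domain.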
You can also use cosine similarity, which does not suffer from this limitation in the same way.
For user-user CF, however, Pearson correlation is generally preferred.
Update: In more recent work, we found that cosine similarity was prematurely dismissed for user-based CF. Cosine similarity, when performed on normalized data (subtract the user's mean from each rating prior to computing cosine similarity; the result is very similar to Pearson correlation, except that it has a built-in self-damping term), outperforms Pearson in a "standard" environment. Of course, if possible, you should do some testing on your own data and environment to see what works best. Paper here: http://grouplens.org/node/479
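A rough sketch of cosine similarity on mean-centered ratings. It assumes that each user's mean and vector norm are taken over all of that user's ratings, while the dot product only covers co-rated items; that assumption is one reading of where the self-damping comes from, and `centered_cosine` is an illustrative name.

```python
from math import sqrt

def centered_cosine(ratings_u, ratings_v):
    """Cosine similarity of two users' mean-centered rating dicts (item -> rating)."""
    if not ratings_u or not ratings_v:
        return 0.0
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    mean_v = sum(ratings_v.values()) / len(ratings_v)
    centered_u = {item: r - mean_u for item, r in ratings_u.items()}
    centered_v = {item: r - mean_v for item, r in ratings_v.items()}

    # Dot product: only items rated by both users contribute.
    shared = centered_u.keys() & centered_v.keys()
    dot = sum(centered_u[i] * centered_v[i] for i in shared)

    # Norms: taken over ALL ratings of each user (assumption), so a small
    # overlap relative to the full profiles pulls the score toward 0.
    norm_u = sqrt(sum(x * x for x in centered_u.values()))
    norm_v = sqrt(sum(x * x for x in centered_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

u = {"a": 4, "b": 2, "c": 5, "d": 1}
v = {"a": 4, "b": 3, "e": 1, "f": 5}
print(centered_cosine(u, v))
```

If both users have rated only the shared items, this reduces to plain Pearson correlation, so the damping only appears when the overlap is a small part of each profile.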
Disclaimer: I'm a student in the lab that produced the above-mentioned Herlocker paper.
Yes, Pearson is commonly mentioned in recommender engine writeups, and it works reasonably well, but it has some quirks like this. (By the way, the correlation is 1 in your example, not 0.)
The cosine similarity measure is indeed a good alternative. However, if you "center" the data (shift it so the mean is 0) before computing, and there are reasons you should, then it reduces to the Pearson correlation. So you either end up with similar issues, or you have a different set of issues from not centering.
Consider a Euclidean distance-based similarity metric: similarity is inversely related to distance, where user ratings are viewed as points in space. It doesn't have this sparseness problem, though it needs to be normalized for dimension so that it does not penalize users who co-rate many items and therefore appear far apart, since their distance accumulates along many dimensions.
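One way to sketch this is shown below. It assumes the dimension normalization is done by using the root-mean-square difference over co-rated items, mapped to a similarity with 1 / (1 + distance). Both choices are illustrative; other normalizations are possible.

```python
from math import sqrt

def euclidean_similarity(ratings_u, ratings_v):
    """Similarity in (0, 1] from the RMS rating difference over co-rated items."""
    shared = ratings_u.keys() & ratings_v.keys()
    if not shared:
        return 0.0  # no overlap: nothing to compare
    # Dividing by the number of shared items keeps users with many
    # co-rated items from looking farther apart just because of dimension.
    mean_sq_diff = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in shared) / len(shared)
    return 1.0 / (1.0 + sqrt(mean_sq_diff))

user1 = {"a": 4, "b": 2}
user2 = {"a": 4, "b": 3}
print(euclidean_similarity(user1, user2))
```

Unlike Pearson, two shared ratings here give a score that reflects how close the ratings actually are, not an automatic 1.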
But really, I'd suggest you look at a log-likelihood-based similarity metric. It also doesn't have these issues, and doesn't even need rating values. This is a great default.
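A rough sketch of a log-likelihood-based similarity. It assumes the score comes from Dunning's log-likelihood ratio (G²) on a 2x2 table of which items each user has rated, squashed into [0, 1) with 1 - 1/(1 + LLR). The function names and the squashing step are illustrative choices here, not a statement of how any particular library does it.

```python
from math import log

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of non-negative counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for count, r, c in cells:
        if count > 0:  # empty cells contribute nothing
            expected = rows[r] * cols[c] / total
            g2 += count * log(count / expected)
    return max(0.0, 2.0 * g2)  # clamp tiny negative rounding errors

def log_likelihood_similarity(items_u, items_v, num_items):
    """Similarity from item sets only; rating values are not used."""
    both = len(items_u & items_v)
    only_u = len(items_u - items_v)
    only_v = len(items_v - items_u)
    neither = num_items - both - only_u - only_v
    if neither < 0:
        raise ValueError("num_items is smaller than the number of distinct items")
    llr = log_likelihood_ratio(both, only_u, only_v, neither)
    return 1.0 - 1.0 / (1.0 + llr)

print(log_likelihood_similarity({"a", "b", "c"}, {"a", "b", "c"}, num_items=20))
```

Note that G² measures how surprising the overlap is in either direction, so it is worth checking on your own data that high scores correspond to more overlap than chance rather than less.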
There are more metrics to consider that don't have this issue: Spearman correlation, and similarity based on the Tanimoto distance (Jaccard coefficient).
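For reference, the Tanimoto (Jaccard) coefficient is simple to sketch. It looks only at which items each user rated, not at the rating values; `tanimoto_similarity` is an illustrative name.

```python
def tanimoto_similarity(items_u, items_v):
    """Jaccard coefficient: |intersection| / |union| of two users' item sets."""
    union = items_u | items_v
    if not union:
        return 0.0  # neither user has rated anything
    return len(items_u & items_v) / len(union)

print(tanimoto_similarity({"a", "b", "c"}, {"b", "c", "d"}))
```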
Where can you learn more and get an implementation? Voila, Apache Mahout
I think you should calculate item similarity instead of user similarity, so you could recommend new items to the users that have few rated items.
Thanks for the tips as always, Sean! I agree that LogLikelihood is the best 'default' metric to start off with because it can work with binary and non-binary rating sets and it returns similarity scores in the range (0,1).
In my experience, mapping similarity scores to the range (0,1) is an important property because it avoids having to cap estimated preferences when they are calculated. This is essential if you don't want your best items to be lost among hundreds of lower-scoring items that end up with the same score as the best ones because of capping.