Mahout log-likelihood similarity metric behavior
The problem I'm trying to solve is finding the right similarity metric, rescorer heuristic and filtration level for my data. (By 'filtration level' I mean the number of ratings that a user or item must have associated with it to make it into the production database.)
Setup
I'm using Mahout's Taste collaborative filtering framework. My data comes in the form of (user, item, rating) triplets, where each rating is in the set {1,2,3,4,5}. I'm using an item-based recommender atop a log-likelihood similarity metric. I filter users who have rated fewer than 20 items out of the production dataset. RMSE looks good (1.17ish) and there is no data capping going on, but there is an odd behavior that is undesirable and borders on looking like an error.
Question
First Call -- Generate a 'top items' list with no info from the user. To do this I use what I call a Centered Sum:
for each item i:
    sum[i] = 0
    for each rating r of i:
        sum[i] += r - center
where center = (5+1)/2 = 3, if you allow ratings on a scale of 1 to 5, for example
I use a centered sum instead of average ratings to generate a top items list mainly because I want the number of ratings that an item has received to factor into the ranking.
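As a minimal sketch of the centered sum described above, written in Python rather than Mahout's Java and assuming the data is a plain list of (user, item, rating) triplets: the function names and the sample ratings are made up for illustration and are not part of Mahout.

```python
from collections import defaultdict


def centered_sums(triples, low=1, high=5):
    """Return {item: sum of (rating - center)} for (user, item, rating) triples."""
    center = (low + high) / 2  # 3.0 on a 1-to-5 scale
    sums = defaultdict(float)
    for _user, item, rating in triples:
        sums[item] += rating - center
    return dict(sums)


def top_items(triples, n=10):
    """Rank items by centered sum, highest first, and keep the first n."""
    sums = centered_sums(triples)
    return sorted(sums, key=lambda item: sums[item], reverse=True)[:n]


# Hypothetical sample data: A has three good ratings, B a single 5, C is mediocre.
ratings = [
    ("u1", "A", 5), ("u2", "A", 4), ("u3", "A", 4),
    ("u1", "B", 5),
    ("u2", "C", 2), ("u3", "C", 3),
]
print(top_items(ratings, n=2))
```

In this toy data, A outranks B even though B has the higher average, because A has more above-center ratings. That is the effect the centered sum is meant to capture.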
Second Call -- I ask for 9 similar items for each of the top items returned in the first call. For every top item, 7 out of the 9 similar items returned are the same as those in the similar-items sets returned for the other top items!
Is it about time to try some rescoring? Maybe multiplying the similarity of two games by (number of co-ratings)/x, where x is a tuned constant (around 50 or so to begin with).
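The rescoring idea could be sketched roughly as follows, again in plain Python and not as a Mahout Rescorer implementation. The function name is hypothetical, and the optional cap at 1.0 is an added assumption (not in the proposal above) so that heavily co-rated pairs are not boosted beyond their raw similarity.

```python
def damped_similarity(similarity, co_ratings, x=50, cap=True):
    """Scale a similarity by co_ratings / x.

    similarity  raw similarity between two items
    co_ratings  number of users who rated both items
    x           tuning constant (50 is only the starting guess from the text)
    cap         assumption: limit the weight to 1.0 so the score is never boosted
    """
    weight = co_ratings / x
    if cap:
        weight = min(1.0, weight)
    return similarity * weight


# A pair supported by only 10 co-ratings is pushed down;
# a pair with 200 co-ratings keeps its raw similarity when capped.
print(damped_similarity(0.9, 10))
print(damped_similarity(0.9, 200))
```

Without the cap (cap=False), pairs with more than x co-ratings end up with scores above their raw similarity, which may or may not be what you want.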
Thanks in advance fellas
Comments (1)
You are asking for 50 items similar to some item X. Then you look for 9 similar items for each of those 50. And most of them are the same. Why is that surprising? Similar items ought to be similar to the same other items.
What's a "centered" sum? Ranking by sum rather than by average still gives you a relatively similar output if the number of ratings in the sum is roughly similar for each item.
What problem are you trying to solve? None of this seems to have a bearing on the recommender system that you describe using and that works. Log-likelihood similarity is not even based on ratings: it only looks at which users have expressed a preference for each item, not at the rating values.
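To make the last point concrete, here is a rough Python sketch of a log-likelihood ratio similarity computed from co-occurrence counts alone. It follows the entropy-based formulation that, as far as I understand it, Mahout's implementation uses, including the final 1 - 1/(1 + LLR) mapping; treat those details as assumptions and check them against the Mahout source. All function names here are illustrative.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy: xlogx(total) - sum of xlogx(count)
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 table: both items, only A, only B, neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp tiny negative rounding errors


def llr_similarity(users_a, users_b, num_users):
    """Similarity of two items from the SETS of users who rated each one.

    Note that no rating value appears anywhere in this function.
    """
    both = len(users_a & users_b)
    only_a = len(users_a) - both
    only_b = len(users_b) - both
    neither = num_users - both - only_a - only_b
    llr = log_likelihood_ratio(both, only_a, only_b, neither)
    return 1.0 - 1.0 / (1.0 + llr)


# Two items rated by exactly the same 3 of 6 users: strongly associated.
print(llr_similarity({1, 2, 3}, {1, 2, 3}, num_users=6))
```

Because the inputs are only sets of user IDs, two items rated by the same users get the same similarity whether those users gave them 1s or 5s.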