Mahout Log Likelihood similarity metric behavior

Posted on 2024-11-29 11:36:07

The problem I'm trying to solve is finding the right similarity metric, rescorer heuristic, and filtration level for my data. (By 'filtration level' I mean the number of ratings a user or item must have associated with it to make it into the production database.)

Setup
I'm using Mahout's Taste collaborative filtering framework. My data comes in the form of triplets, where each item rating is drawn from the set {1,2,3,4,5}. I'm using an itemBased recommender atop a logLikelihood similarity metric. I filter users who have rated fewer than 20 items out of the production dataset. RMSE looks good (1.17ish) and there is no data capping going on, but there is one odd behavior that is undesirable and borders on being an error.

Question

First Call -- Generate a 'top items' list with no info from the user. To do this I use what I call a Centered Sum:

for i in items
 sum[i] = 0
 for r in i's ratings
  sum[i] += r - center

where center = (5+1)/2 = 3, for example, if ratings are on a scale of 1 to 5

I use a centered sum instead of the average rating to generate the top items list, mainly because I want the number of ratings an item has received to factor into the ranking.
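
To make the centered sum concrete, here is a minimal Python sketch. The function name `centered_sum` and the toy ratings are made up for illustration; they are not from my dataset, and this is not Mahout code.

```python
def centered_sum(ratings, low=1, high=5):
    """Sum of (rating - center), where center is the midpoint of the scale."""
    center = (low + high) / 2  # 3.0 on a 1-to-5 scale
    return sum(r - center for r in ratings)


# Hypothetical toy data: item id -> list of ratings it received.
ratings_by_item = {
    "A": [5, 5],              # mean 5.0, but only two ratings
    "B": [4, 4, 4, 4, 4, 4],  # mean 4.0, six ratings
    "C": [2, 3, 1],           # mostly below the center
}

scores = {item: centered_sum(rs) for item, rs in ratings_by_item.items()}

# Rank items by centered sum, highest first.
top_items = sorted(scores, key=scores.get, reverse=True)
```

With this toy data, item B outranks item A even though A has the higher average, because B has more above-center ratings. That is the effect I want from the rating count.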

Second Call -- I ask for 9 similar items for each of the top items returned in the first call. Whichever top item I query, 7 of the 9 similar items returned are the same as those returned for the other top items!
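
One way to put a number on this is to count the shared entries between every pair of similar-item lists. The sketch below is plain Python; `pairwise_overlap` and the item ids are hypothetical placeholders, not real output from my recommender.

```python
from itertools import combinations


def pairwise_overlap(similar_by_item):
    """For each pair of query items, count the similar items their lists share."""
    overlaps = {}
    for a, b in combinations(sorted(similar_by_item), 2):
        shared = set(similar_by_item[a]) & set(similar_by_item[b])
        overlaps[(a, b)] = len(shared)
    return overlaps


# Hypothetical result of the second call: top item -> its 9 similar items.
common = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
similar_by_item = {
    "top1": common + ["x1", "x2"],
    "top2": common + ["y1", "y2"],
    "top3": common + ["z1", "z2"],
}

overlaps = pairwise_overlap(similar_by_item)
```

In this made-up example every pair of top items shares 7 of its 9 similar items, which matches the pattern I am seeing.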

Is it time to try some rescoring? Maybe multiply the similarity of two games by (number of co-rated items)/x, where x is a tuned constant (around 50 or so to begin with).
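
A rough sketch of the rescoring idea, in plain Python rather than Mahout code. The function `rescore`, the optional `cap` flag, and the candidate tuples are all assumptions for illustration; capping the factor at 1 is an extra option, not part of the idea as stated above.

```python
def rescore(similarity, co_rated, x=50, cap=False):
    """Scale a similarity by co_rated / x.

    With cap=True the factor is limited to 1.0, so pairs with more than
    x co-ratings are left unchanged instead of being boosted.
    """
    factor = co_rated / x
    if cap:
        factor = min(1.0, factor)
    return similarity * factor


# Hypothetical candidates: (item id, raw similarity, number of co-ratings).
candidates = [
    ("g1", 0.99, 5),
    ("g2", 0.90, 40),
    ("g3", 0.95, 100),
]

# Re-rank by the rescored similarity, highest first.
ranked = sorted(candidates, key=lambda c: rescore(c[1], c[2]), reverse=True)
ranked_ids = [item for item, _, _ in ranked]
```

Here g1 has the highest raw similarity but only 5 co-ratings, so it drops to last place after rescoring. Without the cap, pairs with more than x co-ratings get boosted above their raw similarity, which may or may not be what I want.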

Thanks in advance, fellas.
