Generating 'neighbours' for users based on ratings
I'm looking for techniques to generate 'neighbours' (people with similar taste) for users on a site I am working on; something similar to the way last.fm works.
Currently, I have a compatibility function for users which could come into play. It ranks users on having 1) rated similar items and 2) rated those items similarly. The function weights point 2 higher; this would be the most important factor if I had to use only one of them when generating 'neighbours'.
One idea I had was to simply calculate the compatibility of every combination of users and select the highest-scoring users as the neighbours for each user. The downside of this is that as the number of users goes up, this process could take a very long time. For just 1000 users, it needs 1000C2 (0.5 * 1000 * 999 == 499,500) calls to the compatibility function, which could also be very heavy on the server.
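The brute-force approach above can be sketched as follows. This is only an illustration: the `compatibility` function here is a placeholder standing in for your own (it counts shared items and weights similarity of the actual scores more heavily, per points 1 and 2), and ratings are assumed to be stored as dicts of item to score.

```python
from itertools import combinations

def compatibility(a, b):
    """Placeholder for your compatibility function.
    Point 2 (rated similarly) is weighted higher than
    point 1 (rated the same items)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    closeness = sum(1.0 / (1.0 + abs(a[i] - b[i])) for i in shared)
    return 0.3 * len(shared) + 0.7 * closeness

def brute_force_neighbours(ratings, k=3):
    """O(n^2): score every pair once, keep the top-k per user."""
    scores = {u: [] for u in ratings}
    for u, v in combinations(ratings, 2):      # nC2 pairs
        s = compatibility(ratings[u], ratings[v])
        scores[u].append((s, v))
        scores[v].append((s, u))
    return {u: [v for _, v in sorted(pairs, reverse=True)[:k]]
            for u, pairs in scores.items()}

ratings = {
    "alice": {"song1": 5, "song2": 3},
    "bob":   {"song1": 5, "song2": 2},
    "carol": {"song3": 4},
}
neighbours = brute_force_neighbours(ratings, k=1)
```

This is exactly the O(n^2) cost described above; the answers below discuss ways to avoid it.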
So I am looking for any advice, links to articles, etc. on how best to build a system like this.
Be sure to look at Collaborative Filtering. Many recommendation systems use collaborative filtering to suggest items to users. They do it by finding 'neighbors' and then suggesting items your neighbors rated highly but you haven't rated. You could go as far as finding neighbors, and who knows, maybe you'll want recommendations in the future.
GroupLens is a research lab at the University of Minnesota that studies collaborative filtering techniques. They have a ton of published research as well as a few sample datasets.
The Netflix Prize is a competition to determine who can most effectively solve this sort of problem. Follow the links off their LeaderBoard. A few of the competitors share their solutions.
As far as a computationally inexpensive solution, you could try this:
This method wouldn't be as accurate but it's fast.
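As a rough illustration of the collaborative-filtering step described above (a hand-rolled sketch, not any particular library's API, and assuming the neighbours have already been found): score the items your neighbours rated highly but you haven't rated.

```python
def recommend(ratings, user, neighbours):
    """Suggest items the user's neighbours rated but the user
    hasn't, ranked by the neighbours' average score."""
    seen = set(ratings[user])
    candidates = {}
    for n in neighbours:
        for item, score in ratings[n].items():
            if item not in seen:
                candidates.setdefault(item, []).append(score)
    averaged = {item: sum(s) / len(s) for item, s in candidates.items()}
    return sorted(averaged, key=averaged.get, reverse=True)

ratings = {
    "you": {"song1": 5},
    "ann": {"song1": 5, "song2": 4, "song3": 1},
    "ben": {"song1": 4, "song2": 5},
}
# Assuming 'ann' and 'ben' were already identified as your neighbours:
suggestions = recommend(ratings, "you", ["ann", "ben"])
```

A real system would weight each neighbour's scores by their compatibility rather than averaging them equally.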
Cheers.
It looks like you need to read about clustering algorithms. The general idea is that instead of comparing every point with every other point each time, you divide them into clusters of similar points. The neighbourhood may then be all the points in the same cluster. The number/size of the clusters is usually a parameter of the clustering algorithm.
You can find a video about clustering in Google's series on cluster computing and MapReduce.
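To make the clustering idea concrete, here is a minimal k-means sketch in pure Python (assuming users have already been mapped to numeric points, which is the hard part discussed in other answers): each user's neighbourhood is then just the other members of their cluster.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Tiny k-means: returns a cluster index for each point."""
    centroids = points[:k]                      # naive initialisation
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    return labels

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = kmeans(points, k=2)
# Users 0 and 1 end up in one cluster, users 2 and 3 in the other.
```

In production you would use a library implementation with proper initialisation rather than this sketch.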
In the book Programming Collective Intelligence
http://oreilly.com/catalog/9780596529321
Chapter 2, "Making Recommendations", does a really good job of outlining methods of recommending items to people based on similarities between users. You could use the similarity algorithms to find the 'neighbours' you are looking for. The chapter is available on Google Book Search here:
http://books.google.com/books?id=fEsZ3Ey-Hq4C&printsec=frontcover
What you need is a clustering algorithm, which would automatically group similar users together. The first difficulty that you are facing is that most clustering algorithms expect the items they cluster to be represented as points in a Euclidean space. In your case, you don't have the coordinates of the points. Instead, you can compute the value of the "similarity" function between pairs of them.
One good possibility here is to use spectral clustering, which needs precisely what you have: a similarity matrix. The downside is that you still need to compute your compatibility function for every pair of points, i.e. the algorithm is O(n^2).
If you absolutely need an algorithm faster than O(n^2), then you can try an approach called dissimilarity spaces. The idea is very simple. You invert your compatibility function (e.g. by taking its reciprocal) to turn it into a measure of dissimilarity or distance. Then you compare every item (user, in your case) to a set of prototype items, and treat the resulting distances as coordinates in a space. For instance, if you have 100 prototypes, then each user would be represented by a vector of 100 elements, i.e. by a point in 100-dimensional space. Then you can use any standard clustering algorithm, such as K-means.
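The dissimilarity-space idea might be sketched like this. The `compatibility` function here is a toy stand-in for your own (users are just numbers for illustration); the reciprocal turns it into a distance, and each user's distances to the prototypes become its coordinates.

```python
import random

def compatibility(a, b):
    """Toy stand-in for your compatibility function (higher = more alike)."""
    return 1.0 / (1.0 + abs(a - b))

def embed(users, num_prototypes, seed=0):
    """Map each user to a vector of distances to random prototypes.
    Cost: len(users) * num_prototypes compatibility calls, i.e. O(n)."""
    rng = random.Random(seed)
    prototypes = rng.sample(users, num_prototypes)
    # Invert compatibility (reciprocal) to get a dissimilarity/distance
    return [[1.0 / compatibility(u, p) for p in prototypes] for u in users]

users = [1, 2, 50, 51]
vectors = embed(users, num_prototypes=2)
# Each user is now a 2-dimensional point; similar users (1 and 2,
# 50 and 51) land close together, so K-means can cluster them.
```

The embedding vectors can then be fed to the k-means step from the clustering answer above, or to any library implementation.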
The question now is how you choose the prototypes, and how many you need. Various heuristics have been tried; however, here is a dissertation which argues that choosing prototypes randomly may be sufficient. It shows experiments in which using 100 or 200 randomly selected prototypes produced good results. In your case, if you have 1000 users and you choose 200 of them as prototypes, you would need to evaluate your compatibility function 200,000 times, an improvement of a factor of 2.5 over comparing every pair. The real advantage, though, is that for 1,000,000 users, 200 prototypes would still be sufficient, and you would need to make 200,000,000 comparisons rather than 500,000,000,000, an improvement of a factor of 2500. What you get is an O(n) algorithm, which is better than O(n^2) despite a potentially large constant factor.
This seems to be a 'classification problem'. Yes, there are many solutions and approaches.
To start exploring, check this:
http://en.wikipedia.org/wiki/Statistical_classification
Have you heard of Kohonen networks?
It's a self-organising learning algorithm that clusters similar variables into similar slots. Although most sites, like the one I linked to, display the net as two-dimensional, there is little involved in extending the algorithm into a multi-dimensional hypercube.
With such a data structure, finding and storing neighbours with similar tastes is trivial, as similar users should be stored in similar locations (almost like a reverse hash code).
This reduces your problem to one of finding the variables that define similarity and establishing distances between possible enumerated values; for example, classical and acoustic are close together, while death metal and reggae are quite distant (at least in my opinion).
By the way, in order to find good dividing variables, the best algorithm is a decision tree. The nodes closer to the root will be the most important variables for establishing 'closeness'.
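A toy sketch of the self-organising idea (a 1-D map over scalar inputs, not a full Kohonen implementation): units compete for each input, and the winner plus its immediate neighbours are pulled toward it, so similar inputs end up mapped to the same or nearby slots.

```python
import random

def train_som(data, num_units=5, epochs=50, lr=0.3, seed=0):
    """Train a tiny 1-D self-organising map over scalar inputs."""
    rng = random.Random(seed)
    units = [rng.random() for _ in range(num_units)]  # random weights
    for _ in range(epochs):
        for x in data:
            # Winner: the unit whose weight is closest to the input
            w = min(range(num_units), key=lambda i: abs(units[i] - x))
            # Pull the winner and its immediate neighbours toward the input
            for i in (w - 1, w, w + 1):
                if 0 <= i < num_units:
                    units[i] += lr * (x - units[i])
    return units

def slot(units, x):
    """Similar inputs map to the same or adjacent slots."""
    return min(range(len(units)), key=lambda i: abs(units[i] - x))

units = train_som([0.1, 0.12, 0.9, 0.88])
```

Users with similar taste vectors would then hash to nearby slots, which is the "reverse hash code" property described above.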
Concerns over performance can be greatly mitigated if you consider this as a build/batch problem rather than a realtime query.
The graph can be statically computed, then updated periodically (e.g. hourly, daily) to generate edges and storage optimized for the runtime query, e.g. the top 10 similar users for each user.
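The batch shape might be sketched like this (the `build_neighbour_table` helper and the toy compatibility function are purely illustrative): an offline job recomputes the top-k lists on a schedule, and the runtime query is reduced to a cheap lookup.

```python
def build_neighbour_table(user_ids, compatibility, k=10):
    """Batch job, run e.g. hourly: precompute top-k neighbours per user."""
    table = {}
    for u in user_ids:
        scored = [(compatibility(u, v), v) for v in user_ids if v != u]
        table[u] = [v for _, v in sorted(scored, reverse=True)[:k]]
    return table

# Toy compatibility: closer ids count as more alike
toy_compat = lambda a, b: -abs(a - b)
table = build_neighbour_table([1, 2, 3, 4], toy_compat, k=2)
# At request time, looking up a user's neighbours is O(1): table[user]
```

The table would normally be written to a database or cache rather than held in memory, and only users whose ratings changed since the last run need rescoring.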
+1 for Programming Collective Intelligence too - it is very informative - wish it wasn't (or I was!) as Python-oriented, but still good.