Recommendation algorithm (and implementation) for finding similar items and users

Posted on 2024-12-28 02:09:37

I have a database of about 700k users along with items they have watched/listened to/read/bought/etc.
I would like to build a recommendation engine that recommends new items based on what users with similar taste in things have enjoyed, as well as actually finding people the user might want to be friends with on a social network I'm building (similar to last.fm).

My requirements are as follows:

The majority of the "users" in my database aren't actually users of my website. They have been data mined from third-party sources. However, when recommending users, I would like to limit the search to people who are members of my website (while still taking advantage of the bigger data set).
I need to take multiple items into consideration. Not "people who like this one item you enjoyed...", but "people who like most of the items you enjoyed...".
I need to compute similarities between users and show them when viewing their profiles (taste-o-meter).
Some items are rated, others are not. Ratings are from 1-10, not boolean values. In most cases it would be possible to infer a rating value from other stats if it's not present (e.g. if the user has favourited an item, but hasn't rated it, I could just assume a rating of 9).
It has to interact with Python code in one way or another. Preferably, it should use a separate (possibly NoSQL) database and expose an API to use in my web back-end. The project I'm making uses Pyramid and SQLAlchemy.
I would like to take item genres into account.
I would like to display similar items on item pages based on both its genre (possibly tags) and what users who enjoyed the item liked (like Amazon's "people who bought this item" and Last.fm artist pages). Items from different genres should still be shown, but have a lower similarity value.
I would prefer a well-documented implementation of an algorithm with some examples.

Please don't give an answer like "use pysuggest or mahout", since those implement a plethora of algorithms and I'm looking for one that's most suitable for my data/use. I've been interested in Neo4j and how it all could be expressed as a graph of connections between users and items.

画尸师 2025-01-04 02:09:37

To determine the similarity between users, you can run cosine or Pearson similarity (found in Mahout and pretty much everywhere on the net) across the user vectors. So your data representation should look something like

 u1  [1,2,3,4,5,6] 
 u2  [35,24,3,4,5,6] 
 u3  [35,3,9,2,1,11]

Where you want to take multiple items into consideration, you can use the above to determine how similar two users' profiles are. The higher the correlation score, the more likely it is that they have very similar items. You can set a threshold, so that, for example, users with a similarity of at least .75 are treated as having a similar set of items in their profiles.
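As a rough illustration of the similarity measures mentioned above, here is a minimal pure-Python sketch. The function names (`cosine`, `pearson`, `similar_users`) and the label `u3` for the third example vector are assumptions made for this example, and the vectors are assumed to be dense and aligned, so that position i refers to the same item for every user.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / sqrt(var_a * var_b)

# The example vectors from the answer; "u3" is an assumed label.
ratings = {
    "u1": [1, 2, 3, 4, 5, 6],
    "u2": [35, 24, 3, 4, 5, 6],
    "u3": [35, 3, 9, 2, 1, 11],
}

def similar_users(target, ratings, threshold=0.75, metric=cosine):
    """Return (user, score) pairs at or above the threshold, best first."""
    scores = [
        (user, metric(ratings[target], vector))
        for user, vector in ratings.items()
        if user != target
    ]
    matches = [pair for pair in scores if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

Calling `similar_users("u2", ratings)` compares u2 against every other user and keeps only those at or above the threshold. With 700k users a full pairwise scan like this would probably be too slow, so the candidate set (for example, site members only) would need to be narrowed first.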

Where values are missing, you can of course fill in your own. I'd just keep them binary and try blending several different algorithms. That's called an ensemble.
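One possible reading of this advice, sketched under stated assumptions: fill a missing rating from an implicit signal (the value 9 for a favourited item comes from the question), compute one similarity on the binary "has this item" data and one on the ratings, and blend the two with a weight. All names here (`effective_rating`, `jaccard`, `cosine_sparse`, `blended_similarity`, `weight_binary`) are hypothetical, and a simple weighted average is only one of many ways to build an ensemble.

```python
from math import sqrt

ASSUMED_FAVOURITE_RATING = 9  # assumption taken from the question's example

def effective_rating(rating, favourited):
    """Use the explicit rating if present, else an assumed one for favourites."""
    if rating is not None:
        return rating
    return ASSUMED_FAVOURITE_RATING if favourited else None

def jaccard(items_a, items_b):
    """Binary similarity: shared items divided by all items of either user."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine_sparse(ratings_a, ratings_b):
    """Cosine similarity over {item: rating} dicts; absent items count as 0."""
    shared = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in ratings_a.values()))
    norm_b = sqrt(sum(v * v for v in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def blended_similarity(ratings_a, ratings_b, weight_binary=0.5):
    """Weighted average of the binary and the rating-based similarity."""
    binary = jaccard(ratings_a, ratings_b)
    rated = cosine_sparse(ratings_a, ratings_b)
    return weight_binary * binary + (1 - weight_binary) * rated
```

The weight would normally be tuned against held-out data rather than fixed at 0.5.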

Overall, you are looking for something called item-based collaborative filtering. It covers the recommendation side of your setup and can also be used to identify similar items. It's a standard recommendation algorithm that does pretty much everything you've asked for.
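To make the idea concrete, here is a small sketch of item-based collaborative filtering in plain Python. The sample data and the function names are invented for illustration. It uses cosine similarity between item columns and a similarity-weighted average for prediction, which is one common variant and not the only one.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical sample data: {user: {item: rating on a 1-10 scale}}
ratings = {
    "alice": {"A": 9, "B": 8, "C": 2},
    "bob": {"A": 8, "B": 9, "D": 7},
    "carol": {"B": 3, "C": 9, "D": 2},
}

def invert(ratings):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    by_item = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            by_item[item][user] = rating
    return dict(by_item)

def cosine_sparse(a, b):
    """Cosine similarity over sparse dict vectors; missing keys count as 0."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def item_similarities(ratings):
    """Compute a symmetric item-to-item similarity table."""
    by_item = invert(ratings)
    items = list(by_item)
    sims = defaultdict(dict)
    for index, x in enumerate(items):
        for y in items[index + 1:]:
            score = cosine_sparse(by_item[x], by_item[y])
            if score > 0:
                sims[x][y] = score
                sims[y][x] = score
    return sims

def recommend(user, ratings, sims, top_n=5):
    """Predict ratings for unseen items as a similarity-weighted average."""
    seen = ratings[user]
    totals = defaultdict(float)
    weights = defaultdict(float)
    for item, rating in seen.items():
        for other, score in sims.get(item, {}).items():
            if other in seen:
                continue  # only recommend items the user has not rated
            totals[other] += score * rating
            weights[other] += score
    scored = [(item, totals[item] / weights[item]) for item in totals]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

sims = item_similarities(ratings)
```

The same `sims` table can back a "similar items" list on an item page: `sorted(sims["A"].items(), key=lambda pair: pair[1], reverse=True)` ranks the neighbours of item A. Genre or tag overlap could be mixed into the score as an extra term, but that is not shown here.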

When trying to find similar users, you can apply a similarity metric of this kind across your user vectors.

Regarding Python, the book Programming Collective Intelligence gives all of its samples in Python, so go pick up a copy and read the chapter on making recommendations (Chapter 2).

Representing all of this as a graph will be somewhat problematic, as your underlying representation is a bipartite graph. There are lots of recommendation approaches out there that use a graph-based approach, but it's generally not the best-performing one.


洋洋洒洒 2025-01-04 02:09:37

Actually, that is one of the sweet spots of a graph database like Neo4j.

So if your data model looks like this:

user -[:LIKE|:BOUGHT]-> item

You can easily get recommendations for a user with a Cypher statement like this:

start user = node:users(id="doctorkohaku")
match user -[r:LIKE]->item<-[r2:LIKE]-other-[r3:LIKE]->rec_item
where r.stars > 2 and r2.stars > 2 and r3.stars > 2
return rec_item.name, count(*) as cnt, avg(r3.stars) as rating
order by rating desc, cnt desc limit 10

This can also be done using the Neo4j Core-API or the Traversal-API.

Neo4j has a Python API that is also able to run Cypher queries.

Disclaimer: I work for Neo4j

There are also some interesting articles by Marko Rodriguez about collaborative filtering.
