Collaborative filtering / recommendation system performance and approaches

I'm really interested to find out how people approach collaborative filtering and recommendation engines etc. I mean this more in terms of the performance of the script than anything else. I have started reading Programming Collective Intelligence, which is really interesting but tends to focus more on the algorithmic side of things.

I currently only have 2k users, but my current system is proving to be not future-proof at all and already very taxing on the server. The entire system is based on making recommendations of posts to users. My application is PHP/MySQL, but I use some MongoDB for my collaborative filtering stuff - I'm on a large Amazon EC2 instance. My setup is really a two-step process: first I calculate similarities between items, then I use this information to make recommendations. Here's how it works:

First my system calculates similarities between users' posts. The script runs an algorithm which returns a similarity score for each pair. The algorithm examines information such as common tags, common commenters and common likers, and is able to return a similarity score. The process goes like this:

  • Each time a post is added, has a tag added, or is commented on or liked, I add it to a queue.
  • I process this queue via cron (once a day), finding out the relevant information for each post, e.g. the user_ids of the commenters and likers and the tag_ids. I save this information to MongoDB in this kind of structure: {"post_id":1,"tag_ids":[12,44,67],"commenter_user_ids":[6,18,22],"liker_user_ids":[87,6]}. This allows me to eventually build up a MongoDB collection which gives me easy and quick access to all of the relevant information for when I try to calculate similarities.
  • I then run another cron script (also once a day, but after the previous one) which goes through the queue again. This time, for each post in the queue, I grab its entry from the MongoDB collection and compare it to all of the other entries. When 2 entries have some matching information, I give them +1 in terms of similarity, so in the end I have an overall score for each pair of posts (a sketch of this pairwise scoring follows below). I save the scores to a different MongoDB collection with the following structure: {"post_id":1,"similar":{"23":2,"2":5,"7":2}} ('similar' is a key=>value array with the post_id as key and the similarity score as the value; I don't save a score if it is 0).
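
For illustration, here is a minimal sketch of that +1-per-overlap scoring step, written against plain PHP arrays shaped like the MongoDB documents above (the function and variable names are mine, not the actual system's):

    <?php
    // Score two posts by counting shared tags, commenters and likers;
    // each overlapping id contributes +1, exactly the rule described above.
    function similarityScore(array $a, array $b): int
    {
        $score  = count(array_intersect($a['tag_ids'], $b['tag_ids']));
        $score += count(array_intersect($a['commenter_user_ids'], $b['commenter_user_ids']));
        $score += count(array_intersect($a['liker_user_ids'], $b['liker_user_ids']));
        return $score;
    }

    $postA = ['post_id' => 1,  'tag_ids' => [12, 44, 67],
              'commenter_user_ids' => [6, 18, 22], 'liker_user_ids' => [87, 6]];
    $postB = ['post_id' => 23, 'tag_ids' => [44, 67],
              'commenter_user_ids' => [18], 'liker_user_ids' => [6]];

    echo similarityScore($postA, $postB); // 4: tags 44 and 67, commenter 18, liker 6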

I have 5k posts, so all of the above is quite hard on the server - there's a huge amount of reads and writes to be performed. Now, this is only half the issue. I then use this information to work out which posts would be interesting to a particular user. So, once an hour I run a cron script that calculates 1 recommended post for each user on the site. The process goes like so:

  • The script first decides which type of recommendation the user will get. It's a 50-50 chance of either: 1. a post similar to one of your posts, or 2. a post similar to a post you have interacted with.
  • If 1, then the script grabs the user's post_ids from MySQL, then uses them to grab the similar posts from MongoDB. The script takes the post that is most similar and has not yet been recommended to the user.
  • If 2, the script grabs all of the posts the user has commented on or liked from MySQL and uses their ids to do the same as in 1 above (a sketch of this selection step follows the list).
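
As an illustration of the selection step, here is a sketch against the {"post_id":...,"similar":{...}} layout above; the names are hypothetical and the already-recommended list is assumed to come from MySQL:

    <?php
    // Given the user's source post ids (their own posts for branch 1, or the
    // posts they commented on/liked for branch 2), pick the most similar post
    // that has not been recommended to them yet.
    function pickRecommendation(array $similarByPost, array $sourcePostIds,
                                array $alreadyRecommended): ?int
    {
        $candidates = [];
        foreach ($sourcePostIds as $pid) {
            foreach ($similarByPost[$pid] ?? [] as $otherId => $score) {
                // Keep the best score seen for each candidate post.
                $candidates[$otherId] = max($candidates[$otherId] ?? 0, $score);
            }
        }
        foreach ($alreadyRecommended as $pid) {
            unset($candidates[$pid]);     // never recommend the same post twice
        }
        if ($candidates === []) {
            return null;
        }
        arsort($candidates);              // highest similarity score first
        return array_key_first($candidates);
    }

    $similar = [1 => [23 => 4, 2 => 5, 7 => 2]]; // mirrors the MongoDB doc for post 1
    echo pickRecommendation($similar, [1], [2]); // 23 (post 2 was already recommended)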

Unfortunately the hourly recommendation script is getting very resource intensive and is slowly taking longer and longer to complete... currently 10-15 minutes. I'm worried that at some point I won't be able to provide hourly recommendations anymore.

I'm just wondering if anyone feels I could be approaching this any better?

2 Answers

靖瑶 2025-01-03 08:51:54

With 5000 posts, that's 25,000,000 relationships, and the number of pairs grows as O(n^2).

Your first problem is how you can avoid examining so many relationships every time the batch runs. Using tags or keywords will help with content matching - and you could use date ranges to limit common 'likes'. Beyond that... we'd need to know a lot more about the methodology for establishing relationships.
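
One way to act on the tag suggestion: when scoring a post, query only the posts that share at least one tag with it rather than scanning the whole collection. A sketch assuming the mongodb/mongodb composer library and the document shape from the question (database and collection names are made up):

    <?php
    require 'vendor/autoload.php';

    $client = new MongoDB\Client('mongodb://localhost:27017');
    $posts  = $client->selectCollection('recommender', 'post_features');
    $plain  = ['typeMap' => ['root' => 'array', 'document' => 'array', 'array' => 'array']];

    // A multikey index turns the candidate lookup into a cheap inverted-index hit.
    $posts->createIndex(['tag_ids' => 1]);

    $post = $posts->findOne(['post_id' => 1], $plain);

    // Only posts sharing at least one tag are candidates - typically a tiny
    // fraction of the 5k posts, instead of all of them.
    $candidates = $posts->find([
        'post_id' => ['$ne' => $post['post_id']],
        'tag_ids' => ['$in' => $post['tag_ids']],
    ], $plain);

    foreach ($candidates as $candidate) {
        // ...run the pairwise scoring on this much smaller set...
    }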

Another consideration is when you establish relationships. Why are you waiting until the batch runs to compare a new post with existing data? Certainly it makes sense to handle this asynchronously to ensure that the request is processed quickly - but (other than the restrictions imposed by your platform) why wait until the batch kicks in before establishing the relationships? Use an asynchronous message queue.
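
To make the queue idea concrete, here is a sketch of a consumer that folds a single event into the post's feature document the moment it arrives, so pair scores can be refreshed incrementally instead of in a nightly batch. The function and field names are illustrative, and the actual queue (Beanstalkd, SQS, RabbitMQ, ...) is left out:

    <?php
    // Called by the queue consumer for each message, e.g.
    // ['post_id' => 1, 'kind' => 'liker_user_ids', 'id' => 87].
    function handleEvent(MongoDB\Collection $features, array $event): void
    {
        // Fold the event into the post's feature document; $addToSet keeps
        // the arrays duplicate-free and upsert creates the doc on first touch.
        $features->updateOne(
            ['post_id' => $event['post_id']],
            ['$addToSet' => [$event['kind'] => $event['id']]],
            ['upsert' => true]
        );

        // The post's pair scores could now be refreshed immediately, using
        // the tag-limited candidate query sketched above, rather than waiting
        // for the daily cron to reprocess the whole queue.
    }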

Indeed, depending on how long it takes to process a message, there may even be a case for re-generating cached relationship data when an item is retrieved rather than when it is created.

And if I were writing a platform to measure relationships with data then (the clue is in the name) I'd definitely be leaning towards a relational database where joins are easy and much of the logic can be implemented on the database tier.

It's certainly possible to reduce the length of time the system takes to cross-reference the data. This is exactly the kind of problem map-reduce is intended to address - but the benefits of this mainly come from being able to run the algorithm in parallel across lots of machines; at the end of the day it takes just as many clock ticks.

や莫失莫忘 2025-01-03 08:51:54

Here's how I'd start planning how to do this.
First thing is to possibly get rid of your database technology, or supplement it with a triplestore or graph technology. That should provide some better performance for analyzing similar likes or topics.

Next, yes: get a subset. Take a few interests that the user has and get a small pool of users that have similar interests.

Then build indexes of likes in some sort of meaningful order and count the inversions (divide and conquer - this is pretty similar to merge sort, and you'll want to sort on your way out to count split inversions anyway).
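
For reference, the split-inversion counting alluded to here is the classic merge-sort variant, O(n log n) rather than comparing every pair. A generic sketch of that technique (not code from the answer):

    <?php
    // Sort $a while counting inversions: pairs (i, j) with i < j but $a[i] > $a[j].
    // Returns [sorted array, inversion count].
    function sortAndCount(array $a): array
    {
        $n = count($a);
        if ($n < 2) {
            return [$a, 0];
        }
        [$left,  $invL] = sortAndCount(array_slice($a, 0, intdiv($n, 2)));
        [$right, $invR] = sortAndCount(array_slice($a, intdiv($n, 2)));

        $merged = [];
        $inv = $invL + $invR;
        $i = $j = 0;
        while ($i < count($left) && $j < count($right)) {
            if ($left[$i] <= $right[$j]) {
                $merged[] = $left[$i++];
            } else {
                // $right[$j] overtakes every remaining left element; each of
                // those pairs is one split inversion.
                $inv += count($left) - $i;
                $merged[] = $right[$j++];
            }
        }
        $merged = array_merge($merged, array_slice($left, $i), array_slice($right, $j));
        return [$merged, $inv];
    }

    [, $inversions] = sortAndCount([1, 3, 5, 2, 4, 6]);
    echo $inversions; // 3 inversions: (3,2), (5,2), (5,4)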

I hope that helps - you don't want to compare everything to everything else or it's definitely O(n^2). You should be able to replace that with something somewhere between constant and linear if you take sets of people who have similar likes and use that.

For example, from a graph perspective, take something that they recently liked, look at the in-edges, and then trace them out and just analyze those users. Maybe do this on a few recently liked articles, then find a common set of users from that and use it for the collaborative filtering to find articles the user would likely enjoy. Then you're at a workable problem size - especially in a graph where there is no index growth (although there may be more in-edges to traverse on the article - that just gives you more chance of finding usable data, though).

Even better would be to key the articles themselves, so that if an article was liked by someone you can see articles that they may like based on other users (i.e. Amazon's 'users that bought this also bought').
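
That "also liked" lookup can be sketched with two queries against the feature collection from the question - find who liked the seed post, then count co-likes per other post (all names here are assumptions, not the poster's schema):

    <?php
    require 'vendor/autoload.php';

    $client = new MongoDB\Client('mongodb://localhost:27017');
    $posts  = $client->selectCollection('recommender', 'post_features');
    $plain  = ['typeMap' => ['root' => 'array', 'document' => 'array', 'array' => 'array']];

    // 1. Who liked the seed post?
    $seed   = $posts->findOne(['post_id' => 1], $plain);
    $likers = $seed['liker_user_ids'];

    // 2. What else did those users like? Count co-likers per post.
    $alsoLiked = [];
    foreach ($posts->find(['liker_user_ids' => ['$in' => $likers]], $plain) as $doc) {
        if ($doc['post_id'] === $seed['post_id']) {
            continue;   // skip the seed itself
        }
        $alsoLiked[$doc['post_id']] = count(array_intersect($doc['liker_user_ids'], $likers));
    }
    arsort($alsoLiked);   // "users who liked this also liked" - most co-liked first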

Hope that gives a few ideas. For graph analysis there are some frameworks that may help, like Faunus for stats and derivations.
