用于在 Google 新闻中生成推荐的算法?

发布于 2024-08-30 05:49:10 字数 839 浏览 7 评论 0原文

我正在研究推荐引擎,我浏览了论文,它定义了如何Google 新闻基于协作过滤,向用户推荐他们可能感兴趣的新闻项目。

他们提到的一项有趣的技术是 Minhashing。我经历了它的作用,但我很确定我的想法很模糊,而且我很可能是错的。以下是我可以从中得出的结论:-

  1. 收集一组所有新闻项目。
  2. 为用户定义哈希函数。此哈希函数返回该用户查看的新闻项目中第一个项目的索引,该新闻项目在所有新闻项目列表中。
  3. 收集“n”个这样的值,并用该值列表代表用户。
  4. 根据这些列表之间的相似度计数,我们可以将用户之间的相似度计算为共同项目的数量。这样就减少了很多比较的次数。
  5. 根据这些相似性度量,将用户分组到不同的集群中。

我认为可能就是这样。在步骤 2 中,我们可能会改变哈希函数,使其返回不同元素的索引,而不是定义常量哈希函数。因此,一个哈希函数可以返回用户列表中第一个元素的索引,另一个哈希函数可以返回用户列表中第二个元素的索引,依此类推。因此,哈希函数的性质满足 minwise 独立排列 条件,这听起来确实是一种可能的方法。

有人可以确认我的想法是否正确吗?或者 Google 新闻推荐的 minhashing 部分以其他方式发挥作用?我是建议内部实施的新手。非常感谢任何帮助。

谢谢!

I'm study recommendation engines, and I went through the paper that defines how Google News generates recommendations to users for news items which might be of their interest, based on collaborative filtering.

One interesting technique that they mention is Minhashing. I went through what it does, but I'm pretty sure that what I have is a fuzzy idea and there is a strong chance that I'm wrong. The following is what I could make out of it :-

  1. Collect a set of all news items.
  2. Define a hash function for a user. This hash function returns the index of the first item from the news items which this user viewed, in the list of all news items.
  3. Collect, say "n" number of such values, and represent a user with this list of values.
  4. Based on the similarity count between these lists, we can calculate the similarity between users as the number of common items. This reduces the number of comparisons a lot.
  5. Based on these similarity measures, group users into different clusters.

This is just what I think it might be. In Step 2, instead of defining a constant hash function, it might be possible that we vary the hash function in a way that it returns the index of a different element. So one hash function could return the index of the first element from the user's list, another hash function could return the index of the second element from the user's list, and so on. So the nature of the hash function satisfying the minwise independent permutations condition, this does sound like a possible approach.

Could anyone please confirm if what I think is correct? Or the minhashing portion of Google News Recommendations, functions in some other way? I'm new to internal implementations of recommendations. Any help is appreciated a lot.

Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

↙温凉少女 2024-09-06 05:49:10

我想你已经很接近了。

首先,哈希函数首先随机排列所有新闻项,然后对于任何给定的人查看第一项。由于每个人都有相同的排列,因此两个人很有可能拥有相同的第一项。

然后,为了获得新的哈希函数,他们不是选择第二个元素(这会对第一个元素有一些令人困惑的依赖关系),而是选择一个全新的排列并再次采用第一个元素。

碰巧有 2-4 次相同哈希值(即 2-4 次排列中第一个元素相同)的人被放在一个簇中。这个算法重复10-20次,这样每个人就被分成10-20个簇。最后,根据 10-20 个集群中的(少数)其他人给出推荐。由于所有这些工作都是通过哈希完成的,因此人们可以直接放入其集群的存储桶中,不需要进行大量比较。

I think you're close.

First of all, the hash function first randomly permutes all the news items, and then for any given person looks at the first item. Since everyone had the same permutation, two people have a decent chance of having the same first item.

Then, to get a new hash function, rather than choosing the second element (which would have some confusing dependencies on the first element), they choose a whole new permutation and take the first element again.

People who happen to have the same hash value 2-4 times (that is, the same first element in 2-4 permutations) are put together in a cluster. This algorithm is repeated 10-20 times, so that each person gets put into 10-20 clusters. Finally, recommendations are given based (the small number of) other people in the 10-20 clusters. Since all this work is done by hashing, people are put directly into buckets for their clusters, and large numbers of comparisons aren't needed.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文