Content-based approaches for item recommendation

Posted on 2024-10-09 14:48:39

I'm currently developing an application in which I want to group similar items. Items (like videos) can be created by users, and their attributes can be altered or extended later (like new tags). Instead of relying on users' preferences, as most collaborative filtering mechanisms do, I want to compare item similarity based on the items' attributes (similar length, similar colors, a similar set of tags, etc.). The computation is necessary for two main purposes: suggesting x similar items for a given item, and clustering items into groups of similar items.
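To make this concrete, here is a rough sketch of the kind of attribute-based similarity I have in mind (plain Java; the Item class, the chosen attributes, and the weights are just placeholders, not a fixed design):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical item with a few of the attributes mentioned above.
class Item {
    Set<String> tags;
    double lengthSeconds;

    Item(Set<String> tags, double lengthSeconds) {
        this.tags = tags;
        this.lengthSeconds = lengthSeconds;
    }
}

class AttributeSimilarity {
    // Jaccard similarity of the two tag sets: |A ∩ B| / |A ∪ B|.
    static double tagSimilarity(Item a, Item b) {
        if (a.tags.isEmpty() && b.tags.isEmpty()) return 0.0;
        Set<String> intersection = new HashSet<>(a.tags);
        intersection.retainAll(b.tags);
        Set<String> union = new HashSet<>(a.tags);
        union.addAll(b.tags);
        return (double) intersection.size() / union.size();
    }

    // Length similarity in [0, 1]: 1 when equal, approaching 0 as lengths diverge.
    static double lengthSimilarity(Item a, Item b) {
        double max = Math.max(a.lengthSeconds, b.lengthSeconds);
        if (max == 0) return 1.0;
        return 1.0 - Math.abs(a.lengthSeconds - b.lengthSeconds) / max;
    }

    // Overall similarity as a weighted combination; the weights here are arbitrary.
    static double similarity(Item a, Item b) {
        return 0.7 * tagSimilarity(a, b) + 0.3 * lengthSimilarity(a, b);
    }
}
```

How the individual attributes should be weighted against each other is exactly one of the things I am unsure about.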

My application so far follows an asynchronous design, and I want to decouple this clustering component as far as possible. The creation of new items or the addition of new attributes to an existing item will be advertised by publishing events that the component can then consume.

Computations can be provided best-effort and "snapshotted", which means that I'm okay with the best result possible at a given point in time, although result quality will eventually increase.

So I am now searching for appropriate algorithms to compute both similar items and clusters. An important constraint is scalability. Initially the application has to handle a few thousand items, but later it might have to handle millions of items as well. Of course, computations will then be executed on additional nodes, but the algorithm itself should scale. It would also be nice if the algorithm supported some kind of incremental mode for partial changes to the data.

My initial idea of comparing every item with every other item and storing the numerical similarity sounds a little crude. It would also require n*(n-1)/2 entries to store all similarities, and any change or new item would eventually cause n similarity computations.

Thanks in advance!

UPDATE tl;dr

To clarify what I want, here is my targeted scenario:

  • Users generate entries (think of documents)
  • Users edit entry metadata (think of tags)

And here is what my system should provide:

  • A list of entries similar to a given item, as recommendations
  • Clusters of similar entries

Both calculations should be based on:

  • The metadata/attributes of entries (i.e. the use of similar tags)
  • Thus, the distance between two entries using appropriate metrics
  • NOT based on user votes, preferences or actions (unlike collaborative filtering). Although users may create entries and change attributes, the computation should only take the items and their attributes into account, not the users associated with them (just like a system where only items and no users exist).

Ideally, the algorithm should support:

  • Permanent changes to an entry's attributes
  • Incremental recomputation of similar entries/clusters on such changes
  • Scaling to large numbers of entries
  • Something better than a simple distance table, if possible (because of its O(n²) space complexity); one possible approach is sketched below
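For completeness, this is the kind of incremental structure I could imagine instead of a full distance table (plain Java sketch, purely illustrative; the id scheme and event handling are assumptions): an inverted index from tag to item ids, so that when an item is created or re-tagged, only items sharing at least one tag need to be re-scored.

```java
import java.util.*;

// Hypothetical incremental index: tag -> ids of items carrying that tag.
class TagIndex {
    private final Map<String, Set<String>> itemsByTag = new HashMap<>();
    private final Map<String, Set<String>> tagsByItem = new HashMap<>();

    // Called when an item is created or its tags change (e.g. from a published event).
    void update(String itemId, Set<String> newTags) {
        Set<String> oldTags = tagsByItem.getOrDefault(itemId, Collections.emptySet());
        for (String tag : oldTags) {
            Set<String> ids = itemsByTag.get(tag);
            if (ids != null) ids.remove(itemId);
        }
        for (String tag : newTags) {
            itemsByTag.computeIfAbsent(tag, t -> new HashSet<>()).add(itemId);
        }
        tagsByItem.put(itemId, new HashSet<>(newTags));
    }

    // Candidate neighbours: only items that share at least one tag.
    // Similarities then need to be computed only against this (usually small)
    // candidate set instead of against all n items.
    Set<String> candidates(String itemId) {
        Set<String> result = new HashSet<>();
        for (String tag : tagsByItem.getOrDefault(itemId, Collections.emptySet())) {
            result.addAll(itemsByTag.getOrDefault(tag, Collections.emptySet()));
        }
        result.remove(itemId);
        return result;
    }
}
```

The candidate set is usually much smaller than n, and the index itself is easy to keep up to date from the published events, which would fit the incremental requirement above.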

Comments (7)

月依秋水 2024-10-16 14:48:39

Instead of writing this from scratch, take a look at mahout.apache.org. It has the clustering algorithms you are looking for as well as recommendation algorithms. It works with Hadoop, so you can scale it out easily.

This will allow you to determine similar documents in a cluster based on your keywords and/or the description of the video.

https://cwiki.apache.org/MAHOUT/k-means-clustering.html

has a quick tutorial about clustering documents using a Reuters dataset. It is quite similar to what you are trying to achieve. Mahout includes recommendation algorithms such as slope-one, user-based and item-based, and is incredibly easy to extend. It also has some pretty useful clustering algorithms that support dimensionality reduction. This is useful for you in case your matrix is sparse (that is, a lot of tags that have very few usage statistics).

Also take a look at Lucene to use its tf-idf features to cluster tags and documents, and check out Solr as well. Both are Apache projects.
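Independent of Mahout and Lucene, the underlying tf-idf-plus-cosine idea for tags looks roughly like this (plain Java sketch for illustration only, not the Mahout or Lucene API; the map-based sparse vectors are an assumption):

```java
import java.util.*;

// Plain-Java illustration of tf-idf weighting of tags and cosine similarity.
class TfIdfTags {
    // tagsByItem: itemId -> set of tags; tf is 0/1 since a tag is either present or not.
    static Map<String, Map<String, Double>> tfidfVectors(Map<String, Set<String>> tagsByItem) {
        int n = tagsByItem.size();

        // Document frequency of each tag across all items.
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> tags : tagsByItem.values())
            for (String tag : tags)
                df.merge(tag, 1, Integer::sum);

        // Sparse tf-idf vector per item: rarer tags get higher weight.
        Map<String, Map<String, Double>> vectors = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : tagsByItem.entrySet()) {
            Map<String, Double> vec = new HashMap<>();
            for (String tag : e.getValue()) {
                double idf = Math.log((double) n / df.get(tag));
                vec.put(tag, idf); // tf is 1 for a present tag
            }
            vectors.put(e.getKey(), vec);
        }
        return vectors;
    }

    // Cosine similarity between two sparse tag vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With sparse vectors like this, the dot product only touches tags the two items actually share, which keeps a single comparison cheap even for a large tag vocabulary.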

成熟稳重的好男人 2024-10-16 14:48:39

Recommendation Algorithm would be very helpful, as it lists standard algorithms for your purpose.

Updated:

I guess what you are looking for is Collaborative Quality Filtering and not only Collaborative Filtering; I have attached a link to a paper, hope this helps.

一花一树开 2024-10-16 14:48:39

K-means clustering may be what you want.

N.B.:

The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results ... It works very well on some data sets, while failing miserably on others.

So you should consider how many clusters, how many tags, and what metric.

See also Stack Overflow questions/tagged/k-means.
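For illustration, a minimal plain-Java k-means sketch over dense feature vectors (for example tf-idf weighted tag vectors); k, the iteration limit, and the random initialization are assumptions, not recommendations:

```java
import java.util.*;

// Minimal k-means over dense vectors; illustration only, not production code.
// Assumes points.length >= k and all points have the same dimensionality.
class KMeans {
    // Returns, for each point, the index of the cluster it was assigned to.
    static int[] cluster(double[][] points, int k, int maxIterations, long seed) {
        int n = points.length, dim = points[0].length;
        Random rnd = new Random(seed);

        // Initialise centroids with k distinct random points.
        double[][] centroids = new double[k][];
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, rnd);
        for (int c = 0; c < k; c++) centroids[c] = points[indices.get(c)].clone();

        int[] assignment = new int[n];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;

            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < dim; j++) {
                        double diff = points[i][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break; // converged

            // Update step: move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) sums[assignment[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // keep the old centroid for empty clusters
                for (int j = 0; j < dim; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return assignment;
    }
}
```

As the quote above notes, the result depends heavily on the choice of k and the metric, so treat this only as a starting point.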

简单爱 2024-10-16 14:48:39

http://taste.sourceforge.net/old.html

Taste is a flexible, fast collaborative filtering engine for Java. The engine takes users' preferences for items ("tastes") and returns estimated preferences for other items. For example, a site that sells books or CDs could easily use Taste to figure out, from past purchase data, which CDs a customer might be interested in listening to.

Taste provides a rich set of components from which you can construct a customized recommender system from a selection of algorithms. Taste is designed to be enterprise-ready; it's designed for performance, scalability and flexibility. It supports a standard EJB interface for J2EE-based applications, but Taste is not just for Java; it can be run as an external server which exposes recommendation logic to your application via web services and HTTP.

http://savannah.nongnu.org/projects/cofi/

Currently, programmers who want to use collaborative filtering have to read the literature and implement their own algorithms. More often than not, programmers probably design their own algorithms and they will generally produce suboptimal algorithms. We want to build a foundation of already tested and documented algorithms that can be used in a wide range of contexts from research to applications. The guiding principle is that the design should be thin. Cofi doesn't want to be all things for all people. So the focus is on delivering very few lines of code and relying on the programmer to provide the necessary glue.

A few more here.

ぇ气 2024-10-16 14:48:39

Before starting to implement, adapt, or use an existing library, make sure you know the domain; reading something like "Collective Intelligence in Action" is a good start.

孤者何惧 2024-10-16 14:48:39

You want item-based collaborative filtering rather than user-based. There are a number of algorithms for this floating around on Google. Item-based solutions always scale better than user-based solutions. Item based collaborative filtering in PHP has some easy-to-follow example code and fits what you're looking for.

深爱成瘾 2024-10-16 14:48:39

You have to decide what the similarity metric is, based on the specifics of your product and your own good sense. Is the length of a video important? If so, it deserves a high weight.
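As a small, purely illustrative sketch (plain Java; the attribute names and the single weight parameter are assumptions): numeric attributes such as length should be normalized to a comparable scale before being weighted against tag similarity, otherwise the weight has little meaning.

```java
// Illustration only: put a numeric attribute on a [0, 1] scale before weighting it,
// so that a hand-chosen weight actually reflects its intended importance.
class AttributeScaling {
    // Min-max normalisation of video length over the known corpus.
    static double normalizedLength(double lengthSeconds, double minLength, double maxLength) {
        if (maxLength == minLength) return 0.0;
        return (lengthSeconds - minLength) / (maxLength - minLength);
    }

    // Weighted combination; tagSim and lengthSim are assumed to already be in [0, 1].
    static double combined(double tagSim, double lengthSim, double tagWeight) {
        return tagWeight * tagSim + (1.0 - tagWeight) * lengthSim;
    }
}
```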
