Approaches for content-based item recommendation
I'm currently developing an application where I want to group similar items. Items (like videos) can be created by users, and their attributes can be altered or extended later (like new tags). Instead of relying on users' preferences as most collaborative filtering mechanisms do, I want to compare item similarity based on the items' attributes (like similar length, similar colors, a similar set of tags, etc.). The computation is necessary for two main purposes: suggesting x similar items for a given item, and clustering items into groups of similar items.
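To make this concrete, the kind of attribute-based comparison I have in mind looks roughly like the following Python sketch (the attribute names are placeholders, and a real version would of course combine more attributes than just tags):

```python
def tag_similarity(item_a, item_b):
    """Jaccard similarity of two items' tag sets: |intersection| / |union|."""
    a, b = set(item_a["tags"]), set(item_b["tags"])
    return len(a & b) / len(a | b) if a | b else 0.0

def suggest_similar(item, all_items, x=10):
    """Purpose 1: the x most similar items for a given item, ranked by attribute similarity."""
    scored = [(tag_similarity(item, other), other) for other in all_items if other is not item]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:x]]
```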
My application so far follows an asynchronous design, and I want to decouple this clustering component as much as possible. The creation of new items or the addition of new attributes to an existing item will be announced by publishing events which the component can then consume.
Computations can be provided best-effort and "snapshotted", meaning that I'm okay with the best result available at a given point in time, with result quality improving over time.
So I am now searching for appropriate algorithms to compute both similar items and clusters. An important constraint is scalability. Initially the application has to handle a few thousand items, but later it might have to handle millions of items as well. Of course, computations will then be executed on additional nodes, but the algorithm itself should scale. It would also be nice if the algorithm supported some kind of incremental mode for partial changes of the data.
My initial thought, comparing each item with every other item and storing the numerical similarities, sounds a little bit crude. It also requires n*(n-1)/2 entries to store all similarities, and any change or new item eventually causes n similarity computations.
Thanks in advance!
UPDATE tl;dr
To clarify what I want, here is my targeted scenario:
- Users generate entries (think of documents)
- Users edit entry metadata (think of tags)
And here is what my system should provide:
- A list of entries similar to a given item, as recommendations
- Clusters of similar entries
Both calculations should be based on:
- The metadata/attributes of entries (e.g. usage of similar tags)
- Thus, the distance between two entries using an appropriate metric
- NOT on user votes, preferences or actions (unlike collaborative filtering). Although users may create entries and change attributes, the computation should only take into account the items and their attributes, not the users associated with them (just like a system where only items and no users exist).
Ideally, the algorithm should support:
- permanent changes to an entry's attributes
- incremental computation of similar entries/clusters on changes
- scale
- something better than a simple distance table, if possible (because of the O(n²) space complexity)
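To illustrate the last point, the direction I'm considering instead of a full distance table is an inverted index from tags to items: when an item is created or its tags change, only items sharing at least one tag are re-scored, and only the top-k neighbours per item are stored. This is only a rough Python sketch under those assumptions, not a finished design:

```python
from collections import defaultdict

tag_index = defaultdict(set)   # tag -> {item_id}
items = {}                     # item_id -> set of tags
neighbours = {}                # item_id -> [(score, other_id), ...], top-k only

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def on_item_changed(item_id, tags, k=10):
    """Consume an 'item created/updated' event and refresh that item's neighbour list."""
    tags = set(tags)
    for old_tag in items.get(item_id, set()) - tags:
        tag_index[old_tag].discard(item_id)
    for tag in tags:
        tag_index[tag].add(item_id)
    items[item_id] = tags

    # Candidates are only the items sharing at least one tag -- never all n items.
    candidates = set().union(*(tag_index[t] for t in tags)) - {item_id}
    scored = sorted(((jaccard(tags, items[c]), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    neighbours[item_id] = scored[:k]
    # A complete version would also refresh the neighbour lists of the affected
    # candidates, ideally asynchronously from the same event.
```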
7 Answers
Instead of writing this from scratch, take a look at mahout.apache.org. It has the clustering algorithms you are looking for as well as the recommendation algorithms. It works alongside Hadoop, so you can scale it out easily.
This will allow you to determine similar documents in a cluster based on your keywords and/or the description of the video.
https://cwiki.apache.org/MAHOUT/k-means-clustering.html
has a quick tutorial about clustering documents using a Reuters dataset. It is quite similar to what you are trying to achieve. Mahout includes recommendation algorithms such as slope one, user-based and item-based, and is incredibly easy to extend. It also has some pretty useful clustering algorithms that support dimension-reduction features. This is useful in case your matrix is sparse (that is, a lot of tags with very few usage stats).
Also take a look at Lucene to use its tf-idf features to cluster tags and documents. Also check out Solr. Both are Apache projects.
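To sketch what tf-idf gives you here, independent of the Lucene/Mahout APIs: rare tags get a higher weight than tags almost every item carries, and cosine similarity over those weights works as a distance for both the recommendations and the clustering. A minimal, library-free illustration in Python (the example tags are made up):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (e.g. each item's tags). Returns one {token: weight} dict per doc."""
    n = len(docs)
    df = Counter(token for doc in docs for token in set(doc))  # document frequency per token
    return [{tok: count * math.log(n / df[tok]) for tok, count in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = tfidf_vectors([["cats", "funny"], ["cats", "cute"], ["python", "tutorial"]])
print(cosine(vectors[0], vectors[1]))  # shared tag "cats" contributes, weighted down because it is common
```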
Recommendation Algorithm would be very helpful, as it lists standard algorithms for dealing with your purpose.
Updated:
I guess what you are looking at is Collaborative Quality Filtering and not just Collaborative Filtering; I have attached a link to the paper, hope this helps.
K-means clustering may be what you want.
N.B.: you should consider how many clusters you want, how many tags there are, and what metric to use.
See also Stack Overflow questions/tagged/k-means.
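For a concrete feel of k-means over tag data (a sketch assuming scikit-learn and NumPy are available; the items and tags are made up): each tag becomes a binary feature column, and k, the number of clusters, has to be chosen up front.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up items with their tag sets.
items = {
    "video1": {"cats", "funny"},
    "video2": {"cats", "cute"},
    "video3": {"python", "tutorial"},
    "video4": {"python", "screencast"},
}

# One binary feature column per tag.
all_tags = sorted(set().union(*items.values()))
X = np.array([[1 if tag in tags else 0 for tag in all_tags] for tags in items.values()])

# k must be chosen in advance -- that is the main caveat with k-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(items, labels.tolist())))  # e.g. {'video1': 0, 'video2': 0, 'video3': 1, 'video4': 1}
```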
http://taste.sourceforge.net/old.html
http://savannah.nongnu.org/projects/cofi/
Few more here
Before starting to implement, adapt, or use an existing library, make sure you know the domain; reading something like "Collective Intelligence in Action" is a good start.
You want item-based collaborative filtering rather than user-based. There are a number of algorithms for this floating around on Google. Item-based solutions always scale better than user-based solutions. "Item based collaborative filtering in PHP" has some easy-to-follow example code and fits what you're looking for.
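For reference, this is the core of what item-based CF computes: item-to-item cosine similarity over user interactions (a sketch with made-up ratings; note it still relies on user actions, unlike the purely attribute-based comparison the question asks for):

```python
import math

ratings = {  # user -> {item: rating}; purely illustrative data
    "u1": {"video1": 5, "video2": 4},
    "u2": {"video1": 4, "video3": 5},
    "u3": {"video2": 5, "video3": 4},
}

def item_vector(item):
    """The item as a vector over the users who rated it."""
    return {user: user_ratings[item] for user, user_ratings in ratings.items() if item in user_ratings}

def item_similarity(i, j):
    vi, vj = item_vector(i), item_vector(j)
    dot = sum(vi[u] * vj[u] for u in vi.keys() & vj.keys())
    norm = math.sqrt(sum(x * x for x in vi.values())) * math.sqrt(sum(x * x for x in vj.values()))
    return dot / norm if norm else 0.0

print(item_similarity("video1", "video2"))  # driven entirely by users who rated both
```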
You have to decide what the similarity metric is, based on the specifics of your product and your good judgment. Is the length of a video important? If so, it deserves a high weight.
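For example (a sketch with made-up attributes and weights): raising the weight of length changes which items count as similar.

```python
def weighted_similarity(a, b, weights):
    """Each attribute contributes a similarity in [0, 1], scaled by its weight."""
    union = a["tags"] | b["tags"]
    tag_sim = len(a["tags"] & b["tags"]) / len(union) if union else 0.0
    length_sim = 1.0 - abs(a["length"] - b["length"]) / max(a["length"], b["length"], 1)
    total = sum(weights.values())
    return (weights["tags"] * tag_sim + weights["length"] * length_sim) / total

a = {"tags": {"cats", "funny"}, "length": 60}
b = {"tags": {"cats"}, "length": 600}
print(weighted_similarity(a, b, {"tags": 1.0, "length": 0.0}))  # tags only: 0.5
print(weighted_similarity(a, b, {"tags": 1.0, "length": 2.0}))  # length weighted heavily: ~0.23
```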