基于内容的项目推荐方法

发布于 2024-10-09 14:48:39 字数 1109 浏览 4 评论 0原文

我目前正在开发一个应用程序，我想将类似的项目分组。项目（如视频）可以由用户创建，并且它们的属性可以在以后更改或扩展（如新标签）。我不想像大多数协作过滤机制那样依赖用户的偏好，而是想根据项目的属性（例如相似的长度、相似的颜色、相似的标签集等）来比较项目相似性。计算对于两个主要目的是必要的：为给定项目建议 x 个相似项目以及聚类为相似项目组。

到目前为止，我的应用程序遵循异步设计，我想尽可能地解耦这个集群组件。新项目的创建或现有项目的新属性的添加将通过发布组件随后可以使用的事件来通告。

可以尽力提供计算和“快照”，这意味着我可以在给定时间点获得最佳结果，尽管结果质量最终会提高。

所以我现在正在寻找合适的算法来计算相似的项目和集群。一个重要的限制是可扩展性。最初，应用程序必须处理几千个项目，但后来也可能处理数百万个项目。当然，计算将在其他节点上执行，但算法本身应该可扩展。如果算法支持对数据的部分更改进行某种增量模式，那就太好了。

我最初的想法是将每个项目相互比较并存储数字相似度，这听起来有点粗糙。此外，它需要 n*(n-1)/2 条目来存储所有相似性，任何更改或新项目最终都会导致 n 相似性计算。

提前致谢！

更新tl;dr

为了澄清我想要什么，这是我的目标场景：

用户生成条目（考虑文档）
用户编辑条目元数据（考虑标签）

这是我的系统应该提供的内容：

作为推荐的给定项目的相似条目列表
相似条目的集群

两种计算都应基于：

条目的元数据/属性（即相似标签的使用）
因此，使用适当度量的两个条目的距离
不基于关于用户投票、偏好或操作（与协作过滤不同）。尽管用户可以创建条目并更改属性，但计算应仅考虑项目及其属性，而不考虑与之关联的用户（就像仅存在项目而不存在用户的系统一样）。

理想情况下，该算法应支持：

条目属性的永久更改增量
计算更改
规模
上的相似条目/簇如果可能的话，比简单的距离表更好（因为 O(n²) 空间复杂度）

原文

I'm currently developing an application where I want to group similar items. Items (like videos) can be created by users and also their attributes can be altered or extended later (like new tags). Instead of relying on users' preferences as most collaborative filtering mechanisms do, I want to compare item similarity based on the items' attributes (like similar length, similar colors, similar set of tags, etc.). The computation is necessary for two main purposes: Suggesting x similar items for a given item and for clustering into groups of similar items.

My application so far is follows an asynchronous design and I want to decouple this clustering component as far as possible. The creation of new items or the addition of new attributes for an existing item will be advertised by publishing events the component can then consume.

Computations can be provided best-effort and "snapshotted", which means that I'm okay with the best result possible at a given point in time, although result quality will eventually increase.

So I am now searching for appropriate algorithms to compute both similar items and clusters. At important constraint is scalability. Initially the application has to handle a few thousand items, but later million items might be possible as well. Of course, computations will then be executed on additional nodes, but the algorithm itself should scale. It would also be nice if the algorithm supports some kind of incremental mode on partial changes of the data.

My initial thought of comparing each item with each other and storing the numerical similarity sounds a little bit crude. Also, it requires n*(n-1)/2 entries for storing all similarities and any change or new item will eventually cause n similarity computations.

Thanks in advance!

UPDATE tl;dr

To clarify what I want, here is my targeted scenario:

User generate entries (think of documents)
User edit entry meta data (think of tags)

And here is what my system should provide:

List of similar entries to a given item as recommendation
Clusters of similar entries

Both calculations should be based on:

The meta data/attributes of entries (i.e. usage of similar tags)
Thus, the distance of two entries using appropriate metrics
NOT based on user votings, preferences or actions (unlike collaborative filtering). Although users may create entries and change attributes, the computation should only take into account the items and their attributes, and not the users associated with (just like a system where only items and no users exist).

Ideally, the algorithm should support:

permanent changes of attributes of an entry
incrementally compute similar entries/clusters on changes
scale
something better than a simple distance table, if possible (because of the O(n²) space complexity)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

月依秋水 2024-10-16 14:48:39

不要从头开始编写，请查看 mahout.apache.org。它具有您正在寻找的聚类算法以及推荐算法。它与 Hadoop 一起工作，因此您可以轻松扩展。

这将允许您根据您的关键字和/或视频描述确定集群中的类似文档。

https://cwiki.apache.org/MAHOUT/k-means-clustering.html

有一个关于使用 Reuters 数据集对文档进行聚类的快速教程。它与您想要实现的目标非常相似。 Mahout 包括推荐算法，例如斜率一、基于用户、基于项目的算法，并且非常容易扩展。它还具有一些非常有用的聚类算法，支持降维功能。如果您的矩阵稀疏（即很多标签的使用统计数据很少），这对您很有用。

另请查看 Lucene，以使用其 tfidf 功能对标签和文档进行聚类。另请检查 Solr。两者都是 Apache 项目。

回复收藏 0 原文

成熟稳重的好男人 2024-10-16 14:48:39

推荐算法会非常有帮助，因为它列出了处理问题的标准算法你的目的。

更新：

我想您正在查看的是协作质量过滤和不仅是协同过滤，我还附上了论文链接，希望这有帮助。

回复收藏 0 原文

一花一树开 2024-10-16 14:48:39

K-means 聚类可能就是您想要的。

注意：

簇的数量 k 是一个输入参数：k 选择不当可能会产生糟糕的结果......它在某些数据集上效果很好，但在其他数据集上却表现不佳。

因此，您应该考虑有多少个集群、多少个标签以及什么指标。

另请参阅 StackOverflow questions/tagged/k-means。

回复收藏 0 原文

简单爱 2024-10-16 14:48:39

http://taste.sourceforge.net/old.html

口味灵活、快捷
协同过滤引擎
爪哇。该引擎将用户的
对物品的偏好（“口味”）和
返回估计偏好
其他物品。例如，一个网站
出售可以轻松使用的书籍或 CD
从过去的味道中找出答案
购买数据，CD 给客户
可能有兴趣听一下。
味道提供了丰富的
您可以从中使用组件
构建定制推荐器
系统从一系列算法中选择。
味道被设计为
企业就绪；它是专为
性能、可扩展性和
灵活性。它支持一个标准
基于 J2EE 的 EJB 接口
应用程序，但品味不仅仅是
对于Java；它可以作为外部运行
公开推荐的服务器
通过网络将逻辑添加到您的应用程序
服务和 HTTP。

http://savannah.nongnu.org/projects/cofi/

目前，想要使用的程序员
协同过滤必读
文献并实施自己的
算法。更多时候，
程序员可能自己设计
算法，他们通常会
产生次优算法。我们想要
建立已经的基础
测试了算法并记录了
可用于广泛的
从研究到的背景
应用程序。指导原则
就是设计要薄。
Cofi 不想成为一切
为所有人。所以重点是
交付很少的代码行，
并依赖程序员
提供必要的胶水。