Continuous collaborative filtering with Mahout
I am in the process of evaluating Mahout as a collaborative-filtering recommendation engine. So far it looks great.
We have almost 20M boolean recommendations from 12M different users.
According to Mahout's wiki and a few threads by Sean Owen, one machine should be sufficient in this case. Because of that, I decided to go with MySQL as the data model and skip the overhead of using Hadoop for now.
One thing eludes me, though: what are the best practices for continuously updating the recommendations without re-reading the whole data set from scratch? We get tens of thousands of new recommendations every day. While I do not expect them to be processed in real time, I would like them processed every 15 minutes or so.
Please elaborate on the approaches for both a MySQL-based and a Hadoop-based deployment.
Thanks!
1 Answer
Any database is too slow to query in real time, so any approach involves caching the data set in memory, which is what I assume you're already doing with ReloadFromJDBCDataModel. Just use refresh() to have it reload at whatever interval you like; it should do so in the background. The catch is that it needs a lot of memory to load the new model while still serving from the old one. You could roll your own solution that, say, reloads one user at a time.

There's no such thing as real-time updates on Hadoop. Your best bet there, in general, is to use Hadoop for a full and proper batch computation of the results, and then tweak them at run time (imperfectly) based on new data in the application that holds and serves the recommendations.