数据库模糊查询
我很好奇今天的作品在许多社交网站上的表现如何。
例如,您输入您喜欢的电影列表,系统会建议您可能喜欢的其他电影(基于喜欢与您喜欢相同电影的其他人的电影)。我认为在大型数据集上直接使用 SQL 方式(将我的电影列表与电影用户连接到用户电影组并按电影标题对其应用计数)是不可能实现的,因为此类查询的“繁重” 。
同时我们不需要精确的解,近似的就足够了。我想知道是否有方法可以对传统 RDBMS 实现模糊查询之类的功能,该功能执行速度很快,但有一些不妥之处。或者这些功能如何在真实系统上实现。
I'm curious about how works feature on many social sites today.
For example, you enter list of movies you like and system suggests other movies you may like (based on movies that like other people who likes the same movies that you). I think doing it straight-sql way (join list of my movies with movies-users join with user-movies group by movie title and apply count to it ) on large datasets would be just impossible to implement due to "heaviness" of such query.
At the same time we don't need exact solution, approximate would be enough. I wonder is there way to implement something like fuzzy query to traditional RDBMS that would be fast to execute but has some infelicity. Or how such features implemented on real systems.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
这就是协同过滤或推荐,
除非您需要非常复杂的东西,斜率预测器是最简单的预测器之一,就像 50 行 python,Bryan O'Sullivan 的协作过滤变得简单,Daniel Lemire 等人的论文。引入“Slope One Predictors for Online Rating-Based Collaborative Filtering”,
该方法可以在用户发生变化时一次仅更新一个用户,而在某些情况下无需重新处理整个数据库来更新其他用户
。使用 python 代码来预测文档中未出现的单词的字数,但我遇到了内存问题等,我想我可能会编写一个内存不足的版本,也许使用 sqlite
也可以使用该矩阵,该矩阵的边沿是三角形的对角线是镜像的,因此只需要存储矩阵的一半
that's collaborative filtering, or recommendation
unless you need something really complex the slope one predictor is one of the more simple ones it's like 50 lines of python, Bryan O’Sullivan’s Collaborative filtering made easy, the paper by Daniel Lemire et al. introducing "Slope One Predictors for Online Rating-Based Collaborative Filtering"
this one has a way of updating just one user at a time when they change without in some cases for others that need to reprocess the whole database just to update
i used that python code to do predict the word count of words not occurring in documents but i ran into memory issues and such and i think i might write an out of memory version maybe using sqlite
also the matrix used in that one is triangular the sides along the diagonal are mirrored so only one half of the matrix needs to be stored
您正在寻找的术语是“协作过滤”
阅读编程集体智能,作者:O'Reilly Press
The term you are looking for is "collaborative filtering"
Read Programming Collective Intelligence, by O'Reilly Press
最简单的方法使用贝叶斯网络。有些库可以为您处理大部分数学问题。
The simplest methods use Bayesian networks. There are libraries that can take care of most of the math for you.