Movie recommendation engine: conceptual database design
I am working on a movie recommendation engine and I'm facing a DB design issue.
My current database looks like this:
MOVIES [ID,TITLE]
KEYWORDS_TABLE [ID,KEY_ID]
- where ID is a foreign key to MOVIES.id and KEY_ID is a key into a text keywords table
This is not the entire DB, but I have shown here what is important for my problem.
I have about 50,000 movies and about 1.3 million keyword correlations. Basically, my algorithm consists of extracting all the movies that share keywords with a given movie, then ordering them by the number of shared keywords.
For example, I looked for a movie similar to 'Cast Away' and it returned 'Six days and six nights' because it had the most keyword correlations (4 keywords):
Island
Airplane crash
Stranded
Pilot
The algorithm is based on more factors, but this one is the most important and the most difficult part of the approach.
Basically, what I do now is get all the movies that have at least one keyword in common with the given movie and then order them by other factors, which are not important for the moment.
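To make the ranking step concrete, here is a minimal sketch of counting shared keywords per candidate movie and sorting by that count. It uses Python's built-in sqlite3 module purely so the example is runnable; the function name rank_by_shared_keywords, the index name, the toy rows, and the uniqueness of (id, key_id) pairs are assumptions for illustration, not part of the real schema or data.

```python
import sqlite3

def rank_by_shared_keywords(conn, movie_id, limit=10):
    """Return (title, shared_keyword_count) rows, most shared keywords first.

    Assumes each (id, key_id) pair appears only once in keywords_table,
    so COUNT(*) equals the number of distinct shared keywords.
    """
    sql = """
        SELECT M.title, COUNT(*) AS shared
        FROM keywords_table K1
        JOIN keywords_table K2 ON K2.key_id = K1.key_id
        JOIN movies M ON M.id = K2.id
        WHERE K1.id = ? AND K2.id <> K1.id   -- skip the movie itself
        GROUP BY M.id, M.title
        ORDER BY shared DESC, M.title
        LIMIT ?
    """
    return conn.execute(sql, (movie_id, limit)).fetchall()

# Toy in-memory data, made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE keywords_table (
        id INTEGER, key_id INTEGER, PRIMARY KEY (id, key_id)
    );
    -- Assumed helper index for the self-join on key_id.
    CREATE INDEX idx_keywords_key ON keywords_table (key_id, id);

    INSERT INTO movies VALUES
        (1, 'Cast away'),
        (2, 'Six days and six nights'),
        (3, 'Hypothetical film A'),
        (4, 'Hypothetical film B');

    -- key_id: 1=Island, 2=Airplane crash, 3=Stranded, 4=Pilot, 5 and 6=other
    INSERT INTO keywords_table VALUES
        (1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
        (2, 1), (2, 2), (2, 3), (2, 4),
        (3, 1),
        (4, 6);
""")

for title, shared in rank_by_shared_keywords(conn, 1):
    print(title, shared)
```

Grouping by movie and ordering by the count replaces the DISTINCT in the plain join, and the LIMIT keeps the result small even when thousands of movies share at least one keyword.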
There wouldn't be any problem if there weren't so many records. A query takes up to 10-20 seconds in many cases, and some queries return over 5,000 movies.
Someone already helped me here (thanks, Mark Byers) with optimizing the query, but that's not enough because it still takes too long:
SELECT DISTINCT M.title
FROM keywords_table K1
JOIN keywords_table K2
ON K2.key_id = K1.key_id
JOIN movies M
ON K2.id = M.id
WHERE K1.id = 4
So I thought it would be better to precompute those lists of recommended movies for each movie, but I'm not sure how to design the tables. Is this a good idea, and how would you approach it?
Comments (1)
@Supyxy A better approach for movie recommendations would be a neural network approach based on abstract features generated by analyzing the movie database (factors for generating those features could be keywords, people's ratings, directors' names, actors, etc.).
A sample open source movie recommendation engine is provided by the Filmaster project. You can read about it on the wiki: http://filmaster.org/display/DEV/New+recommendation+engine and get the code from the Bitbucket repository: http://bitbucket.org/filmaster/filmaster-test/src/tip/count_recommendations.cpp
But to address the problem in your current solution: if further query optimization fails, I would add some redundancy to the data by storing references to similar films for each film (the computation could be done lazily or in a daily cron job, as those values should not change often).
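One way the redundancy idea could look, as a rough sketch rather than a definitive design: a separate table holds one row per (movie, similar movie) pair with the shared keyword count, a batch job rebuilds it, and lookups become a single indexed read. The table name similar_movies, its columns, the min_shared threshold, and the toy rows are all assumptions made for illustration; Python's sqlite3 is used only to keep the example runnable.

```python
import sqlite3

def rebuild_similar_movies(conn, min_shared=1):
    """Recompute the (hypothetical) similar_movies table from keywords_table.

    Intended to run offline, e.g. from a daily cron job. min_shared is an
    assumed knob that drops weak matches to keep the table small.
    """
    with conn:  # one transaction: readers never see a half-built table
        conn.execute("DELETE FROM similar_movies")
        conn.execute("""
            INSERT INTO similar_movies (movie_id, similar_id, shared)
            SELECT K1.id, K2.id, COUNT(*)
            FROM keywords_table K1
            JOIN keywords_table K2
              ON K2.key_id = K1.key_id AND K2.id <> K1.id
            GROUP BY K1.id, K2.id
            HAVING COUNT(*) >= ?
        """, (min_shared,))

def recommend(conn, movie_id, limit=10):
    """Read precomputed recommendations: no self-join at request time."""
    return conn.execute("""
        SELECT M.title, S.shared
        FROM similar_movies S
        JOIN movies M ON M.id = S.similar_id
        WHERE S.movie_id = ?
        ORDER BY S.shared DESC, M.title
        LIMIT ?
    """, (movie_id, limit)).fetchall()

# Toy in-memory data, made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE keywords_table (
        id INTEGER, key_id INTEGER, PRIMARY KEY (id, key_id)
    );
    CREATE TABLE similar_movies (
        movie_id   INTEGER NOT NULL,   -- the movie being viewed
        similar_id INTEGER NOT NULL,   -- a movie sharing keywords with it
        shared     INTEGER NOT NULL,   -- number of keywords in common
        PRIMARY KEY (movie_id, similar_id)
    );
    INSERT INTO movies VALUES
        (1, 'Cast away'),
        (2, 'Six days and six nights'),
        (3, 'Hypothetical film A');
    INSERT INTO keywords_table VALUES
        (1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
        (2, 1), (2, 2), (2, 3), (2, 4),
        (3, 1);
""")

rebuild_similar_movies(conn)
print(recommend(conn, 1))
```

The other ranking factors mentioned in the question could be stored as extra columns on the same table, so the expensive self-join only runs during the rebuild.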