电影数据库 - 演员/女演员/标签的存储?
创建一个电影数据库,我不喜欢为每个演员和每个标签提供自己的行,就好像总共有 1000 万部电影,每个电影至少有 20-30 人的演员阵容,我们将有 200-3 亿行在表中。
而且标签变得更加复杂,每部电影可以不受限制。那么这三样东西如何存放才最好呢?理想情况下,这些可以建模为多对多,但仍然会有数亿行。关于存储这些有更好的建议吗?我正在使用MySQL。
我会将其全部转储到一个文本文件中,但我需要链接电影之间的演员并进行一些分析,并允许用户对演员按标签查找电影等进行评分,因此需要使用数据库。
Creating a movie db and I dont like the idea of giving each actor/actress and also each tag its own row as if there are 10 million moives total, each has a cast of atleast 20-30 people we will have 200-300 million rows in the table.
And it gets more complex with tags which can be unlimited per movie. So how to best store these 3 items? Ideally these can be modeled as a Many to Many but still it will have hundreds of millions of rows. Any better suggestions on storing these? I am using MySQL.
I would dump it all in a textfile but I need to link actors between movies and do some analytics also and allow users to rate actors find movies by tag, etc so need to use a DB.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
1000 万部电影看起来相当雄心勃勃。 IMDb 目前的统计数据显示,他们拥有的图书数量不到 180 万,用户数量约为 390 万。
话虽如此,我认为创建标题表、演员表和连接表来解决两者之间的多对多关系没有问题。对于标签来说也是如此。
10 million movies seems pretty ambitious. IMDb's current statistics show that they have less than 1.8M titles and around 3.9M people.
Having said that, I see no problem with creating a table of titles, a table of actors, and a junction table to resolve the many-to-many relationship between the two. The same holds true for tags.
这里听起来可能有点不成熟的优化。您可以将所有演员非规范化为电影表中某种
TEXT
列,但您的性能和搜索将会受到影响,并且会失去关系数据的所有优势。建议保留规范化架构,正如您最初的想法:
MovieActor
和MovieTag
。无论电影数量有多少,也无论数据是否是 DNA 序列:实施设计、测试它、根据您的要求(用户接受度、SLA 等)判断其性能
It sounds perhaps a bit of premature optimization here. You could denormalize all actors into a
TEXT
column of some kind onto the Movie table, but your performance + search would suffer, as well as losing all benefits of relational data.Suggest to keep the normalized schema, as you were originally thinking:
MovieActor
andMovieTag
.Regardless of the number of movies, or whether the data is DNA sequences: implement a design, test it, judge its performance based on your requirements (user acceptance, SLA, etc)
您厌恶数百万行的原因是什么?感知到的性能问题?
它将在某个地方拥有数亿个关系。你确实必须捕捉演员和电影之间的映射,正如你所说,其中有 200-3 亿部(尽管我不相信存在 1000 万部电影?)
如果你真的想要,你可以(例如)将一部电影的演员 ID 打包到多列(或一列)中,但这会使搜索变得不愉快。
What's the reason for your aversion to the millions of rows? A perceived performance issue?
It's going to have hundreds of millions of relations somewhere. You do have to capture the mapping between actor and film and as you say, there are 200-300 million of those (although I don't believe there are 10 million movies in existence?)
If you really wanted, you could (for example) pack the ids for actors for a film into multiple columns (or into one column) but that would make searching unpleasant.
1000 万部电影,每部有 20 到 30 名演员(尽管这个数字听起来比现实生活中要多)必然会产生 200-3 亿个联想。如果您将数据存储在关系数据库中,则每个关联自然会成为将电影与演员链接起来的表中的一行。每行都非常小(两列 - 电影 PK 和演员 PK;可能还有一个额外的代理键列);大部分数据将存储在电影和演员表中。
任何其他解决方案(在 SQL 数据库中)都将以不太理想的格式存储相同数量的数据。
10 million movies w/ 20 to 30 cast members each (although the number sound higher than real life) will invariably lead to 200-300 million associations. If you're storing your data in a relational database, each association will naturally be one row in a table linking movies to actors. Each row will be very small (two columns - movie PK and actor PK; possibly an extra surrogate key column); the bulk of the data will be stored in the movies and actors table.
Any other solution (in an SQL database) will store the same amount of data in a less optimal format.