Movie database - how to store actors/actresses/tags?

Posted on 2024-10-08 17:08:14

I'm creating a movie DB, and I don't like the idea of giving each actor/actress and each tag its own row: if there are 10 million movies in total and each has a cast of at least 20-30 people, we will have 200-300 million rows in the table.

It gets more complex with tags, which can be unlimited per movie. So how should I best store these three items? Ideally these would be modeled as many-to-many, but that will still produce hundreds of millions of rows. Any better suggestions for storing these? I am using MySQL.

I would dump it all in a text file, but I need to link actors between movies, do some analytics, and allow users to rate actors, find movies by tag, etc., so I need to use a DB.

Comments (4)

一个人的旅程 2024-10-15 17:08:14

10 million movies seems pretty ambitious. IMDb's current statistics show that they have fewer than 1.8M titles and around 3.9M people.

Having said that, I see no problem with creating a table of titles, a table of actors, and a junction table to resolve the many-to-many relationship between the two. The same holds true for tags.
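As a rough illustration of that layout, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for MySQL. The table and column names (`title`, `actor`, `title_actor`), the sample rows, and the `co_stars` helper are all hypothetical, chosen only for this example. It shows one narrow junction row per (title, actor) pair and a query that links actors across movies.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the DDL is kept generic.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Junction table: one narrow row per (title, actor) pair.
CREATE TABLE title_actor (
    title_id INTEGER NOT NULL REFERENCES title(id),
    actor_id INTEGER NOT NULL REFERENCES actor(id),
    PRIMARY KEY (title_id, actor_id)
);
-- Reverse index so "all titles for an actor" does not scan the whole table.
CREATE INDEX idx_title_actor_actor ON title_actor (actor_id, title_id);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO title VALUES (?, ?)",
                 [(1, "Movie A"), (2, "Movie B"), (3, "Movie C")])
conn.executemany("INSERT INTO actor VALUES (?, ?)",
                 [(1, "Actor X"), (2, "Actor Y"), (3, "Actor Z")])
conn.executemany("INSERT INTO title_actor VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (2, 3), (3, 3)])


def co_stars(conn, actor_id):
    """Return names of actors who share at least one title with actor_id."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.name
        FROM title_actor AS mine
        JOIN title_actor AS theirs ON theirs.title_id = mine.title_id
        JOIN actor AS a ON a.id = theirs.actor_id
        WHERE mine.actor_id = ? AND theirs.actor_id <> ?
        ORDER BY a.name
        """,
        (actor_id, actor_id),
    ).fetchall()
    return [row[0] for row in rows]


print(co_stars(conn, 1))
```

A tag table with its own junction table would follow the same pattern.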

羁拥 2024-10-15 17:08:14

This sounds like a bit of premature optimization. You could denormalize all actors into a TEXT column of some kind on the Movie table, but performance and search would suffer, and you would lose all the benefits of relational data.

I suggest keeping the normalized schema, as you were originally thinking:

Movie (ID)
Actor (ID)
Tag (ID) --horror, comedy, etc.

MovieActor (MovieID, ActorID)
MovieTag (MovieID, TagID)
  • Create indexes as per normal on the associative entities: MovieActor and MovieTag.
  • Load some dummy data in a test environment: 10 million movies, 100 million actors, and 1 million tags. Create associative entries for each as required.
  • Baseline and performance test.
  • Consider horizontal partitioning (sharding) if your metrics show that you need more performance.

Regardless of the number of movies, or whether the data is DNA sequences: implement a design, test it, and judge its performance against your requirements (user acceptance, SLA, etc.).
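The steps above can be tried at a small scale before committing to anything. The following is only a sketch under stated assumptions: it uses Python's built-in `sqlite3` in place of MySQL, the volumes are scaled far down from the figures in the list, and the index names, the `build_test_db` and `time_query` helpers, and the sample query are hypothetical. The timings it produces say nothing about MySQL itself; the point is the workflow of loading dummy data, indexing, and taking a baseline.

```python
import random
import sqlite3
import time


def build_test_db(n_movies=2000, n_actors=5000, n_tags=100,
                  cast_size=20, tags_per_movie=5, seed=0):
    """Create the normalized schema and fill it with random dummy data."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Movie (ID INTEGER PRIMARY KEY);
    CREATE TABLE Actor (ID INTEGER PRIMARY KEY);
    CREATE TABLE Tag (ID INTEGER PRIMARY KEY);
    CREATE TABLE MovieActor (MovieID INTEGER NOT NULL, ActorID INTEGER NOT NULL,
                             PRIMARY KEY (MovieID, ActorID));
    CREATE TABLE MovieTag (MovieID INTEGER NOT NULL, TagID INTEGER NOT NULL,
                           PRIMARY KEY (MovieID, TagID));
    -- Hypothetical reverse indexes for lookups by actor or by tag.
    CREATE INDEX IX_MovieActor_Actor ON MovieActor (ActorID, MovieID);
    CREATE INDEX IX_MovieTag_Tag ON MovieTag (TagID, MovieID);
    """)
    conn.executemany("INSERT INTO Movie VALUES (?)",
                     ((i,) for i in range(1, n_movies + 1)))
    conn.executemany("INSERT INTO Actor VALUES (?)",
                     ((i,) for i in range(1, n_actors + 1)))
    conn.executemany("INSERT INTO Tag VALUES (?)",
                     ((i,) for i in range(1, n_tags + 1)))
    for movie_id in range(1, n_movies + 1):
        # rng.sample returns distinct IDs, so the composite keys never collide.
        cast = rng.sample(range(1, n_actors + 1), cast_size)
        tags = rng.sample(range(1, n_tags + 1), tags_per_movie)
        conn.executemany("INSERT INTO MovieActor VALUES (?, ?)",
                         ((movie_id, a) for a in cast))
        conn.executemany("INSERT INTO MovieTag VALUES (?, ?)",
                         ((movie_id, t) for t in tags))
    conn.commit()
    return conn


def time_query(conn, sql, params=(), repeats=5):
    """Best-of-N wall-clock time in seconds for one query."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        best = min(best, time.perf_counter() - start)
    return best


conn = build_test_db()
elapsed = time_query(conn, "SELECT MovieID FROM MovieTag WHERE TagID = ?", (1,))
print(f"movies for tag 1: best of 5 runs = {elapsed:.6f} s")
```

Re-running the same measurement with larger volumes, and with or without the reverse indexes, gives a baseline to compare against before considering partitioning.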

○闲身 2024-10-15 17:08:14

What's the reason for your aversion to the millions of rows? A perceived performance issue?

It's going to have hundreds of millions of relations somewhere. You do have to capture the mapping between actor and film and, as you say, there are 200-300 million of those (although I don't believe there are 10 million movies in existence).

If you really wanted, you could (for example) pack the ids for actors for a film into multiple columns (or into one column) but that would make searching unpleasant.
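To make "unpleasant" concrete, here is a small sketch comparing a packed column with a junction table. It runs on Python's built-in `sqlite3` rather than MySQL, and the table names (`film_packed`, `film_actor`), the comma-separated packing format, and the sample IDs are all assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Packed variant: all actor IDs for a film in one comma-separated column.
CREATE TABLE film_packed (id INTEGER PRIMARY KEY, actor_ids TEXT NOT NULL);
-- Normalized variant: one row per (film, actor) pair.
CREATE TABLE film_actor (film_id INTEGER NOT NULL, actor_id INTEGER NOT NULL,
                         PRIMARY KEY (film_id, actor_id));
""")

# Hypothetical casts: film ID -> actor IDs.
casts = {1: [1, 2], 2: [13, 21], 3: [1, 13]}
for film_id, actor_ids in casts.items():
    conn.execute("INSERT INTO film_packed VALUES (?, ?)",
                 (film_id, ",".join(map(str, actor_ids))))
    conn.executemany("INSERT INTO film_actor VALUES (?, ?)",
                     [(film_id, a) for a in actor_ids])


def films_naive_like(actor_id):
    # Plain substring match: searching for 1 also hits 13 and 21.
    rows = conn.execute(
        "SELECT id FROM film_packed WHERE actor_ids LIKE ? ORDER BY id",
        (f"%{actor_id}%",))
    return [r[0] for r in rows]


def films_delimited_like(actor_id):
    # Correct, but wraps the column in delimiters on every row and
    # uses a leading wildcard, so every row has to be examined.
    rows = conn.execute(
        "SELECT id FROM film_packed "
        "WHERE ',' || actor_ids || ',' LIKE ? ORDER BY id",
        (f"%,{actor_id},%",))
    return [r[0] for r in rows]


def films_joined(actor_id):
    # Junction table: a plain equality predicate on actor_id.
    rows = conn.execute(
        "SELECT film_id FROM film_actor WHERE actor_id = ? ORDER BY film_id",
        (actor_id,))
    return [r[0] for r in rows]


print(films_naive_like(1))      # [1, 2, 3]  (film 2 is a false positive)
print(films_delimited_like(1))  # [1, 3]
print(films_joined(1))          # [1, 3]
```

The packed version can be made to return the right answer, but only through string manipulation on every row, whereas the junction table answers the same question with a simple equality lookup.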

浮光之海 2024-10-15 17:08:14

10 million movies with 20 to 30 cast members each (although that number sounds higher than real life) will invariably lead to 200-300 million associations. If you're storing your data in a relational database, each association will naturally be one row in a table linking movies to actors. Each row will be very small (two columns: movie PK and actor PK, plus possibly an extra surrogate key column); the bulk of the data will be stored in the movies and actors tables.

Any other solution (in an SQL database) will store the same amount of data in a less optimal format.
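For a sense of scale, a back-of-the-envelope calculation helps. The sketch below assumes both keys are 4-byte integers (the size of a MySQL `INT`) and counts only the raw key data; per-row and index overhead in the storage engine is deliberately left out, so real on-disk size will be larger. The function names are made up for the example.

```python
def association_rows(n_movies, cast_per_movie):
    """Number of movie-to-actor rows for a given average cast size."""
    return n_movies * cast_per_movie


def raw_key_bytes(n_rows, bytes_per_key=4, keys_per_row=2):
    """Raw key payload only; excludes storage-engine and index overhead."""
    return n_rows * bytes_per_key * keys_per_row


low = association_rows(10_000_000, 20)    # 200 million rows
high = association_rows(10_000_000, 30)   # 300 million rows

print(raw_key_bytes(low) / 1e9, "GB")     # 1.6 GB
print(raw_key_bytes(high) / 1e9, "GB")    # 2.4 GB
```

Under these assumptions the key data itself is on the order of a couple of gigabytes, which illustrates why the row count alone is not a reason to avoid the junction table.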
