Which is more efficient: multiple MySQL tables or one big table?

Published on 2024-07-26 10:43:08

I store various user details in my MySQL database. Originally the data was spread across several tables linked by UserId, and it was output via sometimes complicated queries to display and manipulate it as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.

  • Is this going to be a help or hindrance?
  • Speed considerations in calling, updating or searching/manipulating?

Here's an example of some of my table structure(s):

  • users - UserId, username, email, encrypted password, registration date, ip
  • user_details - cookie data, name, address, contact details, affiliation, demographic data
  • user_activity - contributions, last online, last viewing
  • user_settings - profile display settings
  • user_interests - advertising targetable variables
  • user_levels - access rights
  • user_stats - hits, tallies

Edit: I've upvoted all answers so far; they all have elements that essentially answer my question.

Most of the tables have a 1:1 relationship which was the main reason for denormalising them.

Are there going to be issues if the table spans 100+ columns, when a large portion of these cells are likely to remain empty?
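
For illustration, a minimal sketch of the split layout described above might look like the following; the column types and constraints are my assumptions, not taken from the question:

```sql
-- Hypothetical sketch of two of the normalized tables above.
CREATE TABLE users (
  UserId INT UNSIGNED NOT NULL AUTO_INCREMENT,
  username VARCHAR(64) NOT NULL,
  email VARCHAR(255) NOT NULL,
  password_hash CHAR(60) NOT NULL,
  registration_date DATETIME NOT NULL,
  ip VARBINARY(16) DEFAULT NULL,
  PRIMARY KEY (UserId)
) ENGINE=InnoDB;

CREATE TABLE user_settings (
  UserId INT UNSIGNED NOT NULL,   -- 1:1 with users
  profile_display VARCHAR(32) NOT NULL DEFAULT 'public',
  PRIMARY KEY (UserId),
  FOREIGN KEY (UserId) REFERENCES users (UserId)
) ENGINE=InnoDB;
```

The "one big table" alternative would fold the columns of user_settings (and the other satellite tables) directly into users.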

Comments (10)

千秋岁 2024-08-02 10:43:09

Multiple tables help in the following ways / cases:

(a) if different people are going to be developing applications involving different tables, it makes sense to split them.

(b) If you want to give different kinds of authority to different people for different parts of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and granting authorization on them appropriately.)

(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.

(d) A smaller footprint may give comfort while you develop applications against a specific data collection of a single entity.

(e) It is a possibility: what you thought of as single-valued data may turn out to be multi-valued in the future. E.g. a credit limit is a single-value field as of now, but tomorrow you may decide to change it to (date from, date to, credit value). Split tables may come in handy then.

My vote would be for multiple tables - with data appropriately split.

Good luck.
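
Point (b) above can be sketched as a view plus a grant; the database, view, and account names here are hypothetical:

```sql
-- Expose only non-sensitive columns through a view,
-- then authorize a reporting account on the view alone.
CREATE VIEW user_public AS
  SELECT UserId, username, registration_date
  FROM users;

GRANT SELECT ON mydb.user_public TO 'report_user'@'localhost';
```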

猫腻 2024-08-02 10:43:09

Combining the tables is called denormalizing.

It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.

MySQL is capable of using only one JOIN method, namely NESTED LOOPS.

This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.

Locating a record is quite a costly operation which may take dozens of times as long as pure record scanning.

Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.

If you have lots of records in the other tables, then the increase in table-scan time can outweigh the benefit of the records being scanned sequentially.

Maintenance hell, on the other hand, is guaranteed.
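
As a rough illustration of the trade-off described above (table and column names are assumed from the question):

```sql
-- Split design: each joined table costs an extra record lookup
-- per driving row.
SELECT u.username, d.name, s.profile_display
FROM users u
JOIN user_details  d ON d.UserId = u.UserId
JOIN user_settings s ON s.UserId = u.UserId
WHERE u.UserId = 42;

-- Denormalized design: a single primary-key lookup, but against a
-- wider table whose full scans take correspondingly longer.
SELECT username, name, profile_display
FROM users_big
WHERE UserId = 42;
```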

楠木可依 2024-08-02 10:43:09

Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.

Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and look only at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.

All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, is accessed very often but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate but saving them in a session variable of their own, so that you don't have to query the database for them very often.

み零 2024-08-02 10:43:09

Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.

Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.

ETA:

If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).

If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
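
One way to check the 1-to-1 assumption before merging a table; user_stats is taken from the question, the query itself is just a sketch:

```sql
-- Any rows returned mean user_stats is NOT 1:1 with users,
-- so merging it into users would duplicate user data.
SELECT UserId, COUNT(*) AS rows_per_user
FROM user_stats
GROUP BY UserId
HAVING COUNT(*) > 1;
```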

戏剧牡丹亭 2024-08-02 10:43:09

Creating one massive table goes against relational database principles. I wouldn't combine all of them into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See the Wikipedia page for database normalization.

Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...

a large portion of these cells are
likely to remain empty

If, for example, a user didn't have any interests, then if you normalize you simply won't have a row in the interests table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.

I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) requiring no joins or calculations that reports could point to. These were then used in conjunction with SQL Server Agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week, and so on).

倥絔 2024-08-02 10:43:09

Why not use the same approach WordPress does: have a users table with the basic user information that everyone has, then add a "user_meta" table that can hold basically any key/value pair associated with the user id. If you need all the meta information for a user, you can just add it to your query. You also don't have to add the extra query when it isn't needed, for things like logging in. The benefit of this approach is that it leaves your schema open to adding new features for your users, such as storing their Twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all metadata, and you limit it to only one association instead of 50.

WordPress specifically does this to allow features to be added via plugins. It makes your project more extensible and means you won't need a complete database overhaul when you add a new feature.
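
A minimal sketch of the key/value approach described above; WordPress's wp_usermeta has a similar shape, but the names and types here are illustrative assumptions:

```sql
CREATE TABLE user_meta (
  meta_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  UserId     INT UNSIGNED NOT NULL,
  meta_key   VARCHAR(255) NOT NULL,
  meta_value TEXT,
  PRIMARY KEY (meta_id),
  KEY idx_user_key (UserId, meta_key)
) ENGINE=InnoDB;

-- Adding a new per-user attribute needs no schema change:
INSERT INTO user_meta (UserId, meta_key, meta_value)
VALUES (42, 'twitter_handle', '@example');
```

The trade-off is that values lose their native types and constraints, which is part of why this pattern suits optional, plugin-style attributes rather than core fields.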

清引 2024-08-02 10:43:09

I think this is one of those "it depends" situations. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.

巡山小妖精 2024-08-02 10:43:09

I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited depends on your demands.

If you have one table with good indexes, that would probably be faster. But on the other hand it would probably be more difficult to maintain.

To me it looks like you could skip user_details, as it is probably a 1-to-1 relation with users.
But the rest probably have a lot of rows per user?

心如荒岛 2024-08-02 10:43:09

Performance considerations on big tables

"Likes" and "views" (etc.) are one of the very few valid cases for a 1:1 relationship, for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity, and vice versa.
Bottom line: separate frequent counters in very big and busy tables.

Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.

Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then has to work harder to continue filtering on columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.

TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that the extra cost of fetching them may involve extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.
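
The "1:rarely" pattern above might look like this; the tables come from the question, but the counter column is an assumption:

```sql
-- Fetch a user plus an optional counter row,
-- turning a missing row's NULL into 0.
SELECT u.UserId,
       u.username,
       COALESCE(s.hits, 0) AS hits
FROM users u
LEFT JOIN user_stats s ON s.UserId = u.UserId
WHERE u.UserId = 42;
```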

决绝 2024-08-02 10:43:09

Old topic, but I thought I would add some thoughts.

  1. MySQL is documented as a database designed for OLTP

MySQL Standard Edition enables you to deliver high-performance and scalable Online Transaction Processing (OLTP) applications. It provides the ease of use that has made MySQL famous along with industrial strength performance and reliability.

  2. Heavy normalization is part of OLTP. To illustrate, I prepared a comparison between two insert operations. The tables and data are based on the Sakila sample database. Case one deals with the original sakila.film table; case two is sakila.film denormalized by adding columns from language, category, film_category, actor and film_actor. That results in many values being redundant (aka denormalization).

Set some metadata

SET NAMES utf8mb4;
SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='TRADITIONAL';
USE sakila;

Tables prep

CREATE TABLE film_olap (
  film_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  title VARCHAR(128) NOT NULL,
  description TEXT DEFAULT NULL,
  release_year YEAR DEFAULT NULL,
  language_name CHAR(20) NOT NULL,
  language_name_last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  category_id TINYINT UNSIGNED NOT NULL,
  category_name VARCHAR(25) NOT NULL,
  category_last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  film_id_last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  original_language_id TINYINT UNSIGNED DEFAULT NULL,
  rental_duration TINYINT UNSIGNED NOT NULL DEFAULT 3,
  rental_rate DECIMAL(4,2) NOT NULL DEFAULT 4.99,
  length SMALLINT UNSIGNED DEFAULT NULL,
  replacement_cost DECIMAL(5,2) NOT NULL DEFAULT 19.99,
  rating ENUM('G','PG','PG-13','R','NC-17') DEFAULT 'G',
  special_features SET('Trailers','Commentaries','Deleted Scenes','Behind the Scenes') DEFAULT NULL,
  actor_id SMALLINT UNSIGNED NOT NULL,
  actor_first_name VARCHAR(45) NOT NULL,
  actor_last_name VARCHAR(45) NOT NULL,
  actor_last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  film_actor_last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY  (film_id)
  -- KEY idx_title (title),
  -- KEY idx_fk_original_language_id (original_language_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;


CREATE TABLE film_oltp (
  film_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  title VARCHAR(128) NOT NULL,
  description TEXT DEFAULT NULL,
  release_year YEAR DEFAULT NULL,
  language_id TINYINT UNSIGNED NOT NULL,
  original_language_id TINYINT UNSIGNED DEFAULT NULL,
  rental_duration TINYINT UNSIGNED NOT NULL DEFAULT 3,
  rental_rate DECIMAL(4,2) NOT NULL DEFAULT 4.99,
  length SMALLINT UNSIGNED DEFAULT NULL,
  replacement_cost DECIMAL(5,2) NOT NULL DEFAULT 19.99,
  rating ENUM('G','PG','PG-13','R','NC-17') DEFAULT 'G',
  special_features SET('Trailers','Commentaries','Deleted Scenes','Behind the Scenes') DEFAULT NULL,
  last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY  (film_id)
  -- KEY idx_title (title),
  -- KEY idx_fk_language_id (language_id),
  -- KEY idx_fk_original_language_id (original_language_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Procedures

delimiter //
CREATE PROCEDURE insert_to_olap()
BEGIN
  DECLARE total INT unsigned DEFAULT 0;
  WHILE total <= 1000000 DO
    INSERT INTO `film_olap` (`title`,`description`,`release_year`,`language_name`,`language_name_last_update`,`category_id`,`category_name`,`category_last_update`,`film_id_last_update`,`original_language_id`,`rental_duration`, `rental_rate`,`length`,`replacement_cost`,`rating`,`special_features`,`actor_id`,`actor_first_name`,`actor_last_name`, `actor_last_update`,`film_actor_last_update`,`last_update`) VALUES ('ACADEMY DINOSAUR','A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies',2006,'English','2006-02-15 05:02:19',10,'Games','2006-02-15 04:46:27','2006-02-15 05:07:09',NULL,6,'0.99',86,'20.99','PG','Deleted Scenes,Behind the Scenes',1,'PENELOPE','CRUZ','2006-02-15 04:34:33','2006-02-15 05:05:03','2006-02-15 05:03:42');
    SET total = total + 1;
  END WHILE;
END//
delimiter ;

delimiter //
CREATE PROCEDURE insert_to_oltp()
BEGIN
  DECLARE total INT unsigned DEFAULT 0;
  WHILE total <= 1000000 DO
INSERT INTO `film_oltp` (`title`,`description`,`release_year`,`language_id`,`original_language_id`,`rental_duration`,`rental_rate`,`length`,`replacement_cost`,`rating`,`special_features`,`last_update`) VALUES ('ACADEMY DINOSAUR','A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies',2006,1,NULL,6,'0.99',86,'20.99','PG','Deleted Scenes,Behind the Scenes','2006-02-15 05:03:42');
    SET total = total + 1;
  END WHILE;
END//
delimiter ;

The test

-- 1101.016sec
CALL insert_to_olap();

RESTART;
-- 
-- 1089.844sec
CALL insert_to_oltp();

Outcomes

Comparing insert speed between a normalized and a denormalized table seems futile, as the two approaches are different architectural patterns, shaped by their final usage. Performance is probably similar because both use the exact same engine. Comparing OLTP MySQL with Snowflake's OLAP would be futile as well, since we would be comparing architectures that serve different purposes.

OLTP is used in an app's backend. It must handle insert/update/delete operations; heavy normalization keeps data from being redundant, but it is not designed for aggregation and analytics because of the number of joins required.

OLAP, on the other hand, is part of the Business Intelligence domain. It handles aggregation with calculations and views, and promotes denormalization to speed up queries and improve readability, because denormalized (aka dimensional, per Kimball) data is not spread around multiple small tables.

Summary

You should start db design by asking yourself:

"Is my database working in the backend or will it support analytic domain of my company/project"

Once above is established, you can dive in into details of OLTP or OLAP.

Use case

Some of the questions that could lead to the OLTP/OLAP decision would be:

How often is data changed?

How many read-calls are done?

What is the daily growth of new users?

and much more...
