数据库设计：分区表与规范化表

发布于 2024-10-24 10:00:41 字数 1166 浏览 1 评论 0原文

我的新闻通讯应用程序有两个表：tblIssue 和 tblIssueSubscriber。

这是我的规范化设计：

 tblIssues (newsletter issues masterlist)
 --------------------
 IssueId int PK
 PublisherCode varchar(10)
 IssueDesc varchar(50)


 tblIssueSubscribers (newsletter subscribers)
 -----------------
 IssueId int FK
 EmailAddress varchar(100)

但 tblIssueSubscriber 预计每周保存数十万甚至数百万条记录，并且会被频繁访问，这就是我倾向于表分区的原因。我的设计是根据 PublisherCode 对 tblIssueSubscriber 进行分区（我们的主列表上有 8 个publisherCode）。

 tblIssues
 --------------
 IssueId int PK
 PublisherCode varchar(10)
 IssueDesc varchar(50)


 tblIssueSubscribers
 -----------------
 IssueId int FK
 PublisherCode varchar(10)
 EmailAddress varchar(100)

然后根据 PublisherCode 将其分区

 CREATE PARTITION FUNCTION [PartitionPublisher] (varchar(10)) AS RANGE RIGHT FOR VALUES  ('PUBLISHER1', 'PUBLISHER2', 'PUBLISHER3', 'PUBLISHER4', 'PUBLISHER5', 'PUBLISHER6', 'PUBLISHER7', 'PUBLISHER8');

我知道表分区会增加复杂性，所以我的问题是，

是否值得对 tblIssueSubscriber 进行分区，或者我应该坚持标准化设计？

原文

I have two tables: tblIssue and tblIssueSubscriber for my newsletter application.

This is my normalized design:

 tblIssues (newsletter issues masterlist)
 --------------------
 IssueId int PK
 PublisherCode varchar(10)
 IssueDesc varchar(50)


 tblIssueSubscribers (newsletter subscribers)
 -----------------
 IssueId int FK
 EmailAddress varchar(100)

but tblIssueSubscriber is expected to hold hundred thousands or even millions of record per week and it will be accessed frequently that's why Im leaning towards Table partitioning. My design is to partition the tblIssueSubscriber per PublisherCode (We have 8 publisherCode on our masterlist).

 tblIssues
 --------------
 IssueId int PK
 PublisherCode varchar(10)
 IssueDesc varchar(50)


 tblIssueSubscribers
 -----------------
 IssueId int FK
 PublisherCode varchar(10)
 EmailAddress varchar(100)

and then partitioned it per PublisherCode

 CREATE PARTITION FUNCTION [PartitionPublisher] (varchar(10)) AS RANGE RIGHT FOR VALUES  ('PUBLISHER1', 'PUBLISHER2', 'PUBLISHER3', 'PUBLISHER4', 'PUBLISHER5', 'PUBLISHER6', 'PUBLISHER7', 'PUBLISHER8');

I know that table partitioning adds complexity so my question is,

Is it worth partitioning tblIssueSubscriber, or should I stick to
the normalized design?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

不乱于心 2024-10-31 10:00:41

首先，我认为尺寸是一个转移注意力的因素。这不是一个非常有用的论点，因为所有大小都是相对的，并且有理由使用分区而不管大小。

性能只是部分原因。罗尼斯提出了一些很好的观点，但还不止于此。

对表进行分区有两个原因。一是性能，一是维护。

我们先从维护开始。

一般来说，在数据库中删除是一件“坏”事情。假设您错误地插入了 100 万行，然后删除了 100 万行。这些删除中的每一个都会记录生成 UNDO 和 REDO 记录，这会浪费空间并花费时间，不仅在删除时进行，而且在“播放”以进行时间点恢复时也需要时间。那么还有什么比删除更好呢？截断（或删除）。当您所描述的表不断增长时，在某些时候您希望删除旧记录。这就是为什么我说大小无关紧要 - 如果您想在该表中保留一年，则需要删除超过 12 个月的记录 - 无论大小是多少。添加记录一年后，您可能会拥有一个 300MB 的表或 500GB 的表 - 无论您需要/想要开始删除。因此，您始终可以删除带有 insert_dt < 的行。 sysdate - 365。或者您可以删除或截断该月/日的分区。未记录的事务将占用较少的资源。

还有其他维护好处，例如单独备份分区或重建索引或移动到新表空间等。不确定您使用的是什么 RDBMS，但大多数情况下您可以通过分区交换加载数据。这使您可以在所有数据加载并准备就绪之前对最终表进行任何更改。

就性能而言...

这里的关键是，任何在 where 子句中不包含分区键的查询很可能比分区之前的性能更差。这不是 GO_FASTER = TRUE 类型的设置。我见过人们实施分区并破坏他们的系统。 Ronnis 的帖子是单个分区表中性能指南的基础知识。如果您有多个表在同一键上分区，某些 RDBMS 可以并行化它们之间的连接。

First I think Size is a Red Herring. It's not a very useful argument since all size is relative and there are reasons to use partition irrespective of size.

Performance is only part of the reason. Ronnis makes some good points but it doesn't stop there.

There are two reasons to partition a table. One is performance, one is maintenance.

Let's start with maintenance.

In general DELETE is a 'bad' thing to do in a database. Say you mistakenly insert 1 million rows and then delete 1 million rows. Each of those deletes is logged generating UNDO and REDO records, which waste space and take time not only to make while deleting but again when 'played' for a point-in-time recovery. So what's better than delete? Truncate (or drop). When you have tables as you describe that are constantly growing, at some point you'd like to get rid of old records. This is why I say size is irrelevant - if you want to keep a year in that table, you'll need to remove records that are more than 12 months old - NO MATTER WHAT THAT SIZE IS. You could have a 300MB table or a 500GB table after 1 year of adding records - regardless you'll need/want to start deleting. So you can always just delete the rows with insert_dt < sysdate - 365. Or you could just drop or truncate that month/day's partition. A not logged transaction that will be less resource intensive.

There are other maintenance benefits like individually backing up partition or rebuilding indexes or moving to new tablespaces etc. Not sure what RDBMS you're using but you can load data via partition swaps in most. This allows you to make no changes to your final tables until all of the data is loaded and ready to go.

As far as performance goes...

The key here is that any query that doesn't include the partition key in the where clause will most likely perform worse than it did before partitioning. This isn't a GO_FASTER = TRUE type of setting. I've seen people implement partitioning and crush their systems. Ronnis' post is the basics of performance guideline in a single partitioned tables. If you have more than one table partitioned on the same key, some RDBMS's can parallelize the joins between them.

回复收藏 0 原文