What is the optimal size (number of rows) at which a table really benefits from partitioning?
For example, suppose we have a table with 4 million rows. It has a STATUS field that can take the following values: TO_WORK, BLOCKED or WORKED_CORRECTLY.
Would you partition on a field whose value changes only once (most of the time from TO_WORK to WORKED_CORRECTLY)? How many partitions would you create?
The absolute number of rows in a partition is not the most useful metric. What you really want is a column which is stable as the table grows, and which delivers on the potential benefits of partitioning. These are: availability, tablespace management and performance.
For instance, your example column has three values. That means you can have three partitions, which means you can have three tablespaces. So if a tablespace becomes corrupt you lose one third of your data. Has partitioning made your table more available? Not really.
Adding or dropping a partition makes it easier to manage large volumes of data. But are you ever likely to drop all the rows with a status of WORKED_CORRECTLY? Highly unlikely. Has partitioning made your table more manageable? Not really.
The performance benefits of partitioning come from query pruning, where the optimizer can discount chunks of the table immediately. Now each partition has 1.3 million rows. So even if you query on STATUS='WORKED_CORRECTLY' you still have a huge number of records to winnow. And the chances are, any query which doesn't involve STATUS will perform worse than it did against the unpartitioned table. Has partitioning made your table more performant? Probably not.
So far, I have been assuming that your partitions are evenly distributed. But your final question indicates that this is not the case. Most rows, if not all, will end up in the WORKED_CORRECTLY partition. So that partition will become enormous compared to the others, and the chances of benefits from partitioning become even more remote.
Finally, your proposed scheme is not elastic. At the current volume each partition would have 1.3 million rows. When your table grows to forty million rows in total, each partition will hold 13.3 million rows. This is bad.
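The arithmetic behind the elasticity point can be checked with a short Python sketch. It compares the average partition size under the three-value STATUS scheme with a hypothetical monthly scheme. The month counts (12 months of data at 4 million rows, 120 months at 40 million) are assumptions chosen purely for illustration, as are the function and constant names; they are not figures from the question.

```python
# Illustrative only: average rows per partition as the table grows.
# The monthly figures below are assumed, not taken from the question.

STATUS_PARTITIONS = 3  # TO_WORK, BLOCKED, WORKED_CORRECTLY


def rows_per_partition(total_rows: int, partition_count: int) -> int:
    """Average rows per partition, assuming rows are spread evenly."""
    return total_rows // partition_count


def compare(total_rows: int, months_of_data: int) -> tuple:
    """Return (rows per STATUS partition, rows per monthly partition)."""
    by_status = rows_per_partition(total_rows, STATUS_PARTITIONS)
    by_month = rows_per_partition(total_rows, months_of_data)
    return by_status, by_month


# Assumed growth: the same arrival rate, ten times as much history.
small = compare(4_000_000, months_of_data=12)
large = compare(40_000_000, months_of_data=120)

print("4M rows : status=%d, monthly=%d" % small)
print("40M rows: status=%d, monthly=%d" % large)
```

Under these assumptions the STATUS partitions grow tenfold with the table, while the monthly partitions stay the same size because the number of partitions grows instead.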
So, what makes a good candidate for a partition key? One which produces lots of partitions, one where the partitions are roughly equal in size, one where the value of the key is unlikely to change, one where the value has some meaning in the life-cycle of the underlying object, and finally one which is useful in the bulk of queries run against the table.
This is why something like DATE_CREATED is such a popular choice for partitioning of fact tables in data warehouses. It generates a sensible number of partitions across a range of granularities (day, month, or year are the usual choices). We get roughly the same number of records created in a given time span. Data loading and data archiving are usually done on the basis of age (i.e. creation date). BI queries almost invariably include the TIME dimension.
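To make the pruning idea concrete, here is a minimal Python simulation of monthly partitions keyed on a creation date. It is a toy model, not how a database implements partitioning: the row layout, the `date_created` field name, and the helper functions are all hypothetical, and the sample data is synthetic.

```python
from collections import defaultdict
from datetime import date


def partition_key(created):
    """Map a creation date to its (year, month) partition."""
    return (created.year, created.month)


def build_partitions(rows):
    """Group rows into monthly partitions by their creation date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row["date_created"])].append(row)
    return partitions


def query_range(partitions, start, end):
    """Return matching rows and the number of partitions actually scanned."""
    lo, hi = partition_key(start), partition_key(end)
    scanned = 0
    hits = []
    for key, rows in partitions.items():
        if lo <= key <= hi:  # pruning: partitions outside the range are skipped
            scanned += 1
            hits.extend(r for r in rows if start <= r["date_created"] <= end)
    return hits, scanned


# Synthetic data: 1200 rows spread evenly over the twelve months of 2023.
rows = [
    {"id": i, "date_created": date(2023, 1 + i % 12, 1 + i % 28)}
    for i in range(1200)
]
partitions = build_partitions(rows)
hits, scanned = query_range(partitions, date(2023, 3, 1), date(2023, 3, 31))
print(len(partitions), "partitions;", scanned, "scanned;", len(hits), "rows returned")
```

A query restricted to one month touches one partition out of twelve, which is the benefit a date-based key offers when most queries carry a time predicate.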
The number of rows in a table isn't generally a great metric to use to determine whether and how to partition the table.
What problem are you trying to solve? Are you trying to improve query performance? Performance of data loads? Performance of purging your data?
Assuming you are trying to improve query performance: do all your queries have predicates on the STATUS column? Are they doing single-row lookups? Or would you want your queries to scan an entire partition?