按年份分区与名为 Data_2011、Data_2010 等的单独表的比较
我们正在设计一个大容量 SQL Server 应用程序,其中涉及对指定年份内限制的数据进行处理和报告。
我想到了使用按年分区。
另一个建议是以编程方式创建单独的物理表,其中名称的后缀是年份,并且当需要跨年份报告时,提供一个作为物理表的并集的视图。
我的直觉告诉我,分区就是为了处理这种情况而设计的。使用其他方法有什么优点吗?
We are designing a high volume SQL Server application that involves processing and reporting on data that is restricted within a specified year.
Using Partitioning by year comes to mind.
Another suggestion is to programmatically create separate physical table where the suffix of the name is the year and, when reporting is needed across years, to provide a view which is the union of the physical tables.
My gut tells me that this situation is what partitioning is design to handle. Are there any advantages to using the other approach?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
我觉得使用日期驱动的分区键进行分区就像使用锤子拧入螺丝一样...“这一定是他们发明锤子的原因”...当您需要并行进程在数据中运行时,分区是很好的选择集市或您在某些任意键(例如和身份列)上进行分区。就您而言,业务要求只是保留多年的历史记录。为了使用分区,应用程序团队需要创建一个动态生成分区约束的例程,这就是 DDL,并且是 DBA 团队的职责。多表/联合视图提供了更简单的解决方案。
I feel using partitioning with a date driven paritioning key is like using a hammer to drive in a screw...'that must have been why they invented the hammer'...Partitioning is good when you need parallel processes to run as in data marts or you partition on some arbitrary key e.g. and identity column. In your case, the business requirement is simply to keep multiple years of history. In order to use partitioning, the app team would need to create a routine that dynamically generates the partitioning constraint, which is DDL and is the responsibility of the DBA team. The multi-table/union view provides a much simpler soluiton.
从内部角度来看,方法本质上是相同的。
在幕后,当您创建基于日期的分区时,SQL 引擎会为每个分区创建单独的物理表,然后执行基本的操作
UNION
当您查询表本身时。如果您在与分区字段(例如,
DateField
)对应的分区表的查询中使用过滤器,则引擎可以直接转到数据所需的分区。如果没有,则根据需要搜索逻辑表中的每个物理表以完成查询。如果您的查询将涉及日期过滤器(听起来像是您的问题),那么我认为您的“自定义”方法没有任何优势。
本质上,您需要做出的选择是您想要负责分区中涉及的所有逻辑和极端情况,还是相信几十年来一直在这样做的 Microsoft 开发人员为您做这件事?
的目的,如果有一个内置框架可以实现我想做的事情,那么我总是尝试使用它。它总是比“自行部署”的解决方案更快、更稳定且不易出错。
From an internals perspective, the methods are essentially the same.
Behind the scenes, when you create a date-based partition the SQL engine creates separate physical tables for each partition, then does what is basically a
UNION
when you query the table itself.If you use a filter in your query on the partitioned table that corresponds to your partitioning field (
DateField
let's say), then the engine can go directly to the partition that you need for the data. If not, then it searches each physical table in the logical table as needed to complete the query.If your queries will involve a date filter (which it sounds like they will from your question) then I can think of no advantage to your "custom" method.
Essentially, the choice you need to make is do you want to be responsible for all the logic and corner cases involved in partitioning, or trust the developers at Microsoft who have been doing this for decades to do it for you?
For my own purposes, if there is a built-in framework for something I want to do then I always try to use it. It is invariably faster, more stable, and less error-prone than a "roll-your-own" solution.
这两种解决方案都意味着您必须在数据库中执行一些元数据操作。问题是你是否会对历史数据做一些改变/更新?我正在研究类似的解决方案 - 花蕾,而不是一年,我们只处理半年的数据。在本例中,我们使用按日期分区 - 我们有半年浮动窗口,将 2 年的历史数据 + 当前半年 (HTD) 保存在 10 个分区中(每个分区代表一个单独的季度)。我们每天更新 HTD 数据,每周重述一些历史数据。在本例中,我们只命中了几个分区(分区 id 在 where 子句中定义,分区键是表示维度之一中的日历日期的 date_id)。整个表大约有250M行。每半年该过程就会调整分区,但视图也必须进行同样的调整。使用这种方法,我们始终可以对整个表执行更新(使用视图,您将必须测试更新场景或对单独的表运行更新)。我们有适当的程序可以截断/切换表的指定分区,因此操作很快。
很难说哪个是最好的选择。但一般来说,我建议在您确实不更改历史记录的情况下使用表(我会选择 1 个分区表用于历史记录,1 个表用于当前数据)
Both of the solutions means that you have to do some metadata operations in the db. The question is whether you will do some changes/updates in the historical data? I was working on similar solution - bud instead of a year we were working a half year of the data. In this case we used partitioning by date - we have half year floating window keeping 2 years of historical data + current half year (HTD) in 10 partitions(each partition represents a separate quarter). We were updating the HTD data every day and once a week we were restating some of the historical data. In this case we were hitting only few partitions (the partition id was defined in where clause, the partitioning key was a date_id representing the calendar date in one of our dimensions). The whole table had about 250M of rows. Every half year the process is adjusting the partitioning, but the same you will have to do with the view. Using this approach we can always execute an update against the whole table (using the view you will have to test the the update scenario or run the update against separate tables). We have procedures in place which can truncate / switch out a specified partition of the table so the manipulation is quick.
It is difficult to say which is the best option. But in general I would suggest to use the tables in the case that you really are not changing the history (I would go for 1 partitioned table for history and 1 table for current data)