将 OLTP 关系型数据库转变为数据仓库模型
将数据从典型的实体关系 OLTP 数据库模型加载到 Kimball 星型模式数据仓库/集市模型中时采用的常见设计方法是什么?
- 您是否使用暂存区来执行转换,然后加载到仓库中?
- 如何链接仓库和 OLTP 数据库之间的数据?
- 在哪里/如何管理转换过程 - 在数据库中作为 sprocs、dts/ssis 包或来自应用程序代码的 SQL?
What are the common design approaches taken in loading data from a typical Entity-Relationship OLTP database model into a Kimball star schema Data Warehouse/Marts model?
- Do you use a staging area to perform the transformation and then load into the warehouse?
- How do you link data between the warehouse and the OLTP database?
- Where/How do you manage the transformation process - in the database as sprocs, dts/ssis packages, or SQL from application code?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
我目前正在开发一个中小型数据仓库。 我们正在采用 Kimball 提出的一些概念,即带有事实和维度表的星型方案。 我们对其进行构造,以便事实仅连接到维度(而不是事实到事实或维度到维度 - 但这是我们的选择,而不是说它应该这样做),因此我们将所有维度连接展平到事实表。
我们使用 SSIS 将数据从生产数据库中移动 -> 源数据库-> 暂存数据库 -> 报告数据库(我们可能可以使用更少的数据库,但这就是它的下降方式)。
SSIS 非常好,因为它可以让您非常有逻辑地构建数据流。 我们使用 SSIS 组件和存储过程的组合,其中 SSIS 的一个很好的功能是能够提供 SQL 命令作为源/目标数据流之间的转换。 这意味着我们可以根据需要在每一行上调用存储过程,这可能很有用(尽管速度有点慢)。
我们还使用了一个名为更改数据捕获 (CDC) 的新 SQL Server 2008 功能,它允许您审核表上的所有更改(您可以指定要查看这些表中的哪些列),因此我们在生产数据库告诉我们发生了什么变化,这样我们就可以将这些记录移动到源数据库进行处理。
I'm currently working on a small/mid size dataware house. We're adopting some of the concepts that Kimball puts forward, i.e. the star scheme with fact and dimension tables. We structure it so that facts only join to dimensions (not fact to fact or dimension to dimension - but this is our choice, not saying it's the way it should be done), so we flatten all dimension joins to the fact table.
We use SSIS to move the data from the production DB -> source DB -> staging DB -> reporting DB (we probably could have have used less DBs, but that's the way it's fallen).
SSIS is really nice as it's lets you structure your data flows very logically. We use a combination of SSIS components and stored procs, where one nice feature of SSIS is the ability to provide SQL commands as a transform between a source/destination data-flow. This means we can call stored procs on every row if we want, which can be useful (albeit a bit slower).
We're also using a new SQL Server 2008 feature called change data capture (CDC) which allows you to audit all changes on a table (you can specify which columns you want to look at in those tables), so we use that on the production DB to tell what has changed so we can move just those records across to the source DB for processing.
我同意高度评价的答案,但我想添加以下内容:
是否需要分期取决于转换的类型。 分段提供了将 ETL 分解为更易于管理的块的好处,而且还提供了一个工作区域,允许在不影响仓库的情况下对数据进行操作。 它可以帮助在暂存区域中进行(至少)一些维度查找,该暂存区域存储来自 OLTP 系统的键和最新暗淡记录的键,以在加载事实记录时用作查找。
该转换发生在 ETL 过程本身中,但在此过程中可能需要也可能不需要一些阶段来帮助它。
将业务键(或实际主键,如果可用)加载到数据仓库中作为 OLTP 系统的引用非常有用。 此外,DW 进程中的审核应通过记录加载该数据的加载进程来记录每一位数据的沿袭。
这通常位于 SSIS 包中,但在源查询中进行转换通常性能更高。 不幸的是,这使得源查询变得非常难以理解和维护,因此如果性能不是问题,那么在 SSIS 代码中进行转换是最好的。 当您执行此操作时,这是拥有暂存区域的另一个原因,因为这样您可以在不同表之间的源查询中进行更多联接。
I agree with the highly rated answer but thought I'd add the following:
It depends on the type of transformation whether it will require staging. Staging offers benefits of breaking the ETL into more manageable chunks, but also provides a working area that allows manipulations to take place on the data without affecting the warehouse. It can help to have (at least) some dimension lookups in a staging area which store the keys from the OLTP system and the key of the latest dim record, to use as a lookup when loading your fact records.
The transformation happens in the ETL process itself, but it may or may not require some staging to help it along the way.
It is useful to load the business keys (or actual primary keys if available) into the data warehouse as a reference back to the OLTP system. Also, auditing in the DW process should record the lineage of each bit of data by recording the load process that has loaded it.
This would typically be in SSIS packages, but often it is more performant to transform in the source query. Unfortunately this makes the source query quite complicated to understand and therefore maintain, so if performance is not an issue then transforming in the SSIS code is best. When you do this, this is another reason for having a staging area as then you can make more joins in the source query between different tables.
约翰·桑德斯的流程解释很好。
如果您希望在 SQL Server 中实现数据仓库项目,您将在优秀文本“Microsoft 数据仓库工具包 ”。
有趣的是,作者之一是拉尔夫·金博尔:-)
John Saunders' process explanation is a good.
If you are looking to implement a Datawarehouse project in SQL Server you will find all the information you require for the delivering the entire project within the excellent text "The Microsoft Data Warehouse Toolkit".
Funilly enough, one of the authors is Ralph Kimball :-)
您可能需要查看数据仓库建模。 它声称解决了一些长期问题,例如更改属性。
You may want to take a look at Data Vault Modeling. It claims solving some loner term issues like changing attributes.
就我个人而言,我倾向于按以下方式工作:
这对我来说效果很好,尽管我承认我没有做过很多这样的项目,也没有任何非常大的。
Personally, I tend to work as follows:
This has worked well for me, though I admit I haven't done many of these projects, nor any really large ones.