将 OLTP 关系型数据库转变为数据仓库模型

发布于 2024-07-19 05:55:13 字数 217 浏览 12 评论 0原文

将数据从典型的实体关系 OLTP 数据库模型加载到 Kimball 星型模式数据仓库/集市模型中时采用的常见设计方法是什么?

  • 您是否使用暂存区来执行转换,然后加载到仓库中?
  • 如何链接仓库和 OLTP 数据库之间的数据?
  • 在哪里/如何管理转换过程 - 在数据库中作为 sprocs、dts/ssis 包或来自应用程序代码的 SQL?

What are the common design approaches taken in loading data from a typical Entity-Relationship OLTP database model into a Kimball star schema Data Warehouse/Marts model?

  • Do you use a staging area to perform the transformation and then load into the warehouse?
  • How do you link data between the warehouse and the OLTP database?
  • Where/How do you manage the transformation process - in the database as sprocs, dts/ssis packages, or SQL from application code?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

迎风吟唱 2024-07-26 05:55:14

我目前正在开发一个中小型数据仓库。 我们正在采用 Kimball 提出的一些概念,即带有事实和维度表的星型方案。 我们对其进行构造,以便事实仅连接到维度(而不是事实到事实或维度到维度 - 但这是我们的选择,而不是说它应该这样做),因此我们将所有维度连接展平到事实表。

我们使用 SSIS 将数据从生产数据库中移动 -> 源数据库-> 暂存数据库 -> 报告数据库(我们可能可以使用更少的数据库,但这就是它的下降方式)。

SSIS 非常好,因为它可以让您非常有逻辑地构建数据流。 我们使用 SSIS 组件和存储过程的组合,其中 SSIS 的一个很好的功能是能够提供 SQL 命令作为源/目标数据流之间的转换。 这意味着我们可以根据需要在每一行上调用存储过程,这可能很有用(尽管速度有点慢)。

我们还使用了一个名为更改数据捕获 (CDC) 的新 SQL Server 2008 功能,它允许您审核表上的所有更改(您可以指定要查看这些表中的哪些列),因此我们在生产数据库告诉我们发生了什么变化,这样我们就可以将这些记录移动到源数据库进行处理。

I'm currently working on a small/mid size dataware house. We're adopting some of the concepts that Kimball puts forward, i.e. the star scheme with fact and dimension tables. We structure it so that facts only join to dimensions (not fact to fact or dimension to dimension - but this is our choice, not saying it's the way it should be done), so we flatten all dimension joins to the fact table.

We use SSIS to move the data from the production DB -> source DB -> staging DB -> reporting DB (we probably could have have used less DBs, but that's the way it's fallen).

SSIS is really nice as it's lets you structure your data flows very logically. We use a combination of SSIS components and stored procs, where one nice feature of SSIS is the ability to provide SQL commands as a transform between a source/destination data-flow. This means we can call stored procs on every row if we want, which can be useful (albeit a bit slower).

We're also using a new SQL Server 2008 feature called change data capture (CDC) which allows you to audit all changes on a table (you can specify which columns you want to look at in those tables), so we use that on the production DB to tell what has changed so we can move just those records across to the source DB for processing.

乖乖兔^ω^ 2024-07-26 05:55:14

我同意高度评价的答案,但我想添加以下内容:

* 您是否使用暂存区来执行转换,然后 
  

装入仓库?

是否需要分期取决于转换的类型。 分段提供了将 ETL 分解为更易于管理的块的好处,而且还提供了一个工作区域,允许在不影响仓库的情况下对数据进行操作。 它可以帮助在暂存区域中进行(至少)一些维度查找,该暂存区域存储来自 OLTP 系统的键和最新暗淡记录的键,以在加载事实记录时用作查找。
该转换发生在 ETL 过程本身中,但在此过程中可能需要也可能不需要一些阶段来帮助它。

* 如何链接仓库和 OLTP 数据库之间的数据? 
  

将业务键(或实际主键,如果可用)加载到数据仓库中作为 OLTP 系统的引用非常有用。 此外,DW 进程中的审核应通过记录加载该数据的加载进程来记录每一位数据的沿袭。

* 您在哪里/如何管理转换过程 - 在 
  

数据库作为 sprocs、dts/ssis 包、
或来自应用程序代码的 SQL?

这通常位于 SSIS 包中,但在源查询中进行转换通常性能更高。 不幸的是,这使得源查询变得非常难以理解和维护,因此如果性能不是问题,那么在 SSIS 代码中进行转换是最好的。 当您执行此操作时,这是拥有暂存区域的另一个原因,因为这样您可以在不同表之间的源查询中进行更多联接。

I agree with the highly rated answer but thought I'd add the following:

* Do you use a staging area to perform the transformation and then

load into the warehouse?

It depends on the type of transformation whether it will require staging. Staging offers benefits of breaking the ETL into more manageable chunks, but also provides a working area that allows manipulations to take place on the data without affecting the warehouse. It can help to have (at least) some dimension lookups in a staging area which store the keys from the OLTP system and the key of the latest dim record, to use as a lookup when loading your fact records.
The transformation happens in the ETL process itself, but it may or may not require some staging to help it along the way.

* How do you link data between the warehouse and the OLTP database?

It is useful to load the business keys (or actual primary keys if available) into the data warehouse as a reference back to the OLTP system. Also, auditing in the DW process should record the lineage of each bit of data by recording the load process that has loaded it.

* Where/How do you manage the transformation process - in the

database as sprocs, dts/ssis packages,
or SQL from application code?

This would typically be in SSIS packages, but often it is more performant to transform in the source query. Unfortunately this makes the source query quite complicated to understand and therefore maintain, so if performance is not an issue then transforming in the SSIS code is best. When you do this, this is another reason for having a staging area as then you can make more joins in the source query between different tables.

温柔少女心 2024-07-26 05:55:14

约翰·桑德斯的流程解释很好。

如果您希望在 SQL Server 中实现数据仓库项目,您将在优秀文本“Microsoft 数据仓库工具包 ”。

有趣的是,作者之一是拉尔夫·金博尔:-)

John Saunders' process explanation is a good.

If you are looking to implement a Datawarehouse project in SQL Server you will find all the information you require for the delivering the entire project within the excellent text "The Microsoft Data Warehouse Toolkit".

Funilly enough, one of the authors is Ralph Kimball :-)

十年不长 2024-07-26 05:55:14

您可能需要查看数据仓库建模。 它声称解决了一些长期问题,例如更改属性。

You may want to take a look at Data Vault Modeling. It claims solving some loner term issues like changing attributes.

疏忽 2024-07-26 05:55:13

就我个人而言,我倾向于按以下方式工作:

  1. 首先设计数据仓库。 特别是,设计作为 DW 一部分所需的表,忽略任何临时表。
  2. 使用 SSIS 设计 ETL,但有时使用 SSIS 调用相关数据库中的存储过程。
  3. 如果 ETL 中需要任何暂存表,那很好,但同时要确保它们得到清理。 仅用作单个 ETL 步骤系列一部分的临时表应在这些步骤完成后被截断,无论成功与否。
  4. 我让 SSIS 包引用 OLTP 数据库,至少将数据拉入临时表中。 根据情况,他们可以将 OLTP 表直接处理到数据仓库中。 所有此类查询均通过WITH(NOLOCK) 执行。
  5. 文件,文件,文件。 明确每个包使用哪些输入以及输出的去向。 确保记录选择输入的标准(过去 24 小时?自上次成功以来?新的标识值?所有行?)

这对我来说效果很好,尽管我承认我没有做过很多这样的项目,也没有任何非常大的。

Personally, I tend to work as follows:

  1. Design the data warehouse first. In particular, design the tables that are needed as part of the DW, ignoring any staging tables.
  2. Design the ETL, using SSIS, but sometimes with SSIS calling stored procedures in the involved databases.
  3. If any staging tables are required as part of the ETL, fine, but at the same time make sure they get cleaned up. A staging table used only as part of a single series of ETL steps should be truncated after those steps are completed, with or without success.
  4. I have the SSIS packages refer to the OLTP database at least to pull data into the staging tables. Depending on the situation, they may process the OLTP tables directly into the data warehouse. All such queries are performed WITH(NOLOCK).
  5. Document, Document, Document. Make it clear what inputs are used by each package, and where the output goes. Make sure to document the criteria by which the input are selected (last 24 hours? since last success? new identity values? all rows?)

This has worked well for me, though I admit I haven't done many of these projects, nor any really large ones.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文