检查更新的维度数据
我有一个 OLTP 数据库,目前正在创建一个数据仓库。 DW(DimStudents)中有一个维度表,其中包含学生数据,例如地址详细信息、电子邮件、通知设置。
在 OLTP 数据库中,这些数据分布在多个表中(因为它是第三范式的标准 OLTP 数据库)。
目前有 10,390 条记录,但这一数字预计还会增长。
我想使用类型 2 ETL,如果 OLTP 数据库中的记录发生更改,则会将新记录添加到 DW 中。
扫描 DW 中的 10,000 条记录,然后将结果与 OLTP 中包含的多个表中的结果进行比较的最佳方法是什么?
我正在考虑使用 OLTP 数据的临时表创建“快照”,然后将结果逐行与 DW 中维度表中的数据进行比较。
我正在使用 SQL Server 2005。这似乎不是最有效的方法。还有其他选择吗?
I have an OLTP database, and am currently creating a data warehouse. There is a dimension table in the DW (DimStudents) that contains student data such as address details, email, notification settings.
In the OLTP database, this data is spread across several tables (as it is a standard OLTP database in 3rd normal form).
There are currently 10,390 records but this figure is expected to grow.
I want to use Type 2 ETL whereby if a record has changed in the OLTP database, a new record is added to the DW.
What is the best way to scan through 10,000 records in the DW and then compare the results with the results in several tables contained in the OLTP?
I'm thinking of creating a "snapshot" using a temporary table of the OLTP data and then comparing the results row by row with the data in the Dimension table in the DW.
I'm using SQL Server 2005. This doesn't seem like the most efficient way. Are there alternatives?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
将 LastUpdated 引入源系统 (OLTP) 表中。这样您就可以使用以下方法提取更少的内容:
WHERE LastUpdated >= some_time_here
您似乎正在使用 SQL Server,因此您也可以尝试 rowversion 类型(8 字节 db-scope-unique 计数器)
Introduce LastUpdated into source system (OLTP) tables. This way you have less to extract using:
WHERE LastUpdated >= some_time_here
You seem to be using SQL server, so you may also try rowversion type (8 byte db-scope-unique counter)
听起来你正在接近这种倒退。执行 ETL(提取、测试、加载)的典型方法是:
实际上,在第 1 步中,您将通过查询 OLTP 数据库中的多个表来创建物理记录,然后将该结果记录与维度数据进行比较,以确定是否进行了修改。这是做事的标准方式。此外,就数量而言,10000 行是相当微不足道的。任何 RDBMS 和 ETL 进程都应该能够在最多不超过几秒钟的时间内完成处理。我知道 SQL Server 有 DTS,但我不确定该名称在最新版本中是否已更改。这是执行此类操作的完美工具。
It sounds like you are approaching this sort of backwards. The typical way for performing ETL (Extract, Test, Load) is:
Effectively, in step #1, you'll create a physical record via a query against the multiple tables in your OLTP database, then compare that resulting record against your dimensional data to determine if a modification was made. This is the standard way of doing things. In addition, 10000 rows is pretty insignificant as far as volume goes. Any RDBMS and ETL process should be able to process through that in a matter of no more than few seconds at most. I know SQL Server has DTS, although I'm not sure if the name has changed in more recent versions. That is the perfect tool for doing something like this.
您的 OLTP 数据库有审计跟踪吗?
如果是这样,那么您可以在审计跟踪中查询自上次 ETL 以来已触及的记录。
Does you OLTP database have an audit trail?
If so, then you can query the audit trail for just the records that have been touched since the last ETL.