Brief Introduction to Method of Change Data Capture
Synopsis: The data warehouse system always needs to be refreshed by extracting the change data from the source databases. There are a number of strategies and methods often used to achieve this summarized as below:
1. Table (File) Differencing Full Scan
2. Value-change selection base on Timestamp
3. Base on DBMS Log
4. Base on Trigger Mechanism
5. Base on Replication Technology
All of these methods have their own situations for usage. Many DBMS have their own methods and tools to support Change Data Capture (CDC) and of course we can use our own program to capture the change data from source systems.
Key Word: CDC (Change Data Capture)
1. Table (File) Differencing Full Scan
Table differencing involves transporting a copy of an entire table from the source (production) database to the staging database (where the change data is captured), where an older version of the table already exists. Often a “MINUS” method is adopted to identify the difference between the old version and the new version, or the target content is just replaced simply. This method is very simple to develop and it will be effective if the data volume is not very large.
However, there are several problems with this method:
• It requires that the new version of the entire table be transported to the staging database, not just the change data, thereby greatly increasing transport costs.
• The computational cost of performing the two MINUS operations on the staging database can be very high.
• Table differencing cannot capture data that have reverted to their old values. For example, suppose the price of a product changes several times between the old version and the new version of the product's table. If the price in the new version ends up being the same as the old, table differencing cannot detect that the price has fluctuated. Moreover, any intermediate price values between the old and new versions of the product's table cannot be captured using table differencing.
• There is no way to determine which changes were made as part of the same transaction. For example, suppose a sales manager creates a special discount to close a deal. The fact that the creation of the discount and the creation of the sale occurred as part of the same transaction cannot be captured, unless the source database is specifically designed to do so.
2. Value-change selection base on Timestamp
Change-value selection involves capturing the data on the source database by selecting the new and changed data from the source tables based on the value of a specific column. For example, suppose the source table has a LAST_UPDATE_DATE timestamp column. To capture changes, you base your selection from the source table on the LAST_UPDATE_DATE column value.
This method is a way we often use to do an incremental extraction from the source systems because of its simplicity, but it still has some shortcomings or restrictions:
• The overhead of capturing the change data must be borne on the source database, and you must run potentially expensive queries against the source table on the source database. The need for these queries may force you to add indexes that would otherwise be unneeded. There is no way to offload this overhead to the staging database.
• This method is no better at capturing intermediate values than the table differencing method. If the price in the product's table fluctuates, you will not be able to capture all the intermediate values, or even tell if the price had changed, if the ending value is the same as it was the last time that you captured change data.
• This method is also no better than the table differencing method at capturing which data changes were made together in the same transaction. If you need to capture information concerning which changes occurred together in the same transaction, you must include specific designs for this purpose in your source database.
• Some legacy source systems may not have this kind of timestamp columns to support this method.
3.Base on DBMS Log
By taking advantage of the data sent to the redo log files, change data is captured after a SQL statement that performs a DML operation is committed. In this mode, change data is not captured as part of the transaction that is modifying the source table, and therefore has no effect on that transaction. This kind of methods capture the change while they read and analyst the online log or archive log which record all the operations and the corresponding data values of the DBMS.
The following list describes the advantages of this model base on DBMS log:
• Completeness
This method can capture all effects of INSERT, UPDATE, and DELETE operations, including data values before and after UPDATE operations.
• Performance
This method can be configured to have minimal performance impact on the source database because it is neither a part of a transaction nor requiring a query on the tables or views.
But this model still has some disadvantages such as:
• Not real time
• Complex and depending on the tools or interfaces DBMS provides
Typical technology instance: Oracle 9i (or above) Asynchronous Change Data Capture, Oracle Streams (Oracle 10g), DB2 Replication Center
4. Base on Trigger Mechanism
Triggers on the source database allow change data to be captured immediately, as each SQL statement that performs a data manipulation language (DML) operation (INSERT, UPDATE, or DELETE) is made. In this mode, change data is captured as part of the transaction modifying the source table.
This method is often used when sending the data to ODS or Staging Database. It has several advantages listed as the following:
• Real time
• Completeness: Every change can be captured by the trigger on DML.
The obvious disadvantage is that it will impact and burden the performance of source system because it is a part of the transaction.
Typical technology instance: Oracle 9i (or above) Synchronous Change Data Capture, Oracle Streams (Oracle 10g)
5. Base on Replication Technology
Replication is the process of copying and maintaining database objects, such as tables, in multiple databases that make up a distributed database system. Changes applied at one site are captured and stored locally before being forwarded and applied at each of the remote locations. Although this technology is designed for sharing data among distributed databases, it can still be used to capture change data.
In SQL Server, this technology is basing on Publisher and Subscriber.
In Oracle 8i (or above), it includes Master Replication and Materialized View Replication. Oracle Streams can also support this technology.
Other tools: DB2 Replication Center.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
随着企业情况和对数据需求的多样性,很多方法会被用起来,技术也会随着发展。
2用的比较多。