Architecture for tracking data changes in an application DB for warehousing

Overview

I have an OLTP DB that stores application transaction data and acts as the source of truth for the current state of the application. I would like my DWH to store historical data so I can run analyses that compare previous states of the application to the current state. In Kimball terminology, I would like my dimensions to be Type 2 SCDs.

In the application DB, changes to the dimensions are not tracked but rather updated in place for efficiency reasons. So, if I were to ETL directly from my application DB once per day, I would potentially lose historical data. My question is: how can I track these data changes for warehousing purposes? Here are some options I have considered:
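
For context, whichever option I choose, the warehouse side would apply the same Type 2 logic: expire the current row and insert a new version whenever a tracked attribute changes. Here is a minimal sketch of what I mean, using pandas and a hypothetical customer dimension (the column names customer_id and tier are made up for illustration):

```python
import pandas as pd

def apply_scd2(dim: pd.DataFrame, incoming: pd.DataFrame, load_date: str) -> pd.DataFrame:
    """Expire changed rows and append new versions; unchanged rows are kept as-is.

    `dim` has columns: customer_id, tier, valid_from, valid_to, is_current.
    `incoming` has columns: customer_id, tier ('tier' is the tracked attribute here).
    """
    dim = dim.copy()
    active = dim[dim["is_current"]]

    # Find customers whose tracked attribute differs from the current version.
    merged = active.merge(incoming, on="customer_id", suffixes=("_old", "_new"))
    changed_ids = merged.loc[merged["tier_old"] != merged["tier_new"], "customer_id"]

    # Close out the current versions of rows that changed.
    expire_mask = dim["customer_id"].isin(changed_ids) & dim["is_current"]
    dim.loc[expire_mask, "valid_to"] = load_date
    dim.loc[expire_mask, "is_current"] = False

    # Insert new versions for changed rows plus brand-new customers.
    new_ids = set(changed_ids) | (set(incoming["customer_id"]) - set(active["customer_id"]))
    new_rows = incoming[incoming["customer_id"].isin(new_ids)].copy()
    new_rows["valid_from"] = load_date
    new_rows["valid_to"] = None
    new_rows["is_current"] = True
    return pd.concat([dim, new_rows], ignore_index=True)

# Usage: dim = apply_scd2(dim, todays_customer_extract, load_date="2025-01-12")
```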

Option #1: Ingest Event Data Into Data Lake

Whenever anything happens in the application, an event is emitted. These events can capture all the information I need for the warehouse. My thought is that you could emit the events using something like Apache Kafka and have some process listen for the events and store them in raw form in a completely immutable data lake. Then you would load the warehouse with an ETL process that works from the data lake instead of the application DB.
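
To make the idea concrete, here is a rough sketch of the listener process, assuming kafka-python and boto3. The topic name, bucket, event shape, and one-object-per-event layout are placeholders, not decided details; in practice the consumer would batch events into larger files.

```python
import json
from datetime import datetime, timezone

import boto3
from kafka import KafkaConsumer  # kafka-python

# Consume raw application events; commit offsets manually after a successful write.
consumer = KafkaConsumer(
    "app-events",                                   # assumed topic name
    bootstrap_servers=["localhost:9092"],
    group_id="lake-writer",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    enable_auto_commit=False,
)
s3 = boto3.client("s3")

for message in consumer:
    event = message.value
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"raw/app-events/dt={day}/{message.partition}-{message.offset}.json"
    # Store the event exactly as received; the lake stays raw and immutable.
    s3.put_object(Bucket="my-data-lake", Key=key, Body=json.dumps(event))
    consumer.commit()  # only advance the offset once the event is safely in S3
```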

Pros

  • Can achieve real-time data analysis if required in the future (currently not necessary)
  • An immutable data lake can act as a foundation for other types of analytics such as ML or other warehouses
  • The data lake serves as a single source of truth for all data, which will be nice in the future when there are multiple application DBs and other ingestion sources

Cons

  • Requires an event processing/streaming service which is more overhead to maintain
  • Data can be lost/duplicated causing the lake to not reflect the application DB
  • Requires storing data in two places, which is more development overhead

Option #2: Batch Process Application DB Snapshots

In this scenario, I would use the daily snapshots of the DB as a source for ETL'ing into the DWH. Historical data would be at the grain of how often the snapshot takes place (daily in this case). This means that any change that happens within the day would be lost. However, it may not be that important to store such fine-grained data anyway.

Pros

  • Data storage is not duplicated between two places
  • No extra infrastructure is required as daily snapshots are already automatically obtained and stored in S3
  • Data integrity is maintained because we're working directly with the current application state, so we can stay in sync better

Cons

  • Requires a delta calculation against the previous snapshot to determine which new dimension objects need to be imported (this may actually be required in any scenario, but it seems more natural with the event architecture; see the sketch after this list)
  • The grain of the historical data is coupled to the frequency at which snapshots occur
  • Only compatible with ETL'ing into the DWH and would not be as useful for ML/data science applications that work well with raw data
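
Regarding the delta calculation in the first con above, this is roughly what I picture, assuming both snapshots are Parquet files readable with pandas (reading s3:// paths needs s3fs installed; the table and column names are hypothetical):

```python
import pandas as pd

def snapshot_delta(previous: pd.DataFrame, current: pd.DataFrame,
                   key: str = "customer_id") -> pd.DataFrame:
    """Return rows of `current` that are new or changed relative to `previous`."""
    merged = current.merge(previous, on=key, how="left",
                           suffixes=("", "_prev"), indicator=True)
    is_new = merged["_merge"] == "left_only"

    # A row counts as changed if any non-key attribute differs from the previous snapshot.
    attr_cols = [c for c in current.columns if c != key]
    is_changed = pd.Series(False, index=merged.index)
    for col in attr_cols:
        is_changed |= (merged["_merge"] == "both") & (merged[col] != merged[f"{col}_prev"])

    return merged.loc[is_new | is_changed, list(current.columns)]

# Usage: diff yesterday's snapshot against today's, then feed the delta into the Type 2 load.
prev = pd.read_parquet("s3://app-snapshots/dt=2025-01-11/customers.parquet")
curr = pd.read_parquet("s3://app-snapshots/dt=2025-01-12/customers.parquet")
delta = snapshot_delta(prev, curr)
```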

Option #3: Batch Process Log Data

Instead of emitting an event to something like Apache Kafka, store the event data in a temporary log file on the application server. Then, the ETL process would involve scanning all the application servers to grab and process the log files. These log files would store extensive transaction history, enough to get all the data required for the DWH.
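
For illustration, assuming each application server appends one JSON object per line to a rotating log file, the parsing side could look roughly like this (the file paths and the staging step that copies files off the servers are placeholders):

```python
import json
from pathlib import Path
from typing import Iterator

def parse_event_log(path: Path) -> Iterator[dict]:
    """Yield events from a JSON-lines log file, skipping malformed lines."""
    with path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # A truncated line (e.g. from a crash mid-write) is dropped here;
                # in practice you would count and alert on these.
                continue

# The ETL job would first pull the files from each server (scp, a sidecar shipper,
# etc.) into a local staging directory and then parse them:
events = [e for p in Path("staging/").glob("*/events-*.log")
          for e in parse_event_log(p)]
```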

Pros

  • Logging events to a file is easy from a development standpoint on the application server
  • A full transaction history can be stored so we don't lose any information
  • Minimal performance impact on application server

Cons

  • Data reliability is lower than the other two options because application servers can be torn down at any moment to accommodate scaling, which would lead to log files getting lost (servers can also crash)
  • Processing log data requires extra parsing logic
  • ETL would need to work directly against server instances which may require service discovery
