ETL architecture

Posted on 2024-10-21 18:28:41

I've been asked to make an ETL-style application that transfers information from one data source to another. At the moment, I've decided to use a three-layer architecture, but I would like to find out more about best practices, as well as the life cycle described on this Wikipedia page:

http://en.wikipedia.org/wiki/Extract,_transform,_load

Four-layered approach for ETL architecture design

  • Functional layer: Core functional ETL processing (extract, transform, and load).
  • Operational management layer: Job-stream definition and management, parameters, scheduling, monitoring, communication and alerting.
  • Audit, balance and control (ABC) layer: Job-execution statistics, balancing and controls, rejects- and error-handling, codes management.
  • Utility layer: Common components supporting all other layers.
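
To make the ABC layer a bit more concrete, here is a minimal Python sketch of how job statistics, a balance check, and reject handling could wrap the functional layer. The names `JobStats` and `run_job` are made up for illustration and do not come from the Wikipedia article or any particular tool.

```python
import time
from dataclasses import dataclass, field


@dataclass
class JobStats:
    """Audit record for one job run (hypothetical structure)."""
    job_name: str
    rows_read: int = 0
    rows_loaded: int = 0
    rows_rejected: int = 0
    errors: list = field(default_factory=list)
    seconds: float = 0.0

    def balanced(self) -> bool:
        # Balance check: every row read was either loaded or rejected.
        return self.rows_read == self.rows_loaded + self.rows_rejected


def run_job(job_name, rows, transform, load):
    """Run transform + load (functional layer) with ABC bookkeeping around it."""
    stats = JobStats(job_name)
    started = time.perf_counter()
    for row in rows:
        stats.rows_read += 1
        try:
            load(transform(row))
            stats.rows_loaded += 1
        except (ValueError, KeyError) as exc:
            # Reject handling: record the bad row instead of aborting the job.
            stats.rows_rejected += 1
            stats.errors.append((row, str(exc)))
    stats.seconds = time.perf_counter() - started
    return stats


# Usage: one of the three rows cannot be converted and ends up as a reject.
target = []
stats = run_job(
    "demo",
    rows=[{"amount": "10"}, {"amount": "oops"}, {"amount": "5"}],
    transform=lambda r: {"amount": int(r["amount"])},
    load=target.append,
)
print(stats.rows_loaded, stats.rows_rejected, stats.balanced())
```

Scheduling, alerting, and the shared utilities would live outside this function; the point is only that the audit bookkeeping stays separate from the transform and load callables.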

Real-life ETL cycle

The typical real-life ETL cycle consists of the following execution steps:

  1. Cycle initiation
  2. Build reference data
  3. Extract (from sources)
  4. Validate
  5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
  6. Stage (load into staging tables, if used)
  7. Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)
  8. Publish (to target tables)
  9. Archive
  10. Clean up
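
The cycle above can be sketched as an ordered list of step functions that share a context dictionary. This is only an illustration under simple assumptions: the data is in memory, and every function name, field, and sample row is hypothetical.

```python
def build_reference_data(ctx):
    ctx["valid_countries"] = {"DE", "FR"}


def extract(ctx):
    # Stand-in for reading from a real source system.
    ctx["rows"] = [
        {"id": 1, "country": "de", "amount": "10.5"},
        {"id": 2, "country": "xx", "amount": "3"},
        {"id": 3, "country": "FR", "amount": "7"},
    ]


def validate(ctx):
    good, rejects = [], []
    for row in ctx["rows"]:
        ok = row["country"].upper() in ctx["valid_countries"]
        (good if ok else rejects).append(row)
    ctx["rows"], ctx["rejects"] = good, rejects


def transform(ctx):
    ctx["rows"] = [
        {"id": r["id"], "country": r["country"].upper(), "amount": float(r["amount"])}
        for r in ctx["rows"]
    ]


def stage(ctx):
    ctx["staging"] = list(ctx["rows"])


def audit(ctx):
    ctx["audit"] = {"staged": len(ctx["staging"]), "rejected": len(ctx["rejects"])}


def publish(ctx):
    ctx["target"] = list(ctx["staging"])


def archive(ctx):
    ctx["archive"] = list(ctx["staging"])


def clean_up(ctx):
    # Drop working data; target, archive and audit results remain.
    ctx.pop("staging", None)
    ctx.pop("rows", None)


CYCLE = [build_reference_data, extract, validate, transform,
         stage, audit, publish, archive, clean_up]


def run_cycle():
    ctx = {}  # cycle initiation: start with an empty context
    for step in CYCLE:
        step(ctx)
    return ctx


result = run_cycle()
print(result["audit"])
```

Keeping each step as its own function makes it easy to drop the ones you don't need (for example staging) without touching the rest.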

Comments (2)

Bonjour°[大白 2024-10-28 18:28:41

I don't know what your situation is or what your requirements are, but you're likely overthinking the problem.

The name alone is "the" architecture:

  • Extract
  • Transform
  • Load

Exporting a DB table to a CSV can be considered "ET" while loading the CSV is the "L". Most ETL problems are simply not complicated.
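
As a rough illustration of that split, here is a small Python sketch using only the standard library (`sqlite3`, `csv`). The in-memory databases, the `customers` table, and the helper names are assumptions made up for the example, not anything from the question.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out):
    """'ET': read every row of a table and write it out as CSV with a header."""
    cur = conn.execute(f"SELECT * FROM {table}")  # table name is trusted here
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)


def load_csv(conn, table, src):
    """'L': insert CSV rows into a table whose columns match the header."""
    reader = csv.reader(src)
    header = next(reader)
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", reader
    )
    conn.commit()


# Two in-memory databases stand in for the real source and target.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Linus")])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER, name TEXT)")

buffer = io.StringIO()  # a file on disk would work the same way
export_table_to_csv(source, "customers", buffer)
buffer.seek(0)
load_csv(target, "customers", buffer)
```

Everything comes back from the CSV as text; in this sketch SQLite's column affinity turns the `id` values back into integers, but with another target you would likely need an explicit conversion step.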

Beyond that, you should grab any of the 1 or 2 million ETL and ESB packages already available in Java, free or commercial, from libraries to full-blown processing systems, and simply adopt the one you like best.

Get a whiteboard, string some bubbles together with lines, and turn that into code.

话少情深 2024-10-28 18:28:41

The answer to "What's the best practice?" depends on what you are trying to accomplish.

To simplify, let's assume you are doing one of the following:

  1. You are building a data warehouse that will restructure the data in some way
  2. You are moving data from point A to point B, but you are not restructuring the data

When I use the word "restructuring", I mean changing the grain or lowest level of detail of a table.

For 1: the ten steps outlined in your question are generally followed. General best practices:

  • Push as much transformation logic as possible onto database resources rather than ETL software (ETL software is generally slower)
  • Use the Validate, Transform, and Audit steps to apply whatever Master Data Management (MDM) standards your organization uses
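
To illustrate the first point, here is a hedged sketch in which a change of grain (individual sales rolled up to one row per day and store) is done by the database in a single `INSERT ... SELECT`, rather than by pulling rows into the ETL process and aggregating there. It uses SQLite in memory, and the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_sales (sale_date TEXT, store TEXT, amount REAL);
    CREATE TABLE daily_store_sales (
        sale_date TEXT, store TEXT, total REAL, n_sales INTEGER
    );
""")
conn.executemany(
    "INSERT INTO staging_sales VALUES (?, ?, ?)",
    [
        ("2024-01-01", "A", 10.0),
        ("2024-01-01", "A", 5.0),
        ("2024-01-01", "B", 2.5),
        ("2024-01-02", "A", 4.0),
    ],
)

# The transformation runs inside the database: the grain changes from
# one row per sale to one row per (sale_date, store).
conn.execute("""
    INSERT INTO daily_store_sales (sale_date, store, total, n_sales)
    SELECT sale_date, store, SUM(amount), COUNT(*)
    FROM staging_sales
    GROUP BY sale_date, store
""")
conn.commit()

rows = conn.execute(
    "SELECT * FROM daily_store_sales ORDER BY sale_date, store"
).fetchall()
for row in rows:
    print(row)
```

The ETL code here only issues the statement; no row data crosses into the Python process until the final check.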

For 2: this is much more straightforward, so either method outlined in your question can be used.
