ETL architecture
I've been asked to build an ETL-style application that transfers information from one data source to another. At the moment I've decided to use a three-layer architecture, but I would like to find out more about best practices and about the life cycle described on this Wikipedia page:
http://en.wikipedia.org/wiki/Extract,_transform,_load
Four-layered approach for ETL architecture design
- Functional layer: Core functional ETL processing (extract, transform, and load).
- Operational management layer: Job-stream definition and management, parameters, scheduling, monitoring, communication and alerting.
- Audit, balance and control (ABC) layer: Job-execution statistics, balancing and controls, rejects- and error-handling, codes management.
- Utility layer: Common components supporting all other layers.
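To make the four layers more concrete, here is a minimal sketch of how they might map onto code. Every name in it (`normalize`, `JobStats`, `run_job`, the sample rows) is made up for illustration, and a real system would split these across modules rather than one file.

```python
from dataclasses import dataclass


# Utility layer: a small helper shared by the other layers.
def normalize(text):
    return text.strip().lower()


# Functional layer: the extract, transform, and load work itself.
def extract():
    # Hypothetical source data; a real extract would read a file or database.
    return ["  Alice ", "BOB", ""]


def transform(rows):
    good = [normalize(r) for r in rows if r.strip()]
    rejected = [r for r in rows if not r.strip()]
    return good, rejected


def load(rows, target):
    target.extend(rows)


# Audit, balance and control (ABC) layer: job statistics and reject counts.
@dataclass
class JobStats:
    rows_read: int = 0
    rows_loaded: int = 0
    rows_rejected: int = 0

    def balanced(self):
        # Every row read must be accounted for as either loaded or rejected.
        return self.rows_read == self.rows_loaded + self.rows_rejected


# Operational management layer: defines the job stream and runs it.
def run_job(target):
    stats = JobStats()
    rows = extract()
    stats.rows_read = len(rows)
    good, rejected = transform(rows)
    stats.rows_rejected = len(rejected)
    load(good, target)
    stats.rows_loaded = len(good)
    return stats
```

Calling `run_job(target)` with an empty list fills the list and returns the statistics, so a scheduler or monitor could alert when `balanced()` is false.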
Real-life ETL cycle
The typical real-life ETL cycle consists of the following execution steps:
- Cycle initiation
- Build reference data
- Extract (from sources)
- Validate
- Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
- Stage (load into staging tables, if used)
- Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)
- Publish (to target tables)
- Archive
- Clean up
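The cycle above can be sketched as an ordered list of step functions that share one context. This is only an illustration under my own assumptions: the context dictionary, the country lookup, and the sample records are all hypothetical, and each step is reduced to a line or two.

```python
# Each step receives a shared context dict and updates it in place.
def cycle_initiation(ctx):
    ctx["run_id"] = 1


def build_reference_data(ctx):
    ctx["countries"] = {"DE": "Germany", "FR": "France"}


def extract(ctx):
    ctx["raw"] = [{"id": 1, "country": "DE"}, {"id": 2, "country": "XX"}]


def validate(ctx):
    known = ctx["countries"]
    ctx["valid"] = [r for r in ctx["raw"] if r["country"] in known]
    ctx["rejects"] = [r for r in ctx["raw"] if r["country"] not in known]


def transform(ctx):
    ctx["transformed"] = [
        {"id": r["id"], "country": ctx["countries"][r["country"]]}
        for r in ctx["valid"]
    ]


def stage(ctx):
    ctx["staging"] = list(ctx["transformed"])


def audit_report(ctx):
    ctx["audit"] = {
        "read": len(ctx["raw"]),
        "staged": len(ctx["staging"]),
        "rejected": len(ctx["rejects"]),
    }


def publish(ctx):
    ctx["target"].extend(ctx["staging"])


def archive(ctx):
    ctx["archive"] = list(ctx["raw"])


def clean_up(ctx):
    # Drop the working sets; the audit and archive entries are kept.
    for key in ("raw", "valid", "transformed", "staging"):
        del ctx[key]


STEPS = [
    cycle_initiation, build_reference_data, extract, validate, transform,
    stage, audit_report, publish, archive, clean_up,
]


def run_cycle(target):
    ctx = {"target": target}
    for step in STEPS:
        step(ctx)
    return ctx
```

Keeping the steps in a plain list makes the order explicit and leaves room to wrap each call with logging or timing later.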
Comments (2)
I don't know what your situation is or what your requirements are, but you're likely overthinking the problem.
The name alone is "the" architecture:
Exporting a DB table to a CSV can be considered "ET" while loading the CSV is the "L". Most ETL problems are simply not complicated.
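As a rough illustration of how small that can be, here is a sketch using only Python's standard `sqlite3` and `csv` modules. The function names are mine, SQLite stands in for whatever database you actually have, and the table name is interpolated into the SQL, so this assumes it comes from trusted code rather than user input.

```python
import csv
import sqlite3


def export_table_to_csv(conn, table, out):
    """The "ET" half: read every row of a table and write it as CSV."""
    # Assumption: `table` is a trusted identifier, not user input.
    cur = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur)


def load_csv_into_table(conn, table, src):
    """The "L" half: insert CSV rows into an existing table."""
    reader = csv.reader(src)
    header = next(reader)
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", reader
    )
    conn.commit()
```

Both functions take any file-like object, so the same code works with a file on disk or an in-memory `io.StringIO` buffer.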
Beyond that, you should grab any of the 1 or 2 million ETL and ESB packages already available in Java (free and commercial, libraries and full-boat processing systems) and simply adopt the one you like best.
Get a whiteboard, string some bubbles together with lines, and turn that into code.
The answer to "What's the best practice?" depends on what you are trying to accomplish.
To simplify, let's assume you are doing one of the following:
When I use the word "restructuring", I mean changing the grain or lowest level of detail of a table.
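A small sketch of what a change of grain looks like, assuming a hypothetical order-line table (one row per product on an order) that is rolled up to one row per order. The field names are invented for the example.

```python
from collections import defaultdict


def roll_up_to_order_grain(order_lines):
    """Change the grain from one row per order line to one row per order."""
    totals = defaultdict(lambda: {"line_count": 0, "amount": 0})
    for line in order_lines:
        order = totals[line["order_id"]]
        order["line_count"] += 1
        order["amount"] += line["quantity"] * line["unit_price"]
    # Emit one row per order, sorted by order_id for a stable result.
    return [{"order_id": oid, **vals} for oid, vals in sorted(totals.items())]
```

After the roll-up, line-level detail such as which product was bought is no longer recoverable from the result, which is why a change of grain is a heavier operation than a straight copy.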
For 1: the ten steps outlined in your question are generally followed. General best practices:
For 2: this is much more straightforward, so either method outlined in your question can be used.