After working extensively with both ETL and ELT, I have come to the conclusion that you should avoid ELT at all costs. ETL prepares the data for your warehouse before you actually load it. ELT, however, loads the raw data into the warehouse, and you transform it in place. That is problematic if you have a busy data warehouse. If a reporting query is running on a table that you are attempting to update, your update will get blocked. Consequently, reporting queries can hold up or block updates.
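To make the distinction concrete, here is a minimal sketch of the two load styles. It uses Python's built-in `sqlite3` as a stand-in for the warehouse, and the table names (`orders_etl`, `orders_elt`) and the clean-up rule (trim and upper-case the customer name) are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical raw rows: (id, customer, amount); the customer name needs cleaning.
RAW_ROWS = [(1, " alice ", 10.5), (2, "Bob", 3.0)]


def etl_load(conn, raw_rows):
    """ETL: transform outside the warehouse, then load only finished rows."""
    cleaned = [(i, name.strip().upper(), amount) for i, name, amount in raw_rows]
    conn.execute("CREATE TABLE orders_etl (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)
    conn.commit()


def elt_load(conn, raw_rows):
    """ELT: load the raw rows first, then transform them in place with SQL."""
    conn.execute("CREATE TABLE orders_elt (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_elt VALUES (?, ?, ?)", raw_rows)
    conn.commit()
    # This UPDATE runs inside the warehouse, on a table reports may be reading.
    conn.execute("UPDATE orders_elt SET customer = upper(trim(customer))")
    conn.commit()
```

Both functions end with the same rows; the difference is where the work happens. In `elt_load` the in-place `UPDATE` is the statement that competes with reporting queries.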
Now some of you might say that reporting queries do not need to block an update, and that you can set your isolation level to allow dirty reads. Reporting queries, however, are not generally executed by software engineers. They are executed by business users, so you can't rely on them to set their isolation levels properly. Also, not all reports can tolerate dirty reads.
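The contention itself can be reproduced on a small scale. The sketch below uses a SQLite file in its default rollback-journal mode purely as a stand-in; that is an assumption, and a real warehouse engine has its own locking and isolation rules. The `sales` table and the function name are hypothetical. A "report" session keeps a read transaction open while a "loader" session tries to commit an update.

```python
import sqlite3


def demo_blocked_update(db_path):
    """Return (blocked, final_amount) for a loader competing with a report."""
    # Set up a tiny "warehouse" table.
    setup = sqlite3.connect(db_path)
    setup.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
    setup.execute("INSERT INTO sales VALUES (1, 10.0)")
    setup.commit()
    setup.close()

    # Reporting session: an open read transaction keeps a shared lock.
    report = sqlite3.connect(db_path, isolation_level=None)
    report.execute("BEGIN")
    report.execute("SELECT sum(amount) FROM sales").fetchall()

    # Loader session: timeout=0 means "fail immediately instead of waiting".
    loader = sqlite3.connect(db_path, timeout=0)
    loader.execute("UPDATE sales SET amount = amount * 2 WHERE id = 1")
    try:
        loader.commit()
        blocked = False
    except sqlite3.OperationalError:
        # The commit could not get an exclusive lock while the report was open.
        blocked = True

    # Once the report ends its transaction, the same commit can be retried.
    report.execute("COMMIT")
    loader.commit()
    final = loader.execute("SELECT amount FROM sales WHERE id = 1").fetchone()[0]
    report.close()
    loader.close()
    return blocked, final
```

With a longer timeout the loader would wait instead of failing, which is the "held up" case from the answer above.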
There are cases where ELT can work; however, introducing it to your data warehouse is dangerous, and consequently, for your sanity and for maintainability, I recommend avoiding it.
Which is better is hard to answer; it depends on the problem.
I prefer multi-step ETL, namely ECCD (Extract, Clean, Conform, Deliver), whenever possible. I also keep intermediate CSV files after each extract, clean, and conform step; this takes some disk space but is quite useful. Whenever the DW has to be reloaded due to bugs in the ETL or DW schema changes, there is no need to query the source systems again, because the data is already in flat files. It is also quite convenient to be able to grep, sed, and awk through flat files in the staging area when needed. When several source systems feed into the same DW, only the extract steps have to be developed (and maintained) for each source system; the clean, conform, and deliver steps are all common.
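A rough sketch of such a staged pipeline follows, assuming a made-up three-column record layout (`customer_id`, `country`, `amount`), hypothetical file names, and SQLite standing in for the DW. Each step reads the previous step's CSV and writes its own, so any step can be rerun from the files on disk.

```python
import csv
import sqlite3
from pathlib import Path

FIELDS = ["customer_id", "country", "amount"]  # hypothetical record layout


def write_csv(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def extract(source_rows, staging):
    # The only source-specific step; it just snapshots the source to a flat file.
    out = staging / "1_extract.csv"
    write_csv(out, source_rows)
    return out


def clean(extracted, staging):
    # Trim whitespace and drop rows that have no amount.
    rows = [{k: v.strip() for k, v in r.items()}
            for r in read_csv(extracted) if r["amount"].strip()]
    out = staging / "2_clean.csv"
    write_csv(out, rows)
    return out


def conform(cleaned, staging):
    # Bring values to the shared convention (here: upper-case country codes).
    rows = read_csv(cleaned)
    for r in rows:
        r["country"] = r["country"].upper()
    out = staging / "3_conform.csv"
    write_csv(out, rows)
    return out


def deliver(conformed, conn):
    # Load the finished rows into the warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales "
                 "(customer_id TEXT, country TEXT, amount REAL)")
    rows = [(r["customer_id"], r["country"], float(r["amount"]))
            for r in read_csv(conformed)]
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def run_pipeline(source_rows, staging, conn):
    staging.mkdir(parents=True, exist_ok=True)
    return deliver(conform(clean(extract(source_rows, staging), staging), staging), conn)
```

If the DW has to be rebuilt, `deliver(staging / "3_conform.csv", conn)` can be called on its own without touching the source system again.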
I use both. It's simply a matter of convenience and functionality. It all depends on the case. Sometimes I do TEL - i.e. the transform is done in the source database (in a stored procedure or view) and then extracted and loaded directly.
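As an illustration of the TEL idea, the sketch below puts the transform into a view in the source database and then copies the view's output straight into the target. Both databases are in-memory SQLite stand-ins, and the table and view names (`sales`, `v_sales_by_region`, `sales_by_region`) are hypothetical.

```python
import sqlite3


def build_source():
    """Create a source database whose transform lives in a view."""
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    src.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10.0), ("north", 5.0), ("south", 7.5)],
    )
    # T: the transform is defined once, inside the source system.
    src.execute(
        "CREATE VIEW v_sales_by_region AS "
        "SELECT upper(region) AS region_code, sum(amount) AS total "
        "FROM sales GROUP BY upper(region)"
    )
    src.commit()
    return src


def extract_and_load(src, dw):
    """E and L: read the already-transformed rows and insert them as-is."""
    rows = src.execute(
        "SELECT region_code, total FROM v_sales_by_region ORDER BY region_code"
    ).fetchall()
    dw.execute("CREATE TABLE IF NOT EXISTS sales_by_region (region_code TEXT, total REAL)")
    dw.executemany("INSERT INTO sales_by_region VALUES (?, ?)", rows)
    dw.commit()
    return rows
```

The trade-off is that the transform now consumes resources on the source system rather than on an ETL server or the warehouse.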
I prefer ELT. One can say it goes against the norm. It does require a change in mentality and design approach compared with traditional methods. But it utilizes existing hardware and skill sets, further reducing cost and risk in the development process.
If we want to ensure referential integrity in the ETL approach, data must be downloaded from the target to the ETL server (engine). We don't need to do that in the ELT approach.
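A small sketch of what this can look like when the check runs inside the target: the raw rows are already in a staging table, so a join against the dimension finds orphaned keys without copying the dimension anywhere. SQLite stands in for the warehouse here, and the table names (`stg_orders`, `dim_customer`, `fact_orders`) are assumptions for illustration.

```python
import sqlite3


def build_warehouse():
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY)")
    dw.execute("CREATE TABLE stg_orders (order_id INTEGER, customer_id INTEGER)")
    dw.execute("CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER)")
    dw.executemany("INSERT INTO dim_customer VALUES (?)", [(1,), (2,)])
    # Raw load: order 102 points at a customer that does not exist.
    dw.executemany("INSERT INTO stg_orders VALUES (?, ?)",
                   [(100, 1), (101, 2), (102, 99)])
    dw.commit()
    return dw


def find_orphans(dw):
    """Staged rows whose customer_id has no match in the dimension."""
    return dw.execute(
        "SELECT s.order_id, s.customer_id "
        "FROM stg_orders AS s "
        "LEFT JOIN dim_customer AS d ON d.customer_id = s.customer_id "
        "WHERE d.customer_id IS NULL "
        "ORDER BY s.order_id"
    ).fetchall()


def load_valid_orders(dw):
    """Move only rows with a matching dimension key into the fact table."""
    dw.execute(
        "INSERT INTO fact_orders (order_id, customer_id) "
        "SELECT s.order_id, s.customer_id "
        "FROM stg_orders AS s "
        "JOIN dim_customer AS d ON d.customer_id = s.customer_id"
    )
    dw.commit()
```

Both the check and the load are plain SQL executed by the warehouse engine, which is the hardware and skill-set reuse this answer argues for.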
To get the best from an ELT approach requires an open mind.