How do I perform an ETL task?

Published 2024-07-12 06:32:05


I am supposed to perform ETL where the sources are a large and badly designed SQL Server 2000 database and a better-designed SQL Server 2005 database. I think SSIS is the way to go. Can anyone suggest a to-do list or a checklist of things to watch out for, so that I don't forget anything? How should I approach this so that it does not bite me in the rear later on?

Comments (5)

时光与爱终年不遇 2024-07-19 06:32:05


Some general ETL tips

  1. Consider organising it by destination (for example, all the code to produce the Customer dimension lives in the same module, regardless of source). This is sometimes known as subject-oriented ETL. It makes finding stuff much easier and will increase the maintainability of your code.

  2. If the SQL2000 database is a mess, you will probably find that SSIS data flows are a clumsy way to deal with the data. As a rule, ETL tools scale poorly with complexity; something like half of all data warehouse projects in finance companies are done with stored procedure code as an explicit architectural decision - for precisely this reason. If you have to put a large amount of code in sprocs, consider putting all of the code in sprocs.

    For a system involving lots of complex scrubbing or transformations, a 100% sproc approach is far more maintainable as it is the only feasible way to put all of the transformations and business logic in one place. With mixed ETL/sproc systems, you have to look in multiple places to track, troubleshoot, debug or change the whole transformation.

  3. The sweet spot of ETL tools is on systems where you have a larger number of data sources with relatively simple transformations.

  4. Make the code testable, so you can pick apart the components and test them in isolation. Code that can only be executed from within the middle of a complex data flow in an ETL tool is much harder to test.

  5. Make the data extract dumb, with no business logic, and copy it into a staging area. If you have business logic spread across the extract and transform layers, you will have transformations that cannot be tested in isolation, and it will be hard to track down bugs. If the transform runs from a staging area, you reduce the hard dependency on the source system, again enhancing testability. This is a particular win on sproc-based architectures, as it allows an almost completely homogeneous code base.

  6. Build a generic slowly-changing dimension handler, or use one off the shelf if available. This makes it easier to unit test this functionality. If this can be unit tested, the system testing does not have to cover all of the corner cases, merely whether the data presented to it is correct. This is not as complex as it sounds - the last one I wrote was about 600 or 700 lines of T-SQL code. The same goes for any generic scrubbing functions.

  7. Load incrementally if possible.

  8. Instrument your code - have it make log entries, possibly recording diagnostics such as check totals or counts. Without this, troubleshooting is next to impossible. Also, assertion checking is a good way to think about error handling (does the row count in A equal the row count in B; is the A:B relationship really 1:1?).

  9. Use synthetic keys. Using natural keys from the source systems ties your system to the data sources and makes it difficult to add extra sources. The keys and relationships in the system should always line up - no nulls. For errors or 'not recorded' values, make specific 'error' or 'not recorded' entries in the dimension table and match to them.

  10. If you build an Operational Data Store (the subject of many a religious war) do not recycle the ODS keys in the star schemas. By all means join on ODS keys to construct dimensions, but match on a natural key. This allows you to arbitrarily drop and recreate the ODS - possibly changing its structure - without disturbing the star schemas. Having this capability is a real maintenance win, as you can change ODS structure or do a brute-force re-deployment of the ODS at any point.

Points 1-2 and 4-5 mean that you can build a system where all of the code for any given subsystem (e.g. a single dimension or fact table) lives in one and only one place in the system. This type of architecture is also better for larger numbers of data sources.
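To make points 1 and 5 concrete, here is a minimal sketch of a "dumb" extract into a staging area. All object names (SourceDb, staging.Customer, dbo.Load_DimCustomer) are hypothetical placeholders, not anything from the original answer:

```sql
-- Dumb extract (point 5): a straight copy into staging, no business logic.
TRUNCATE TABLE staging.Customer;

INSERT INTO staging.Customer (CustomerCode, CustomerName, Region, ExtractedAt)
SELECT c.CustomerCode,
       c.CustomerName,
       c.Region,
       GETDATE()                      -- audit stamp: when the row was staged
FROM   SourceDb.dbo.Customer AS c;   -- no joins, no cleansing, no filtering

-- All scrubbing and business logic then lives in one separate, testable
-- module per destination (point 1), e.g. a hypothetical sproc
-- dbo.Load_DimCustomer that reads only from staging.Customer.
```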

Point 3 is a counterpoint to point 2. Basically the choice between SQL and ETL tooling is a function of transformation complexity and number of source systems. The simpler the data and larger the number of data sources, the stronger the case for a tools-based approach. The more complex the data, the stronger the case for moving to an architecture based on stored procedures. Generally it's better to exclusively or almost exclusively use one or the other but not both.

Point 6 makes your system easier to test. Testing SCDs or any change-based functionality is fiddly, as you have to be able to present more than one version of the source data to the system. If you move the change-management functionality into infrastructure code, you can test it in isolation with test data sets. This is a win in testing, as it reduces the complexity of your system testing requirements.
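As a rough illustration of what such a handler boils down to, here is a minimal type-2 SCD sketch in T-SQL. The tables and columns are hypothetical, and a production handler would also need NULL-safe comparisons and error handling:

```sql
DECLARE @LoadDate datetime;
SET @LoadDate = GETDATE();

-- Step 1: expire current rows whose tracked attributes have changed.
UPDATE d
SET    d.IsCurrent = 0,
       d.ValidTo   = @LoadDate
FROM   dbo.DimCustomer  AS d
JOIN   staging.Customer AS s ON s.CustomerCode = d.CustomerCode -- natural key
WHERE  d.IsCurrent = 1
  AND (s.CustomerName <> d.CustomerName OR s.Region <> d.Region);

-- Step 2: insert a fresh current row for new and changed customers.
INSERT INTO dbo.DimCustomer
       (CustomerCode, CustomerName, Region, ValidFrom, ValidTo, IsCurrent)
SELECT s.CustomerCode, s.CustomerName, s.Region, @LoadDate, '9999-12-31', 1
FROM   staging.Customer AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                   WHERE d.CustomerCode = s.CustomerCode
                     AND d.IsCurrent = 1);
```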

Point 7 is a general performance tip that you will need to observe for large data volumes. Note that you may only need incremental loading for some parts of a system; for smaller reference tables and dimensions you may not need it.
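For illustration, a minimal high-water-mark sketch of an incremental extract, assuming a hypothetical etl.LoadControl bookkeeping table and a trustworthy LastModified column on the source:

```sql
DECLARE @LastLoaded datetime, @Now datetime;
SET @Now = GETDATE();

SELECT @LastLoaded = LastLoadedAt
FROM   etl.LoadControl
WHERE  TableName = 'Customer';

INSERT INTO staging.Customer (CustomerCode, CustomerName, Region)
SELECT CustomerCode, CustomerName, Region
FROM   SourceDb.dbo.Customer
WHERE  LastModified >  @LastLoaded     -- changed since the previous run
  AND  LastModified <= @Now;           -- but not while this run is in flight

UPDATE etl.LoadControl
SET    LastLoadedAt = @Now
WHERE  TableName = 'Customer';
```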

Point 8 is germane to any headless process. If it falls over during the night, you want a fighting chance of seeing what went wrong the next day. If the code doesn't properly log what's going on and catch errors, you will have a much harder job troubleshooting it.
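A sketch of what that instrumentation might look like in T-SQL, with a hypothetical etl.RunLog table:

```sql
DECLARE @StagedRows int, @CurrentRows int;

SELECT @StagedRows = COUNT(*) FROM staging.Customer;

-- ... the DimCustomer transform/load runs here ...

SELECT @CurrentRows = COUNT(*) FROM dbo.DimCustomer WHERE IsCurrent = 1;

-- Log the counts so tonight's run can be compared with last night's.
INSERT INTO etl.RunLog (RunAt, StepName, RowsIn, RowsOut)
VALUES (GETDATE(), 'Load DimCustomer', @StagedRows, @CurrentRows);

-- Assertion: every staged customer should have exactly one current row.
IF @CurrentRows < @StagedRows
    RAISERROR('DimCustomer load lost rows: staged %d, current %d',
              16, 1, @StagedRows, @CurrentRows);
```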

Point 9 gives the data warehouse a life of its own. You can easily add and drop source systems when the warehouse has its own keys. Warehouse keys are also necessary to implement slowly changing dimensions.
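For example, a dimension owning its own surrogate key, with reserved members for unmatched values, might look like this sketch (all names hypothetical):

```sql
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- synthetic key
    CustomerCode varchar(20)  NOT NULL,   -- natural key, kept as an attribute
    CustomerName varchar(100) NOT NULL
);

-- Reserved members so fact rows never carry NULL dimension keys.
SET IDENTITY_INSERT dbo.DimCustomer ON;
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerCode, CustomerName)
VALUES (-1, 'N/A', 'Not recorded');
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerCode, CustomerName)
VALUES (-2, 'ERR', 'Error');
SET IDENTITY_INSERT dbo.DimCustomer OFF;

-- Fact loads then map failed dimension lookups to -1 or -2, never NULL.
```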

Point 10 is a maintenance and deployment win, as the ODS can be re-structured if you need to add new systems or change the cardinality of a record. It also means that a dimension can be loaded from more than one place in the ODS (think: adding manual accounting adjustments) without a dependency on the ODS keys.
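A brief sketch of the matching rule in point 10, with a hypothetical ods.Product table that carries its own OdsProductId surrogate:

```sql
-- Build the dimension from the ODS, matching on the natural key only;
-- OdsProductId is deliberately never copied into the star schema.
UPDATE d
SET    d.ProductName = s.ProductName
FROM   dbo.DimProduct AS d
JOIN   ods.Product    AS s ON s.ProductCode = d.ProductCode;  -- natural key

INSERT INTO dbo.DimProduct (ProductCode, ProductName)
SELECT s.ProductCode, s.ProductName
FROM   ods.Product AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimProduct AS d
                   WHERE d.ProductCode = s.ProductCode);

-- Because nothing downstream references OdsProductId, the ODS can be
-- dropped and rebuilt with new keys without touching the star schema.
```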

仄言 2024-07-19 06:32:05


I have experience with ETL processes pulling data from 200+ distributed databases to a central database on a daily, weekly, monthly and yearly basis. It is a massive amount of data and there are many issues we have had specific to our situation. But as I see it, there are several items to think about regardless of the situation:

  • Make sure that you take file locks into consideration, both on the source and destination side. Making sure that other processes do not have the files locked (and removing those locks if necessary and it makes sense) is important.

  • Lock the files for yourself. Make sure, especially on the source, that you lock the files while pulling out the data so that you do not get half-updated data.

  • If at all possible, pull deltas, not all of the data. Get a copy of the data and then pull only the rows that have changed instead of everything. The larger your data set, the more important this becomes. Look at journals and triggers if you have to, but as it becomes more important to have this data on a regular basis, this is probably the number one piece of advice I would give you, even if it adds a significant amount of time to the project. (A minimal sketch follows this list.)

  • Keep an execution log. Make sure you know when it worked and when it didn't; throwing specific errors in the process can really help in debugging.

  • Document, document, document. If you build this right, you are going to build it and then not think about it for a long time. But you can be guaranteed that you or someone else will need to come back to it at some point to enhance it or fix a bug. Documentation is key in these situations.
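To illustrate the delta-pull bullet, here is a minimal rowversion-based sketch. It assumes a hypothetical etl.ExtractMark bookkeeping table and a RowVer rowversion column on the source table, and it must run in the source database for @@DBTS to be meaningful:

```sql
DECLARE @LastMark binary(8), @NewMark binary(8);

SELECT @LastMark = LastRowVersion
FROM   etl.ExtractMark
WHERE  TableName = 'Orders';

SELECT @NewMark = @@DBTS;   -- highest rowversion issued in the current DB

-- Pull only rows whose rowversion falls inside the (last, new] window.
INSERT INTO staging.Orders (OrderId, CustomerCode, Amount)
SELECT OrderId, CustomerCode, Amount
FROM   dbo.Orders
WHERE  RowVer > @LastMark
  AND  RowVer <= @NewMark;

UPDATE etl.ExtractMark
SET    LastRowVersion = @NewMark
WHERE  TableName = 'Orders';
```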

HTH; I'll update this if I think of anything else.

零度℉ 2024-07-19 06:32:05


Well, I'm developing an ETL process for the company where I work.

We are working with SSIS, using the API to generate and build our own dtsx packages.

SSIS is not friendly for managing errors. Sometimes you get an "OleDb Error" that could have a lot of different meanings depending on the context.

Read the API documentation (it doesn't say much).

Some links to help you get started:
http://technet.microsoft.com/de-de/library/ms135932(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms345167.aspx

http://msdn.microsoft.com/en-us/library/ms403356.aspx

http://www.codeproject.com/KB/database/SSISProgramming.aspx?display=PrintAll&fid=382208&df=90&mpp=25&noise=3&sort=Position&view=Quick&fr=26&select=2551674

http://www.codeproject.com/KB/database/foreachadossis.aspx

http://wiki.sqlis.com/default.aspx/SQLISWiki/ComponentErrorCodes.html

http://www.new.facebook.com/inbox/readmessage.php?t=1041904880323#/home.php?ref=logo

http://technet.microsoft.com/en-us/library/ms187670.aspx

http://msdn.microsoft.com/ja-jp/library/microsoft.sqlserver.dts.runtime.foreachloop.foreachenumerator.aspx

http://www.sqlis.com/post/Handling-different-row-types-in-the-same-file.aspx

http://technet.microsoft.com/en-us/library/ms135967(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms137709(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms345164(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms141232.aspx

http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx

http://www.ivolva.com/ssis_code_generator.html

http://www.ivolva.com/ssis_wizards.html

http://www.codeplex.com/MSFTISProdSamples

http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/Q_23972361.html

http://forums.microsoft.com/MSDN/MigratedForum.aspx?siteid=1&PostID=1404157

http://msdn.microsoft.com/en-us/library/aa719592(VS.71).aspx

http://forums.microsoft.com/MSDN/MigratedForum.aspx?siteid=1&ForumID=80

http://blogs.conchango.com/jamiethomson/archive/2005/06/11/SSIS_3A00_-Custom-Logging-Using-Event-Handlers.aspx

http://blogs.conchango.com/jamiethomson/archive/2007/03/13/SSIS_3A00_-Property-Paths-syntax.aspx

http://search.live.com/results.aspx?q=%s&go=Buscar&form=QBJK&q1=macro%3Ajamiet.ssis

http://toddmcdermid.blogspot.com/2008/09/using-performupgrade.html?showComment=1224715020000

http://msdn.microsoft.com/en-us/library/ms136082.aspx

http://support.microsoft.com/kb/839279/en-us

Sorry for the "spam", but they are very useful to me.

儭儭莪哋寶赑 2024-07-19 06:32:05


We're doing a huge ETL (moving a client from legacy AS400 apps to Oracle EBS), and we actually have a process that (with modifications) I can recommend:

  1. Identify the critical target tables/fields.
  2. Identify the critical source tables/fields.
  3. Work with the business users to map source to target.
  4. Analyze the source data for quality issues.
  5. Determine who's responsible for the data quality issues identified.
  6. Have the responsible parties clean up the data in the source.
  7. Develop the actual ETL based on the information from steps 1 - 3.

The trickiest steps are 2 and 3 in my experience - it's sometimes difficult to get the business users to correctly identify all the bits they need in one pass, and it can be even harder to properly identify exactly where the data is coming from (though that may have something to do with the cryptic file and field names that I'm seeing!). However, this process should help you avoid major misses.
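For step 4, a few illustrative profiling queries (table and column names hypothetical) can surface quality issues before the mapping is finalised:

```sql
-- How many rows are missing the natural key?
SELECT COUNT(*) AS TotalRows,
       SUM(CASE WHEN CustomerCode IS NULL OR CustomerCode = ''
                THEN 1 ELSE 0 END) AS MissingCode
FROM   SourceDb.dbo.Customer;

-- Is the supposedly unique natural key actually unique?
SELECT   CustomerCode, COUNT(*) AS Duplicates
FROM     SourceDb.dbo.Customer
GROUP BY CustomerCode
HAVING   COUNT(*) > 1;

-- What does a "category" column really contain?
SELECT   Region, COUNT(*) AS Cnt
FROM     SourceDb.dbo.Customer
GROUP BY Region
ORDER BY Cnt DESC;
```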

弄潮 2024-07-19 06:32:05


This thread is old, but I want to draw your attention to ConcernedOfTunbridgeWells' answer. It is incredibly good advice, on all points. I could reiterate a few, but that would diminish the rest, and they all deserve close study.
