How do I perform an ETL task?

Published 2024-07-12 06:32:05


I am supposed to perform ETL where the sources are a large and badly designed SQL Server 2000 database and a better-designed SQL Server 2005 database. I think SSIS is the way to go. Can anyone suggest a to-do list or a checklist of things to watch out for, so that I don't forget anything? How should I approach this so that it does not bite me in the rear later on?

Comments (5)

时光与爱终年不遇 2024-07-19 06:32:05


Some general ETL tips

  1. Consider organising it by destination (for example, all the code to produce the Customer dimension lives in the same module, regardless of source). This is sometimes known as subject-oriented ETL. It makes finding stuff much easier and will increase the maintainability of your code.

  2. If the SQL2000 database is a mess, you will probably find that SSIS data flows are a clumsy way to deal with the data. As a rule, ETL tools scale poorly with complexity; something like half of all data warehouse projects in finance companies are done with stored procedure code as an explicit architectural decision - for precisely this reason. If you have to put a large amount of code in sprocs, consider putting all of the code in sprocs.

    For a system involving lots of complex scrubbing or transformations, a 100% sproc approach is far more maintainable as it is the only feasible way to put all of the transformations and business logic in one place. With mixed ETL/sproc systems, you have to look in multiple places to track, troubleshoot, debug or change the whole transformation.

  3. The sweet spot of ETL tools is on systems where you have a larger number of data sources with relatively simple transformations.

  4. Make the code testable, so you can pick apart the components and test them in isolation. Code that can only be executed from within the middle of a complex data flow in an ETL tool is much harder to test.

  5. Make the data extract dumb, with no business logic, and copy it into a staging area. If you have business logic spread across the extract and transform layers, you will have transformations that cannot be tested in isolation, and it will be hard to track down bugs. If the transform runs from a staging area, you reduce the hard dependency on the source system, again enhancing testability. This is a particular win on sproc-based architectures, as it allows an almost completely homogeneous code base.

  6. Build a generic slowly-changing dimension handler, or use one off the shelf if available. This makes it easier to unit test this functionality. If this can be unit tested, the system testing does not have to cover all of the corner cases, merely whether the data presented to it is correct. This is not as complex as it sounds - the last one I wrote was about 600 or 700 lines of T-SQL code. The same goes for any generic scrubbing functions.

  7. Load incrementally if possible.

  8. Instrument your code - have it make log entries, possibly recording diagnostics such as check totals or counts. Without this, troubleshooting is next to impossible. Also, assertion checking is a good way to think about error handling (does the row count in A equal the row count in B; is the A:B relationship really 1:1?).

  9. Use synthetic keys. Using natural keys from the source systems ties your system to the data sources and makes it difficult to add extra sources. The keys and relationships in the system should always line up - no nulls. For errors or 'not recorded' values, make specific 'error' or 'not recorded' entries in the dimension table and match to them.

  10. If you build an Operational Data Store (the subject of many a religious war) do not recycle the ODS keys in the star schemas. By all means join on ODS keys to construct dimensions, but match on a natural key. This allows you to arbitrarily drop and recreate the ODS - possibly changing its structure - without disturbing the star schemas. Having this capability is a real maintenance win, as you can change ODS structure or do a brute-force re-deployment of the ODS at any point.

Points 1-2 and 4-5 mean that you can build a system where all of the code for any given subsystem (e.g. a single dimension or fact table) lives in one and only one place in the system. This type of architecture is also better for larger numbers of data sources.
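To make points 1 and 5 concrete, here is a minimal sketch of a "dumb" extract into a staging area. All object names (SourceDb, staging.Customer, dbo.Load_DimCustomer) are hypothetical placeholders, not anything from the original answer:

```sql
-- Dumb extract (point 5): a straight copy into staging, no business logic.
TRUNCATE TABLE staging.Customer;

INSERT INTO staging.Customer (CustomerCode, CustomerName, Region, ExtractedAt)
SELECT c.CustomerCode,
       c.CustomerName,
       c.Region,
       GETDATE()                      -- audit stamp: when the row was staged
FROM   SourceDb.dbo.Customer AS c;   -- no joins, no cleansing, no filtering

-- All scrubbing and business logic then lives in one separate, testable
-- module per destination (point 1), e.g. a hypothetical sproc
-- dbo.Load_DimCustomer that reads only from staging.Customer.
```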

Point 3 is a counterpoint to point 2. Basically the choice between SQL and ETL tooling is a function of transformation complexity and number of source systems. The simpler the data and larger the number of data sources, the stronger the case for a tools-based approach. The more complex the data, the stronger the case for moving to an architecture based on stored procedures. Generally it's better to exclusively or almost exclusively use one or the other but not both.

Point 6 makes your system easier to test. Testing SCDs or any change-based functionality is fiddly, as you have to be able to present more than one version of the source data to the system. If you move the change-management functionality into infrastructure code, you can test it in isolation with test data sets. This is a win in testing, as it reduces the complexity of your system testing requirements.
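As a rough illustration of what such a handler boils down to, here is a minimal type-2 SCD sketch in T-SQL. The tables and columns are hypothetical, and a production handler would also need NULL-safe comparisons and error handling:

```sql
DECLARE @LoadDate datetime;
SET @LoadDate = GETDATE();

-- Step 1: expire current rows whose tracked attributes have changed.
UPDATE d
SET    d.IsCurrent = 0,
       d.ValidTo   = @LoadDate
FROM   dbo.DimCustomer  AS d
JOIN   staging.Customer AS s ON s.CustomerCode = d.CustomerCode -- natural key
WHERE  d.IsCurrent = 1
  AND (s.CustomerName <> d.CustomerName OR s.Region <> d.Region);

-- Step 2: insert a fresh current row for new and changed customers.
INSERT INTO dbo.DimCustomer
       (CustomerCode, CustomerName, Region, ValidFrom, ValidTo, IsCurrent)
SELECT s.CustomerCode, s.CustomerName, s.Region, @LoadDate, '9999-12-31', 1
FROM   staging.Customer AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                   WHERE d.CustomerCode = s.CustomerCode
                     AND d.IsCurrent = 1);
```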

Point 7 is a general performance tip that you will need to observe for large data volumes. Note that you may only need incremental loading for some parts of a system; for smaller reference tables and dimensions you may not need it.
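For illustration, a minimal high-water-mark sketch of an incremental extract, assuming a hypothetical etl.LoadControl bookkeeping table and a trustworthy LastModified column on the source:

```sql
DECLARE @LastLoaded datetime, @Now datetime;
SET @Now = GETDATE();

SELECT @LastLoaded = LastLoadedAt
FROM   etl.LoadControl
WHERE  TableName = 'Customer';

INSERT INTO staging.Customer (CustomerCode, CustomerName, Region)
SELECT CustomerCode, CustomerName, Region
FROM   SourceDb.dbo.Customer
WHERE  LastModified >  @LastLoaded     -- changed since the previous run
  AND  LastModified <= @Now;           -- but not while this run is in flight

UPDATE etl.LoadControl
SET    LastLoadedAt = @Now
WHERE  TableName = 'Customer';
```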

Point 8 is germane to any headless process. If it falls over during the night, you want a fighting chance of seeing what went wrong the next day. If the code doesn't properly log what's going on and catch errors, you will have a much harder job troubleshooting it.
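A sketch of what that instrumentation might look like in T-SQL, with a hypothetical etl.RunLog table:

```sql
DECLARE @StagedRows int, @CurrentRows int;

SELECT @StagedRows = COUNT(*) FROM staging.Customer;

-- ... the DimCustomer transform/load runs here ...

SELECT @CurrentRows = COUNT(*) FROM dbo.DimCustomer WHERE IsCurrent = 1;

-- Log the counts so tonight's run can be compared with last night's.
INSERT INTO etl.RunLog (RunAt, StepName, RowsIn, RowsOut)
VALUES (GETDATE(), 'Load DimCustomer', @StagedRows, @CurrentRows);

-- Assertion: every staged customer should have exactly one current row.
IF @CurrentRows < @StagedRows
    RAISERROR('DimCustomer load lost rows: staged %d, current %d',
              16, 1, @StagedRows, @CurrentRows);
```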

Point 9 gives the data warehouse a life of its own. You can easily add and drop source systems when the warehouse has its own keys. Warehouse keys are also necessary to implement slowly changing dimensions.
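For example, a dimension owning its own surrogate key, with reserved members for unmatched values, might look like this sketch (all names hypothetical):

```sql
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- synthetic key
    CustomerCode varchar(20)  NOT NULL,   -- natural key, kept as an attribute
    CustomerName varchar(100) NOT NULL
);

-- Reserved members so fact rows never carry NULL dimension keys.
SET IDENTITY_INSERT dbo.DimCustomer ON;
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerCode, CustomerName)
VALUES (-1, 'N/A', 'Not recorded');
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerCode, CustomerName)
VALUES (-2, 'ERR', 'Error');
SET IDENTITY_INSERT dbo.DimCustomer OFF;

-- Fact loads then map failed dimension lookups to -1 or -2, never NULL.
```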

Point 10 is a maintenance and deployment win, as the ODS can be re-structured if you need to add new systems or change the cardinality of a record. It also means that a dimension can be loaded from more than one place in the ODS (think: adding manual accounting adjustments) without a dependency on the ODS keys.
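A brief sketch of the matching rule in point 10, with a hypothetical ods.Product table that carries its own OdsProductId surrogate:

```sql
-- Build the dimension from the ODS, matching on the natural key only;
-- OdsProductId is deliberately never copied into the star schema.
UPDATE d
SET    d.ProductName = s.ProductName
FROM   dbo.DimProduct AS d
JOIN   ods.Product    AS s ON s.ProductCode = d.ProductCode;  -- natural key

INSERT INTO dbo.DimProduct (ProductCode, ProductName)
SELECT s.ProductCode, s.ProductName
FROM   ods.Product AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimProduct AS d
                   WHERE d.ProductCode = s.ProductCode);

-- Because nothing downstream references OdsProductId, the ODS can be
-- dropped and rebuilt with new keys without touching the star schema.
```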

仄言 2024-07-19 06:32:05


I have experience with ETL processes pulling data from 200+ distributed databases to a central database on a daily, weekly, monthly and yearly basis. It is a massive amount of data and there are many issues we have had specific to our situation. But as I see it, there are several items to think about regardless of the situation:

  • Make sure that you take file locks into consideration, both on the source and destination side. Making sure that other processes do not have the files locked (and removing those locks if necessary and it makes sense) is important.

  • Lock the files for yourself. Make sure, especially on the source, that you lock the files while pulling out the data so that you do not get half-updated data.

  • If at all possible, pull deltas, not all of the data. Get a copy of the data and then pull only the rows that have changed instead of everything. The larger your data set, the more important this becomes. Look at journals and triggers if you have to, but as it becomes more important to have this data on a regular basis, this is probably the number one piece of advice I would give you, even if it adds a significant amount of time to the project. (A minimal sketch follows this list.)

  • Keep an execution log. Make sure you know when it worked and when it didn't; throwing specific errors in the process can really help in debugging.

  • Document, document, document. If you build this right, you are going to build it and then not think about it for a long time. But you can be guaranteed that you or someone else will need to come back to it at some point to enhance it or fix a bug. Documentation is key in these situations.
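To illustrate the delta-pull bullet, here is a minimal rowversion-based sketch. It assumes a hypothetical etl.ExtractMark bookkeeping table and a RowVer rowversion column on the source table, and it must run in the source database for @@DBTS to be meaningful:

```sql
DECLARE @LastMark binary(8), @NewMark binary(8);

SELECT @LastMark = LastRowVersion
FROM   etl.ExtractMark
WHERE  TableName = 'Orders';

SELECT @NewMark = @@DBTS;   -- highest rowversion issued in the current DB

-- Pull only rows whose rowversion falls inside the (last, new] window.
INSERT INTO staging.Orders (OrderId, CustomerCode, Amount)
SELECT OrderId, CustomerCode, Amount
FROM   dbo.Orders
WHERE  RowVer > @LastMark
  AND  RowVer <= @NewMark;

UPDATE etl.ExtractMark
SET    LastRowVersion = @NewMark
WHERE  TableName = 'Orders';
```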

HTH; I'll update this if I think of anything else.

零度℉ 2024-07-19 06:32:05


Well, I'm developing an ETL process for the company where I work.

We are working with SSIS, using the API to generate and build our own dtsx packages.

SSIS is not friendly for managing errors. Sometimes you get an "OleDb Error" that could have a lot of different meanings depending on the context.

Read the API documentation (it doesn't say much).

Some links to help you get started:
http://technet.microsoft.com/de-de/library/ms135932(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms345167.aspx

http://msdn.microsoft.com/en-us/library/ms403356.aspx

http://www.codeproject.com/KB/database/SSISProgramming.aspx?display=PrintAll&fid=382208&df=90&mpp=25&noise=3&sort=Position&view=Quick&fr=26&select=2551674

http://www.codeproject.com/KB/database/foreachadossis.aspx

http://wiki.sqlis.com/default.aspx/SQLISWiki/ComponentErrorCodes.html

http://www.new.facebook.com/inbox/readmessage.php?t=1041904880323#/home.php?ref=logo

http://technet.microsoft.com/en-us/library/ms187670.aspx

http://msdn.microsoft.com/ja-jp/library/microsoft.sqlserver.dts.runtime.foreachloop.foreachenumerator.aspx

http://www.sqlis.com/post/Handling-different-row-types-in-the-same-file.aspx

http://technet.microsoft.com/en-us/library/ms135967(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms137709(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms345164(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms141232.aspx

http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx

http://www.ivolva.com/ssis_code_generator.html

http://www.ivolva.com/ssis_wizards.html

http://www.codeplex.com/MSFTISProdSamples

http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/Q_23972361.html

http://forums.microsoft.com/MSDN/MigratedForum.aspx?siteid=1&PostID=1404157

http://msdn.microsoft.com/en-us/library/aa719592(VS.71).aspx

http://forums.microsoft.com/MSDN/MigratedForum.aspx?siteid=1&ForumID=80

http://blogs.conchango.com/jamiethomson/archive/2005/06/11/SSIS_3A00_-Custom-Logging-Using-Event-Handlers.aspx

http://blogs.conchango.com/jamiethomson/archive/2007/03/13/SSIS_3A00_-Property-Paths-syntax.aspx

http://search.live.com/results.aspx?q=%s&go=Buscar&form=QBJK&q1=macro%3Ajamiet.ssis

http://toddmcdermid.blogspot.com/2008/09/using-performupgrade.html?showComment=1224715020000

http://msdn.microsoft.com/en-us/library/ms136082.aspx

http://support.microsoft.com/kb/839279/en-us

Sorry for the "spam", but they are very useful to me.

儭儭莪哋寶赑 2024-07-19 06:32:05


We're doing a huge ETL (moving a client from legacy AS400 apps to Oracle EBS), and we actually have a process that (with modifications) I can recommend:

  1. Identify the critical target tables/fields.
  2. Identify the critical source tables/fields.
  3. Work with the business users to map source to target.
  4. Analyze the source data for quality issues.
  5. Determine who's responsible for the data quality issues identified.
  6. Have the responsible parties clean up the data in the source.
  7. Develop the actual ETL based on the information from steps 1 - 3.

The trickiest steps are 2 and 3 in my experience - it's sometimes difficult to get the business users to correctly identify all the bits they need in one pass, and it can be even harder to properly identify exactly where the data is coming from (though that may have something to do with the cryptic file and field names that I'm seeing!). However, this process should help you avoid major misses.
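For step 4, a few illustrative profiling queries (table and column names hypothetical) can surface quality issues before the mapping is finalised:

```sql
-- How many rows are missing the natural key?
SELECT COUNT(*) AS TotalRows,
       SUM(CASE WHEN CustomerCode IS NULL OR CustomerCode = ''
                THEN 1 ELSE 0 END) AS MissingCode
FROM   SourceDb.dbo.Customer;

-- Is the supposedly unique natural key actually unique?
SELECT   CustomerCode, COUNT(*) AS Duplicates
FROM     SourceDb.dbo.Customer
GROUP BY CustomerCode
HAVING   COUNT(*) > 1;

-- What does a "category" column really contain?
SELECT   Region, COUNT(*) AS Cnt
FROM     SourceDb.dbo.Customer
GROUP BY Region
ORDER BY Cnt DESC;
```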

弄潮 2024-07-19 06:32:05


This thread is old, but I want to draw your attention to ConcernedOfTunbridgeWells' answer. It is incredibly good advice, on all points. I could reiterate a few, but that would diminish the rest, and they all deserve close study.
