Avoiding writing SQL queries altogether in SSIS

Posted on 2024-09-01 18:47:17

On a data warehouse project I'm working on, the guy who gave us the tutorial advised us to stick to SQL queries rather than defining a lot of data flow transformations. His argument was that the transformations consume a lot of memory on the ETL box, so we'd rather leave the processing to the DB box. Is this really advisable? Where's the balance between relying on the GUI tools and executing a bunch of SQL scripts in your Integration Services package?

And honestly, I'd like to avoid writing SQL queries as much as I can. (But that's beside the point; I'd really like to look at this objectively.)

Comments (6)

半城柳色半声笛 2024-09-08 18:47:17

The answer is: it depends, but you want to pick one or the other for any given job and avoid mixing the two where possible.

Generally, it's best to either do everything possible within the tool or do everything possible within stored procedure code. When you have significant amounts of logic split between layers the system becomes harder to trace and debug.

  • Where the tool can do the transformations without the data flows becoming awkward and convoluted you could use the tool and try to have little or no logic in queries. This means that one single layer has the business logic and it should be fairly obvious where to find it. However, ETL tools tend to handle highly complex transformations relatively poorly. The sweet spot for this type of approach is on systems where you have a large number of data sources but relatively simple transformations.

  • If you have relatively complex transformations you may be better off putting all the business logic and transformation into a layer of stored procedures. SQL code is better at implementing complex transformations in a maintainable way - I have it on fairly good authority that around half of all data warehouse projects in the banking and insurance sectors use this type of architecture for precisely that reason.

    In this case the ETL tool can be used to implement relatively dumb data copies. Source data can be copied into staging areas essentially verbatim and then picked up by a body of stored procedure code that does the ETL. The ETL tool can be used for data copies, bulk load operations, logging, scheduling and other framework tasks.

In either case you're best off picking one approach. Otherwise, you can end up with business logic spread across extraction layers, database views, data flows, and stored procedure code. Logic spread across multiple layers is much harder to test.

When all of the logic is (for example) contained within stored procedures or focussed ETL transformation jobs you can unit test a given transformation in isolation. The clarity in design also helps with maintenance and auditing.
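As a rough illustration of the second approach, the sketch below copies source rows verbatim into a staging table and keeps the business logic in a single SQL statement that can be tested on its own. It uses Python's built-in `sqlite3` module only so that the example is self-contained. SQLite has no stored procedures, so the function `load_dim_customer` stands in for one, and all table and column names are hypothetical.

```python
import sqlite3


def create_schema(conn):
    # Staging holds source rows essentially verbatim; the dimension holds the cleaned result.
    conn.executescript(
        """
        CREATE TABLE stg_customer (customer_id INTEGER, name TEXT, country TEXT);
        CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
        """
    )


def load_dim_customer(conn):
    # Stand-in for a stored procedure: all transformation logic lives in this one statement.
    conn.execute(
        """
        INSERT INTO dim_customer (customer_id, name, country)
        SELECT customer_id, TRIM(name), UPPER(COALESCE(country, 'unknown'))
        FROM stg_customer
        WHERE customer_id IS NOT NULL
        """
    )


def run_example():
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    # The "dumb copy" step that the ETL tool would perform (hypothetical sample rows).
    conn.executemany(
        "INSERT INTO stg_customer VALUES (?, ?, ?)",
        [(1, "  Ada ", "gb"), (2, "Linus", None), (None, "orphan", "us")],
    )
    load_dim_customer(conn)
    return conn.execute(
        "SELECT customer_id, name, country FROM dim_customer ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_example())
```

Because the transformation is one function over one staging table, a test only needs to seed the staging rows and compare the dimension contents, with no ETL tool involved.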

很酷又爱笑 2024-09-08 18:47:17

I find that SQL code is not only faster to run, but also faster to develop and much, much easier to maintain.

剑心龙吟 2024-09-08 18:47:17

Generally, when you want to process each row individually, use a data flow; otherwise it may be better to use a SQL command.

Personally I'd go with writing the SQL where I can. It's easier to optimise later and (usually) faster as well. Google will give much more detailed answers.

Another factor to think about is the provider you use for your connections.

You need to make the decision based on your needs. We use a PostgreSQL database, so we have to create a load of staging tables for some processes, which speeds the whole thing up.

You should also take into consideration the boxes it is running on. If you have an all-powerful DB box and a little ETL box, there'd be no point in running anything heavy on the ETL box.

If you do all your processing on the ETL box you'll be dragging a lot of data across the network as well.

Check out these links to get you started:

ssistalk.com/category/ssis/ssis-advanced-techniques/

msdn.microsoft.com/en-us/library/ms141031.aspx

weblogs.sqlteam.com/jamesn/Default.aspx

偏爱自由 2024-09-08 18:47:17

I think this is a difficult question; and an interesting one as well.

IMHO, one reason to use SSIS is to improve maintainability. If you pack all the logic into SQL statements (and you sure can!), you tend to defeat that reason for using SSIS in the first place. You cannot really "see the data flow" anymore.

On the other hand, I feel there are times when a well-placed SQL statement has its value. For example, when you read data from a table and, for whatever reason, already know you will only ever need the rows satisfying condition X, I see no reason to read the whole table and then "conditional-split most of it away" in the next step.
What I do not know, by the way, is what this means in terms of performance. Is SSIS smart enough to see what is happening and change the "read-whole-table-and-conditional-split-it" into a "select Y from where X" on the fly (or when building/deploying)?
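To make the two alternatives concrete, here is a small sketch that counts how many rows leave the database in each case. It does not answer whether SSIS rewrites anything by itself; it only contrasts "read everything, then filter in the client" with "filter in the source query". It uses Python's built-in `sqlite3` module as a stand-in for the source database, and the `orders` table, its columns, and the row counts are invented for illustration.

```python
import sqlite3


def build_source(conn, n_rows=1000):
    # Hypothetical source table: every tenth order is "open", the rest are "closed".
    conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(i, "open" if i % 10 == 0 else "closed") for i in range(n_rows)],
    )


def read_all_then_split(conn):
    # Every row leaves the database; the filter runs afterwards in the client,
    # which is roughly what a full-table source followed by a conditional split does.
    rows = conn.execute("SELECT order_id, status FROM orders").fetchall()
    kept = [row for row in rows if row[1] == "open"]
    return len(rows), kept


def filter_in_query(conn):
    # The condition is part of the source query, so only matching rows are transferred.
    rows = conn.execute(
        "SELECT order_id, status FROM orders WHERE status = ?", ("open",)
    ).fetchall()
    return len(rows), rows


def compare():
    conn = sqlite3.connect(":memory:")
    build_source(conn)
    moved_a, kept_a = read_all_then_split(conn)
    moved_b, kept_b = filter_in_query(conn)
    # Rows transferred by each approach, and whether both keep the same rows.
    return moved_a, moved_b, sorted(kept_a) == sorted(kept_b)


if __name__ == "__main__":
    print(compare())
```

Both variants keep the same rows; the difference is only in how much data has to be read and moved before the filter is applied.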

The big question is where to draw the line, and this depends to a certain extent on the people working on your ETL process. If everyone who will ever support the process has known SQL from the start, you can afford more SQL in your ETL than if you have co-workers (or customers, or successors you care about) who hardly understand what is happening in all your SQL, let alone can change, improve, or add to it.

So I think the bottom line is that neither avoiding SQL entirely nor doing everything in SQL is the better choice. Try to make up some simple rules that fit your requirements and that everyone can live with, then follow them. This gets you the most value from using SSIS.

2024-09-08 18:47:17

SQL Server does some things well and other things not so well. I use SSIS to import data into or export data from SQL Server, and during the course of the move I use SSIS where it makes sense. I can easily do work on a per-row basis, which is not very efficient in SQL Server (cursors). To say that you shouldn't use transformations and data flows because they are too expensive on the ETL box is like saying "don't drive your car too fast, because it causes the engine to work". The purpose of an ETL tool like SSIS is to take some of the processing that SQL Server does not do well and move it to an engine that does.

删除→记忆 2024-09-08 18:47:17

You've got to use the right tool for the job. Generally, you do most things in SSIS, with certain things done in "pure" SQL.

For instance, in cases where you do a lot of UPDATEs (a table difference on a dimension table in a dimensional model, say), you really don't want to execute an UPDATE for each row. In this scenario, you do a regular insert into a temporary table and then do the UPDATE in SQL, joining on the appropriate keys.
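A minimal sketch of that pattern, again using Python's built-in `sqlite3` module so it can run anywhere: the changed rows are bulk-inserted into a temporary table, and a single set-based UPDATE then applies them. The `dim_product` and `stg_product` tables are hypothetical. The UPDATE is written with a correlated subquery because that form works on older SQLite versions; in SQL Server the same step would typically be an `UPDATE ... FROM ... JOIN` on the key.

```python
import sqlite3


def apply_changes_set_based(conn, changes):
    # Step 1: plain bulk insert of the changed rows into a temporary table.
    conn.execute(
        "CREATE TEMP TABLE stg_product (product_id INTEGER PRIMARY KEY, price REAL)"
    )
    conn.executemany("INSERT INTO stg_product VALUES (?, ?)", changes)
    # Step 2: one UPDATE for all rows, matched on the key, instead of one UPDATE per row.
    cur = conn.execute(
        """
        UPDATE dim_product
        SET price = (SELECT s.price FROM stg_product AS s
                     WHERE s.product_id = dim_product.product_id)
        WHERE EXISTS (SELECT 1 FROM stg_product AS s
                      WHERE s.product_id = dim_product.product_id)
        """
    )
    updated = cur.rowcount
    conn.execute("DROP TABLE stg_product")
    return updated


def run_example():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, price REAL)")
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)]
    )
    # Product 99 does not exist in the dimension, so it is ignored by the UPDATE.
    updated = apply_changes_set_based(conn, [(1, 11.0), (3, 33.0), (99, 1.0)])
    rows = conn.execute(
        "SELECT product_id, price FROM dim_product ORDER BY product_id"
    ).fetchall()
    return updated, rows


if __name__ == "__main__":
    print(run_example())
```

The data flow only has to do the insert into the temporary table; the database then handles the update as one statement.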
