Strategies for populating a reporting / data warehouse database

Posted 2024-08-17 01:12:51

For our reporting application, we have a process that aggregates several databases into a single 'reporting' database on a nightly basis. The schema of the reporting database is quite different from that of the separate 'production' databases that we are aggregating, so there is a good amount of business logic that goes into how the data is aggregated.

Right now this process is implemented by several stored procedures that run nightly. As we add more details to the reporting database the logic in the stored procedures keeps growing more fragile and unmanageable.

What are some other strategies that could be used to populate this reporting database?

  • SSIS? This has been considered but doesn't appear to offer a much cleaner, more maintainable approach than just the stored procedures.
  • A separate C# (or whatever language) process that aggregates the data in memory and then pushes it into the reporting database? This would allow us to write unit tests for the logic and organize the code in a much more maintainable manner.

I'm looking for any new ideas or additional thoughts on the above. Thanks!


Comments (4)

娇女薄笑 2024-08-24 01:12:52

Our general process is:

  1. Copy data from the source table(s) into tables with exactly the same structure in a loading database
  2. Transform the data into staging tables, which have the same structure as the final fact/dimension tables
  3. Copy data from the staging tables to the fact/dimension tables
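As a rough illustration of those layers, here is a minimal T-SQL sketch: a loading table that mirrors the source exactly, and a staging table that mirrors the final fact table. All schema, table and column names are invented for the example (and the separate loading database is shown as a schema for brevity).

    -- Step 1 target: same structure as the source system's table.
    CREATE TABLE loading.Sales (
        SaleId     int,
        ProductId  int,
        RegionCode varchar(10),
        SoldAt     datetime,
        SaleAmount decimal(18, 2)
    );

    -- Step 2 target: same structure as the final fact table (dw.FactSales).
    CREATE TABLE staging.FactSales (
        DateKey    int,
        ProductKey int,
        RegionKey  int,
        Amount     decimal(18, 2)
    );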

SSIS is good for step 1, which is more or less a 1:1 copy process, with some basic data type mappings and string transformations.

For step 2, we use a mix of stored procs, .NET and Python. Most of the logic is in procedures, with things like heavy parsing in external code. The major benefit of pure TSQL is that very often the transformations depend on other data in the loading database; e.g. using mapping tables in a SQL JOIN is much faster than doing a row-by-row lookup process in an external script, even with caching. Admittedly, that's just my experience, and procedural processing might be better for your data set.
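As a sketch of that point (reusing the hypothetical tables above, plus invented map.Product, map.Region and dim.Date lookup tables), the set-based version of the step 2 transform is a single statement:

    -- The mapping tables are resolved in the JOINs in one set-based pass,
    -- instead of a row-by-row lookup from an external script.
    INSERT INTO staging.FactSales (DateKey, ProductKey, RegionKey, Amount)
    SELECT d.DateKey,
           pm.ProductKey,   -- map.Product translates source IDs to warehouse keys
           rm.RegionKey,
           ld.SaleAmount
    FROM   loading.Sales AS ld
    JOIN   map.Product   AS pm ON pm.SourceProductId  = ld.ProductId
    JOIN   map.Region    AS rm ON rm.SourceRegionCode = ld.RegionCode
    JOIN   dim.Date      AS d  ON d.CalendarDate = CAST(ld.SoldAt AS date);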

In a few cases we do have to do some complex parsing (of DNA sequences) and TSQL is just not a viable solution. So that's where we use external .NET or Python code to do the work. I suppose we could do it all in .NET procedures/functions and keep it in the database, but there are other external connections required, so a separate program makes sense.

Step 3 is a series of INSERT... SELECT... statements: it's fast.
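Continuing the same hypothetical schema, step 3 can be as simple as:

    -- Plain INSERT...SELECT from staging into the fact table; the transaction
    -- keeps a failed load from leaving the fact table half-populated.
    BEGIN TRANSACTION;

    INSERT INTO dw.FactSales (DateKey, ProductKey, RegionKey, Amount)
    SELECT DateKey, ProductKey, RegionKey, Amount
    FROM   staging.FactSales;

    COMMIT TRANSACTION;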

So all in all, use the best tool for the job, and don't worry about mixing things up. An SSIS package - or packages - is a good way to link together stored procedures, executables and whatever else you need to do, so you can design, execute and log the whole load process in one place. If it's a huge process, you can use subpackages.

I know what you mean about TSQL feeling awkward (actually, I find it more repetitive than anything else), but it is very, very fast for data-driven operations. So my feeling is, do data processing in TSQL and string processing or other complex operations in external code.

若有似无的小暗淡 2024-08-24 01:12:52

I would take another look at SSIS. While there is a learning curve, it can be quite flexible. It supports many different ways to manipulate data, including stored procedures, ActiveX scripts and various file operations. It has the ability to handle errors and provide notifications via email or logging. Basically, it should be able to handle just about everything. The other option, a custom application, is probably going to be a lot more work (SSIS already has a lot of the basics covered) and is still going to be fragile - any changes to data structures will require a recompile and redeployment. I think a change to your SSIS package would probably be easier to make. For some of the more complicated logic you might even need to use multiple stages - a custom C# console program to manipulate the data a bit and then an SSIS package to load it into the database.

SSIS is a bit painful to learn and there are definitely some tricks to getting the most out of it, but I think it's a worthwhile investment. A good reference book or two would help (Wrox's Expert SQL Server 2005 Integration Services isn't bad).

琉璃繁缕 2024-08-24 01:12:52

I'd look at ETL (extract/transform/load) best practices. You're asking about buying vs. building, a specific product, and a specific technique. It's probably worthwhile to back up a few steps first.

A few considerations:

  • There are a lot of subtle tricks to delivering good ETL: making it run very fast, making it easy to manage, handling rule-level audit results, supporting high availability and reliable recovery, and even serving as the recovery process for the reporting solution (rather than database backups).
  • You can build your own ETL. The downside is that commercial ETL solutions have pre-built adapters (which you may not need anyway), and that custom ETL solutions tend to fail because few developers are familiar with the batch processing patterns involved (see your existing architecture). Since ETL patterns are not well documented, you are unlikely to succeed in writing your own ETL solution unless you bring in a developer with deep experience in this space.
  • When looking at commercial solutions note that the metadata and auditing results are the most valuable part of the solution: The GUI-based transform builders aren't really any more productive than just writing code - but the metadata can be more productive than reading code when it comes to maintenance.
  • Complex environments are difficult to handle with a single ETL product - because of network access, performance, latency, data format, security or other requirements incompatible with your ETL tool. So a combination of custom and commercial components often results anyway.
  • Open source solutions like Pentaho are really commercial solutions if you want support or critical features.

So, I'd probably go with a commercial product if you're pulling data from commercial apps, if the requirements (performance, etc.) are tough, or if you've got a junior or unreliable programming team. Otherwise you can write your own. In that case I'd get an ETL book or a consultant to help understand the typical functionality and approaches.

为你鎻心 2024-08-24 01:12:52

I've run data warehouses that were built on stored procedures, and I have used SSIS. Neither is that much better than the other, IMHO. The best tool I have heard of for managing the complexity of modern ETL is dbt, the data build tool (https://www.getdbt.com/). It has a ton of features that make things more manageable. Need to refresh a particular table in the reporting server? One command will rebuild it, including refreshing all the tables it depends on, back to the source. Need dynamic SQL? It offers Jinja for scripting your dynamic SQL in ways you never thought possible. Need version control for what's in your database? dbt has you covered. After all that, it's free.
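To give a feel for it, here is a hypothetical dbt model (a SQL file with Jinja); the model and column names are invented, but {{ ref() }} is the real mechanism dbt uses to build its dependency graph:

    -- models/reporting/fact_sales.sql (hypothetical dbt model)
    -- {{ ref() }} declares dependencies, so dbt knows what to rebuild and in what order.
    SELECT
        o.order_date,
        p.product_key,
        SUM(o.amount) AS total_amount
    FROM {{ ref('stg_orders') }}   AS o
    JOIN {{ ref('dim_products') }} AS p
      ON p.source_product_id = o.product_id
    GROUP BY
        o.order_date,
        p.product_key

With a model like that in place, dbt run --select +fact_sales rebuilds fact_sales together with every model upstream of it - the one-command refresh described above.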
