填充报告/数据仓库数据库的策略
对于我们的报告应用程序,我们有一个每晚将多个数据库聚合到一个“报告”数据库的流程。报告数据库的架构与我们聚合的单独“生产”数据库的架构完全不同,因此存在大量涉及数据聚合方式的业务逻辑。
现在这个过程是由几个每晚运行的存储过程来实现的。随着我们向报告数据库添加更多详细信息,存储过程中的逻辑变得越来越脆弱且难以管理。
可以使用哪些其他策略来填充此报告数据库?
- SSIS?这已经被考虑过,但似乎并没有提供比存储过程更干净、更易于维护的方法。
- 一个单独的 C#(或任何语言)进程来聚合内存中的数据,然后将其推送到报告数据库中?这将使我们能够为逻辑编写单元测试,并以更易于维护的方式组织代码。
我正在寻找有关上述内容的任何新想法或其他想法。谢谢!
For our reporting application, we have a process that aggregates several databases into a single 'reporting' database on a nightly basis. The schema of the reporting database is quite different than that of the separate 'production' databases that we are aggregating so there is a good amount of business logic that goes into how the data is aggregated.
Right now this process is implemented by several stored procedures that run nightly. As we add more details to the reporting database the logic in the stored procedures keeps growing more fragile and unmanageable.
What are some other strategies that could be used to populate this reporting database?
- SSIS? This has been considered but doesn't appear to offer a much cleaner, more maintainable approach that just the stored procedures.
- A separate C# (or whatever language) process that aggregates the data in memory and then pushes it into the reporting database? This would allow us to write Unit Tests for the logic and organize the code in a much more maintainable manner.
I'm looking for any new ideas or additional thoughts on the above. Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
我们的一般流程是:
具有完全相同的表
加载数据库中的结构
表,具有相同的结构
作为最终的事实/维度表
事实/维度表
SSIS 适合步骤 1,这或多或少是一个 1:1 复制过程,具有一些基本数据类型映射和字符串转换。
对于步骤 2,我们混合使用存储过程、.NET 和 Python。大多数逻辑都在过程中,其中包括外部代码中的大量解析。纯 TSQL 的主要好处是,转换通常依赖于加载数据库中的其他数据,例如,在 SQL JOIN 中使用映射表比在外部脚本中执行逐行查找过程要快得多,即使使用缓存也是如此。诚然,这只是我的经验,程序处理可能更适合您的数据集。
在某些情况下,我们确实必须进行一些复杂的(DNA 序列)解析,而 TSQL 并不是一个可行的解决方案。这就是我们使用外部 .NET 或 Python 代码来完成工作的地方。我想我们可以在 .NET 过程/函数中完成这一切并将其保存在数据库中,但还需要其他外部连接,因此单独的程序是有意义的。
第 3 步是一系列 INSERT...SELECT... 语句:速度很快。
所以总而言之,使用最适合工作的工具,不要担心把事情搞混。 SSIS 包(或多个包)是将存储过程、可执行文件以及您需要执行的任何其他操作链接在一起的好方法,因此您可以在一个位置设计、执行和记录整个加载过程。如果这是一个巨大的过程,您可以使用子包。
我知道你说 TSQL 感觉尴尬是什么意思(实际上,我发现它比其他任何东西都更重复),但它对于数据驱动的操作来说非常非常快。所以我的感觉是,在TSQL中进行数据处理,在外部代码中进行字符串处理或其他复杂操作。
Our general process is:
tables with exactly the same
structure in a loading database
table, which have the same structure
as the final fact/dimension tables
the fact/dimension tables
SSIS is good for step 1, which is more or less a 1:1 copy process, with some basic data type mappings and string transformations.
For step 2, we use a mix of stored procs, .NET and Python. Most of the logic is in procedures, with things like heavy parsing in external code. The major benefit of pure TSQL is that very often transformations depend on other data in the loading database, e.g. using mapping tables in a SQL JOIN is much faster than doing a row-by-row lookup process in an external script, even with caching. Admittedly, that's just my experience, and procedural processing might be better for syour data set.
In a few cases we do have to do some complex parsing (of DNA sequences) and TSQL is just not a viable solution. So that's where we use external .NET or Python code to do the work. I suppose we could do it all in .NET procedures/functions and keep it in the database, but there are other external connections required, so a separate program makes sense.
Step 3 is a series of INSERT... SELECT... statements: it's fast.
So all in all, use the best tool for the job, and don't worry about mixing things up. An SSIS package - or packages - is a good way to link together stored procedures, executables and whatever else you need to do, so you can design, execute and log the whole load process in one place. If it's a huge process, you can use subpackages.
I know what you mean about TSQL feeling awkward (actually, I find it more repetitive than anything else), but it is very, very fast for data-driven operations. So my feeling is, do data processing in TSQL and string processing or other complex operations in external code.
我会再看一下 SSIS。虽然有一个学习曲线,但它可以非常灵活。它支持许多不同的数据操作方式,包括存储过程、ActiveX 脚本和各种文件操作方式。它能够处理错误并通过电子邮件或日志记录提供通知。基本上,它应该能够处理几乎所有事情。另一种选择是自定义应用程序,可能需要做更多的工作(SSIS 已经涵盖了很多基础知识)并且仍然很脆弱 - 对数据结构的任何更改都需要重新编译和重新部署。我认为对 SSIS 包进行更改可能会更容易。对于一些更复杂的逻辑,您甚至可能需要使用多个阶段 - 自定义 C# 控制台程序来稍微操作数据,然后使用 SSIS 包将其加载到数据库。
SSIS 学起来有点痛苦,而且肯定有一些技巧可以充分利用它,但我认为这是一项值得的投资。一两本好的参考书可能是一笔不错的投资(Wrox 的 Expert SQL Server 2005 Integration Services 也不错)。
I would take another look at SSIS. While there is a learning curve, it can be quite flexible. It has support for a lot of different ways to manipulate data including stored procedures, ActiveX scripts and various ways to manipulate files. It has the ability to handle errors and provide notifications via email or logging. Basically, it should be able to handle just about everything. The other option, a custom application, is probably going to be a lot more work (SSIS already has a lot of the basics covered) and is still going to be fragile - any changes to data structures will require a recompile and redeployment. I think a change to your SSIS package would probably be easier to make. For some of the more complicated logic you might even need to use multiple stages - a custom C# console program to manipulate the data a bit and then an SSIS package to load it to the database.
SSIS is a bit painful to learn and there are definitely some tricks to getting the most out of it but I think it's a worthwhile investment. A good reference book or two would probably be a good investment (Wrox's Expert SQL Server 2005 Integration Services isn't bad).
我会研究 ETL(提取/转换/加载)最佳实践。您询问的是购买与构建、特定产品和特定技术。首先备份几个步骤可能是值得的。
一些考虑因素:
因此,如果从商业应用程序中提取数据,如果要求(性能等)很严格,或者如果您有一个初级或不可靠的编程团队,我可能会选择商业产品。否则你可以自己写。在这种情况下,我会得到一本 ETL 书籍或顾问来帮助理解典型的功能和方法。
I'd look at ETL (extract/transform/load) best practices. You're asking about buying vs building, a specific product, and a specific technique. It's probably worthwhile to backup a few steps first.
A few considerations:
So, I'd probably go with a commercial product if pulling data from commercial apps, if the requirements (performance, etc) are tough, or if you've got a junior or unreliable programming team. Otherwise you can write your own. In that case I'd get an ETL book or consultant to help understand the typical functionality and approaches.
我运行过基于存储过程构建的数据仓库,并且使用过 SSIS。恕我直言,这也不比另一个好多少。我听说过的管理现代 ETL 复杂性的最佳工具称为数据构建工具 (DBT) (https:// www.getdbt.com/)。它具有大量使事情变得更易于管理的功能。需要刷新报告服务器中的特定表?一个命令将重建它,包括将它依赖的所有表刷新回源。需要动态 SQL?这使 Jinja 能够以您从未想过的方式编写动态 SQL 脚本。需要对数据库中的内容进行版本控制吗? DBT 可以满足您的需求。毕竟,它是免费的。
I've run data warehouses that were built on stored procedures, and I have used SSIS. Neither is that much better than the other IMHO. The best tool I have heard of to manage the complexity of modern ETL is called Data Build Tool (DBT) (https://www.getdbt.com/). It has a ton of features that make things more manageable. Need to refresh a particular table in the reporting server? One command will rebuild it, including refreshing all the tables it depends on back to the source. Need dynamic SQL? This offers Jinja for scripting your dynamic SQL in ways you never thought possible. Need version control for what's in your database? DBT has you covered. After all that, it's free.