What should I look out for when building an OLAP solution from scratch?

Posted 2024-09-19 00:26:11

I'm working for a company running a software product based on an MS SQL database server, and over the years I have developed 20-30 quite advanced reports in PHP, taking data directly from the database. This has been very successful, and people are happy with it.

But it has some drawbacks:

  • For new changes, it can be quite development-intensive
  • The user can't experiment much with the data - it is locked to a hard-coded view
  • It can be slow for big reports

I am considering gradually moving to an OLAP-based approach that can be queried from Excel or some web-based service. But I would like to do this in a way that introduces the least amount of new complexity into the IT environment - the fewest different services, synchronization jobs, etc.!

I have some questions in this regard:

1) Workflow-related:

  • What is a good development route from "black box SQL server" to "OLAP ready to use"?
  • Which servers and services should be set up, and which scripts should be written?
  • Which are the hardest/most critical/most time-intensive parts?

2) ETL:

  • I suppose it is best to have separate servers for the data warehouse and the production SQL Server?
  • How are these kept in sync (push/pull)? Using which technologies/languages?
  • For me SSIS looks overly complicated, and the graphical workflow doesn't appeal much to me -- I would rather like a text based script that does the job. Is this feasible?
  • Or is it advantageous to use the graphical client with only one source and one destination?

3) Development:

  • How much of this (data integration, analysis services) can be efficiently maintained from a CLI-tool?
  • Can the setup be transferred back and forth between production and development easily?

I'm happy with any answer that covers just some of this - and even though it is a MS environment, I'm also interested to hear about advantages in other technologies.


韵柒 2024-09-26 00:26:11

I only have experience with Microsoft OLAP, so here are my two cents regarding what I know:

  1. If you are implementing cubes, then separate the production SQL Server from the source for the cubes. Cubes require a lot of SELECT DISTINCT column_name FROM source.table. You don't want cube processing to block your mission critical production system.

  2. Although you can implement OLAP cubes with standard relation tables, you will quickly find that unless your data is a ledger-style system you will probably need to fully reprocess your fact and dimension tables and this will require requerying the source database over and over again. That's a large argument for building a separate data warehouse that uses ledger-style transactions for the fact tables. For instance, if a customer orders something and then cancels it, your source system may track this as a status change. In your fact table, you probably need to show this as a row for ordering that has a positive quantity and revenue stream and a row for cancelling that has a negative quantity and revenue stream.

  3. OLAP may be overkill for your environment. The main issue you appear to raise is that your reports are static and users want access to the data directly. You could build a data model and give users Report Builder access in SSRS, or report writing access in some other BI suite like Cognos, Business Objects, etc. I don't generally recommend this approach since it is way beyond what most users should have to know to get data, but in a small shop this may be sufficient and it is easy to implement. Let's face it -- users generally just want to get the data into Excel to manipulate it further. So if you don't want to give them a web front-end and you just want them to get to the data from Excel, you could give them direct database access to a copy of the production data. The downside of this approach is that users don't generally understand SQL or database relationships. OLAP helps you avoid forcing users to learn SQL or relationships, but it isn't easy to implement on your end. If you only have a couple of power users who need this kind of access, it could be easy enough to teach them how to do basic queries in Excel against the database, and they will be happy to get this tomorrow. OLAP won't be ready by tomorrow.

  4. If you only have a few kinds of source data systems, you could get away with building a super-dynamic static report. For instance, I have a report that was written in C# that basically allows users to select as many columns as they want from a list of 30 columns and filter the data on a few date range fields and field filter lists. This simple report covers about 40% of all ad hoc report requests from end-users since it covers all the basic, core customer metrics and fields. We recently moved this report to SSRS and that allowed us to up the number of fields to about 100 and improved the overall user experience. Regardless of the reporting platform, it is possible to give users some dynamic flexibility even in the confines of a static reporting system.

  5. If you only have a couple of databases, you can probably backup and restore the databases as your ETL. However, if you want to do anything beyond that, then you might as well bite the bullet and use SSIS (or some other ETL tool). Once you get into ETL for data warehousing, you are going to use a graphic-oriented design tool. Coding works well for applications, but ETL is more about workflows and that's why the tools tend to converge on a graphical UI. You can work around this and try to code a data warehouse from a text editor, but in the end you are going to lose out on a lot. See this post for more details on the differences between loading data from code and loading data from SSIS.
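The ledger-style fact rows described in point 2 can be sketched as follows. This is purely illustrative (the row shape and field names are invented, and a real fact table would live in the warehouse, not in Python), but it shows why additive rows let the cube aggregate correctly without ever updating a row in place:

```python
# Illustrative sketch (not an actual warehouse schema): an order and its
# later cancellation become two ledger-style fact rows, so aggregations
# stay purely additive and no row is ever updated in place.
from dataclasses import dataclass

@dataclass
class FactRow:
    order_id: int
    event: str      # "order" or "cancel"
    quantity: int   # negative for cancellations
    revenue: float  # negative for cancellations

fact_table = [
    FactRow(1001, "order",   2,  59.98),   # customer orders two units
    FactRow(1001, "cancel", -2, -59.98),   # later cancels the whole order
    FactRow(1002, "order",   1,  19.99),
]

def net_measures(rows):
    """Aggregate the ledger: additive sums give the current truth."""
    qty = sum(r.quantity for r in rows)
    rev = round(sum(r.revenue for r in rows), 2)
    return qty, rev

print(net_measures(fact_table))  # the cancelled order nets out to zero
```

A status-change source system would instead overwrite order 1001's state, which is exactly what forces full cube reprocessing.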
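The "super-dynamic static report" idea from point 4 can also be sketched in a few lines: one generic routine that lets the user pick columns and apply simple filters, instead of one hard-coded report per request. The column and field names here are invented for illustration:

```python
# Toy sketch of a column-picking, filterable report over a fixed dataset.
# In the real scenario the rows would come from the database, not a list.
from datetime import date

ROWS = [
    {"customer": "Acme", "region": "EU", "order_date": date(2024, 1, 5), "revenue": 100.0},
    {"customer": "Beta", "region": "US", "order_date": date(2024, 2, 9), "revenue": 250.0},
    {"customer": "Acme", "region": "EU", "order_date": date(2024, 3, 1), "revenue": 75.0},
]

def dynamic_report(rows, columns, date_from=None, date_to=None, filters=None):
    """Return only the requested columns for rows passing all filters."""
    out = []
    for row in rows:
        if date_from and row["order_date"] < date_from:
            continue
        if date_to and row["order_date"] > date_to:
            continue
        if filters and any(row[k] not in allowed for k, allowed in filters.items()):
            continue
        out.append({c: row[c] for c in columns})
    return out

report = dynamic_report(ROWS, ["customer", "revenue"],
                        date_from=date(2024, 2, 1),
                        filters={"region": {"US", "EU"}})
```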

FEEDBACK ON HOW TO USE CUBES WITH A RELATIONAL DATA STORE

It is possible to implement a cube over a relational data store, but there are some major problems with using this approach. The main reason it is technically feasible has to do with how you configure your DSV. The DSV is essentially a logical layer between the physical database and the cube/dimension definitions. Instead of importing the relational tables into the DSV, you could define Named Queries or create views in the database that flatten the data.
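The "flatten via a view instead of copying tables" idea can be sketched like this, with SQLite standing in for SQL Server and an ordinary view standing in for a DSV Named Query (table and column names are invented):

```python
# Sketch: a view joins the relational tables into the flat shape the
# cube's fact/dimension definitions expect, without copying any data.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
INSERT INTO products VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO orders   VALUES (10, 1, 3), (11, 2, 5);

-- The 'Named Query': flatten the normalized tables for the cube.
CREATE VIEW v_fact_orders AS
SELECT o.order_id, p.name AS product_name, p.category, o.qty
FROM orders o
JOIN products p ON p.product_id = o.product_id;
""")

rows = con.execute("SELECT product_name, qty FROM v_fact_orders ORDER BY order_id").fetchall()
```

In SSAS the equivalent is defining the Named Query in the DSV (or a view in the source database) rather than importing the base tables.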

The advantages of this approach are as follows:

  1. It is relatively easy to implement since you don't have to build an entire ETL subsystem to get started with OLAP.

  2. This approach works well for prototyping how you want to build a more long-term solution. You can prototype it in 1-2 days and show some of the benefits of OLAP today.

  3. Some very, very large tables don't have to be completely duplicated just to support an OLAP cube. I have several multi-billion row tables that are almost completely standardized fact tables. The only columns they don't have are date keys, and they also contain some NULL values on fields that shouldn't have nulls at all. Instead of duplicating these very massive tables, you can create the surrogate date keys and set values for the nulls in the view or named query. If you aren't going to see a huge performance boon from duplicating the table, then it may be a candidate for leaving in a more raw format in the database itself.
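The "fix it in the view instead of duplicating the table" trick from point 3 above can be sketched as follows, again with SQLite standing in for SQL Server and invented names: derive a surrogate date key and substitute a default member for NULLs, so the huge base table stays as-is:

```python
# Sketch: the view supplies what the cube needs (an integer date key,
# no NULLs) without materializing a copy of the big fact table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE big_fact (id INTEGER PRIMARY KEY, event_date TEXT, region TEXT, amount REAL);
INSERT INTO big_fact VALUES
  (1, '2024-03-15', 'EU',  10.0),
  (2, '2024-03-15', NULL,  20.0);  -- a NULL the cube cannot tolerate

CREATE VIEW v_big_fact AS
SELECT id,
       CAST(strftime('%Y%m%d', event_date) AS INTEGER) AS date_key,  -- surrogate date key
       COALESCE(region, 'Unknown') AS region,                        -- default member for NULL
       amount
FROM big_fact;
""")

rows = con.execute("SELECT date_key, region, amount FROM v_big_fact ORDER BY id").fetchall()
```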

The disadvantages of this approach are as follows:

  1. If you haven't built a true Kimball method data warehouse, then you probably aren't tracking transactions in a ledger-style. Kimball method fact tables (at least as I understand them) always change values by adding and subtracting rows. If someone cancels part of an order, you can't update the value in the cube for the single transaction. Instead, you have to balance out the transaction with a negative value. If you have to update the transaction, then you will have to fully reprocess the partition of the cube to replace the value which can be a very expensive operation. Unless your source system is a ledger-style transaction system, you will probably have to build a ledger-style copy in your ETL subsystem.

  2. If you don't build a Kimball method data warehouse, then you are probably using unobscured and possibly non-integer primary keys in your database. This directly impacts query performance inside the cube. It also sets you up for having a theoretically inflexible data warehouse. For instance, if you have a product ordering system that uses an integer key and you start using a second product ordering system, either as a replacement for the legacy system or in tandem with it, you may struggle to combine the data merely through the DSV, since each system has different data points, metrics, workflows, data types, etc. Worse, if they have the same data type for the order id and the order id values overlap between systems, then you must declare a surrogate key that you can use across both systems. This can be difficult, but not impossible, to implement without using a flattened data warehouse.

  3. You may have to build the system twice if you start with the relational data store and then move to a flattened database. Frankly, I think the amount of duplicated work is trivial. Most of what you learned building the cube off a relational data store will translate to setting up the new OLAP cube. The main problem, though, is that you will probably create a new cube altogether, and then any users of the old cube will have to migrate to the new cube. Any reports built in SSRS or Excel will probably break at that point and need to be rewritten from the ground up. So the main cost of rebuilding the cube is really in rebuilding dependent objects -- not in rebuilding the cube itself.
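The overlapping-key problem from point 2 above can be sketched as a warehouse minting surrogate keys from (source system, natural key) pairs. This is a toy in-memory version; in a real warehouse the mapping would be a key-map table maintained by the ETL:

```python
# Sketch: two source systems reuse the same integer order ids, so the
# warehouse assigns a surrogate key per (source_system, natural_key).

def make_key_mapper():
    mapping = {}
    counter = iter(range(1, 10**9))
    def surrogate_key(source_system: str, natural_key: int) -> int:
        # Reuse the existing surrogate if seen before, else mint a new one.
        return mapping.setdefault((source_system, natural_key), next(counter))
    return surrogate_key

sk = make_key_mapper()
legacy_key = sk("legacy", 42)   # order 42 in the legacy system
new_key    = sk("new",    42)   # a *different* order 42 in the new system

assert legacy_key != new_key            # the overlap is resolved
assert sk("legacy", 42) == legacy_key   # the mapping is stable on re-lookup
```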

Let me know if you want me to expand on any of the above points. Good luck.

⊕婉儿 2024-09-26 00:26:11

You're basically asking the million dollar question of "How do I build a DWH". This is not really a question that can decisively be answered.

Nevertheless, here is a kickstart:

If you are looking for a minimum viable product, be aware that you are in a data environment, and not a pure software one. In data-heavy environments, it is much harder to build a product incrementally, because the amount of effort needed to introduce changes in the system is much greater. Think about it as if every change you make in a piece of software had to be somehow backwards-compatible with anything you've ever done. Now you understand the hell Microsoft is in :-).

Also, data systems involve many third-party tools such as DBs, ETL tools and reporting platforms. The choices you make should be viable for the expected development of your system, else you might have to completely replace these tools down the road.

While you could start with a DB clone based on simple copy SQL and then aggregate it or push it into an OLAP cube, I would recommend getting your hands dirty with a real ETL tool from the start. This is especially true if you foresee the need to grow. 9 out of 10 times, the need will grow.

MS-SQL is a good choice for a DB if you don't mind the cost. The natural ETL tool would be SSIS, and it's a solid tool as well.

Even if your first transformations are merely "take this table and dump it in there", you still gain a lot in terms of process management (has the job run? what happens if it fails? etc.) and debugging. Also, it is easier to grow organically as requirements and/or special cases have to be dealt with.
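The process-management benefit can be sketched as a thin wrapper around every ETL step: you always know whether a job ran, succeeded, or failed, and failures are recorded instead of vanishing. The job names and the in-memory log are illustrative stand-ins for whatever your ETL tool provides:

```python
# Minimal sketch of ETL process management: wrap each step, record status.
import time

job_log = []  # stand-in for a job-history table

def run_job(name, step):
    """Run one ETL step, recording its status and duration."""
    start = time.time()
    try:
        step()
        status = "succeeded"
    except Exception as exc:
        status = f"failed: {exc}"
    job_log.append({"job": name, "status": status,
                    "duration_s": round(time.time() - start, 3)})
    return status

def copy_orders():   # pretend this extracts and loads a table
    pass

def copy_refunds():  # pretend the source is unreachable
    raise RuntimeError("source offline")

run_job("copy_orders", copy_orders)
run_job("copy_refunds", copy_refunds)
```

Tools like SSIS give you this bookkeeping (plus retries, alerting, and scheduling) out of the box, which is a large part of the answer's argument for using one.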
