有没有数据仓库框架?

发布于 2024-07-06 16:08:17 字数 857 浏览 9 评论 0原文

我有很多 mysql 数据,需要从中生成报告。 它主要是历史数据,因此不会发生太大变化,但其重量很容易达到 20-30 GB,并且预计会增长。 我目前有一组 php 脚本,可以执行一些复杂的查询并输出 csv 和 excel 文件。 我还使用 phpMyAdmin 进行书签查询。 我手动编辑它们来更改参数。 数据量不断增长,需要访问数据的人数也在不断增长,因此我正在抽出时间来改善这种情况。

前几天我开始阅读有关数据仓库的内容,这似乎是一个与我需要做的事情相关的领域。 我读过一些 文章,甚至正在等待一本书。 我想我正在了解这类系统的作用和可能性。

为我的数据创建一个报告系统一直是我的待办事项列表,但直到最近我才认为这将是一个非常利基的编程项目。 由于我现在知道数据仓库是很常见的事情,因此我认为必须有某种报告/仓库框架可用于简化开发。 我很乐意跳过编写接口和脚本来安排和发送电子邮件报告等,并坚持编写查询和建立关系。

我主要是一个灯迷,但我并不排斥切换语言或平台。 我只需要一个更强大的解决方案,因为我的一次性脚本不能很好地扩展。

那么从哪里开始比较好呢?

I've got a lot of mysql data that I need to generate reports from. It's mostly historic data so it won't be changing much, but it weighs in at 20-30 gigabytes easily and is expected to grow. I currently have a collection of php scripts that will do some complex queries and output csv and excel files. I also use phpMyAdmin with bookmarked queries. I manually edit them to change the parameters. The amount of data is growing and the number of people who need access to it is also growing, so I'm making the time to improve this situation.

I started reading about data warehousing the other day and it seems that this an area that relates to what I need to do. I've read some good articles and am even waiting on a book. I think I'm getting a handle on what these sorts of systems do and what's possible.

Creating a reporting system for my data has always been on a todo list, but until recently I figured it would be a highly niche programing venture. Since I now know data warehousing is a common thing, I figure there must be some sort of reporting/warehousing frames available to ease in the development. I'd gladly skip writing interfaces and scripts to schedule and email reports and the like and stick to writing queries and setting up relations.

I've mostly been a lamp guy, but I'm not above switching languages or platforms. I just need a more robust solution as my one off scripts don't scale well.

So where's a good place to get started?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

笑梦风尘 2024-07-13 16:08:17

我将讨论有关{预算、商业效用函数、时间框架}范围的几点。 链接到的架构概念化。

为了方便起见,让我们遵循您在    WikipediaDataWarehouseArticle

  • 操作数据库层

    数据仓库的源数据 - 标准化为仅在一个地方进行数据维护

  • 数据访问层

    将源数据转换为信息访问层。
    用于提取、转换数据并将数据加载到仓库的 ETL 工具就属于这一层。

  • 信息访问层

      • 便于报告的数据结构

          此处不维护数据。 它只是您源数据的反映

          因此,非规范化结构(包含重复但系统导出的数据)

          通常在这里最有效

      • 报告工具

          您如何真正允许您的用户访问数据

          • 预制报告(简单)

          • 更动态的切片访问方法

方法       用于报告和分析的数据以及用于报告和分析数据的工具

        落入这一层。 以及 Inmon-Kimball 关于设计方法的差异,

        维基百科文章后面讨论的内容都与这一层有关。

  • 元数据层(促进自动化、组织等)

推出自己的(低端)

只需很少的自付费用,只需认识到对非规范化结构的需求就可以为那些不使用它的人带来一些效率

参与其中(需要一些支出)

您不需要立即使用平台的所有功能。

然而,在我看来,您希望处于一个您知道会增长的平台上,并且在竞争激烈且整合的 BI 环境中,这似乎是四大企业级大型供应商之一(我认为)

  • 微软(该平台 110 名员工的公司)
  • SAP
  • Oracle
  • IBM

我们拥有   BiMarketStateArticle

我的公司位于此阶段,使用 SQL Server Integration Services (SSIS) 提供的一些 ETL 功能以及开源的一些替代用法,但实际上需要“数据访问层”中的 Talend 产品许可,这是一种非规范化的报告结构(完全在基本 SQL Server 数据库)和 SQL Server Reporting Services (SSRS),以在很大程度上自动(根据您的技能)生成预先指定的报告。 请注意,SSRS“报告”仅仅是一个(可扩展的)XML 配置/规范,它在运行时通过 SSRS 引擎呈现。 诸如导出到 Excel 文件之类的选择是简单的选项。

认真的承诺(需要一些重要的人类承诺)

注意上面我们还没有利用数据挖掘/动态切片/切块
SQL Server Analysis Services 的功能。 我们正在为此努力,
但现在专注于提高“数据访问层”中数据清理的质量。

我希望这可以帮助您了解从哪里开始寻找。

I'll discuss a few points on the {budget, business utility function, time frame} spectrum out there. For convenience, let's follow the architecture conceptualization you linked to at

    WikipediaDataWarehouseArticle

  • Operational database layer

    The source data for the data warehouse - Normalized for In One Place Only data maintenance

  • Data access layer

    The transformation of your source data into your informational access layer.
    ETL tools to extract, transform, load data into the warehouse fall into this layer.

  • Informational access layer

      • Report-facilitating Data Structure

          Data is not maintained here. It is merely a reflection of your source data

          Hence, denormalized structures (containing duplicate, but systematically derived data)

          are usually most effective here

      • Reporting tools

          How do you actually allow your users access to the data

          • pre-canned reports (simple)

          • more dynamic slice-and-dice access methods

        The data accessed for reporting and analyzing and the tools for reporting and analyzing data

        fall into this layer. And the Inmon-Kimball differences about design methodology,

        discussed later in the Wikipedia article, have to do with this layer.

  • Metadata layer (facilitates automation, organization, etc)

Roll your own (low-end)

For very little out-of-pocket cost, just recognizing the need for the denormalized structures can buy those that are not using it some efficiencies

Get in the ballgame (some outlays required)

You don't need to use all the functionality of a platform right off the bat.

IMO, however, you want to be on a platform that you know will grow, and in the highly competitive and consolidating BI environment, that seems to be one of the four enterprise mega-vendors (my opinion)

  • Microsoft (the platform of our 110 employee firm)
  • SAP
  • Oracle
  • IBM

    BiMarketStateArticle

My firm is at this stage, using some of the ETL capability offered by SQL Server Integration Services (SSIS) and some alternate usage of the open source, but in practice license requiring Talend product in the "Data Access Layer", a denormalized reporting structure (implemented completely in the basic SQL Server database), and SQL Server Reporting Services (SSRS) to largely automate (based on your skill) the production of pre-specified reports. Note that an SSRS "report" is merely a (scalable) XML configuration/specification that gets rendered at runtime via the SSRS engine. Choices such as export to an excel file are simple options.

Serious Commitment (some significant human commitment required)

Notice above that we have yet to utilize the data mining/dynamic slicing/dicing
capabilities of SQL Server Analysis Services. We are working toward that,
but now focused on improving the quality of our data cleansing in the "Data Access Layer".

I hope this helps you to get a sense of where to start looking.

她如夕阳 2024-07-13 16:08:17

我认为您应该首先查看 Kimball 和 Inmon,看看您是否想以特定方式处理数据仓库。 Kimball 特别为仓库的建模和构建提供了一个非常好的框架。

I think you should first check out Kimball and Inmon and see if you want to approach your data warehouse in a particular way. Kimball, in particular, lays out a very good framework for the modelling and construction of the warehouse.

笑看君怀她人 2024-07-13 16:08:17

Pentaho 已经整合了一套相当全面的产品。 这些产品是“免费”的,但一旦你交出你的身份信息,就要做好迎接通常的大卖的准备。

我没有机会真正扩展它们,因为我们是一家微软商店,从一个悲伤的结局到另一个悲伤的结局。

Pentaho has put together a pretty comprehensive suite of products. The products are "free", but be prepared for the usual heavy sell once you fork over your identifying information.

I haven't had a chance to really stretch them as we're a Microsoft shop from one sad end to the other.

梦在夏天 2024-07-13 16:08:17

有许多工具试图完成设计、实施和管理/操作数据仓库的过程,它们各自都有自己的优点和缺点,而且价格往往相差很大。 如果您对金博尔和/或英蒙营地的仓储原则有深入的了解,那么在幕后,您总是会过得最好。

除了 Kalido 和Wherescape RED 等工具(它们以非常不同的方式执行类似的操作)之外,许多 ETL 平台现在都对实现的关键工作(SCD 组件等和沿袭跟踪)提供了良好的内置支持。

最好将所有这些视为在你(工匠)手中使用的工具,它们使某些简单的事情变得更加容易(甚至微不足道),一些困难的事情变得更容易,但有些事情他们只是以恕我直言;)首先学习方法和原理并充分理解它们,然后您就会知道要从工具包中应用哪些工具以及何时应用......

There are a number of tools which try to make the process of designing, implementing and managing/operating a Data Warehouse and they each have their strengths and weaknesses and often vastly differing price points. Under the covers you are always going to be best off if you have a good knowledge of warsehousing principles from the Kimball and/or Inmon camps.

As well as tools like Kalido and Wherescape RED (which do similar thing in very different ways), many of the ETL platforms now have good in-built support for the donkey work of implementation - SCD components etc and lineage tracking.

Best though to view all these as tools to be used in the hands of you, the craftsman, they make certain easy things even easier (or even trivial), some hard things easier but some things they just get in they way of IMHO ;) Learn the methodology and principles first and get a good understanding of them and then you will know which tools to apply from your kitbag and when...

┾廆蒐ゝ 2024-07-13 16:08:17

它已经有一段时间没有更新了,但是有一个很好的数据仓库/ETL Ruby 包,名为 ActiveWarehouse

但我会查看 Pentaho 产品,就像 Nick 在另一个答案中提到的那样。 它应该可以轻松处理您拥有的数据量,并且可以为您提供比您想象的更多的数据切片和切块方法。

It hasn't been updated in a while but there's a nice Data Warehousing/ETL Ruby package called ActiveWarehouse.

But I would check out the Pentaho products like Nick mentioned in another answer. It should easily handle the volume of data you have and may provide you with more ways to slice and dice your data than you could have ever imagined.

梨涡 2024-07-13 16:08:17

目前您可以获得的最佳框架是Anchor Modeling
它可能看起来相当复杂,因为它具有通用结构和内置的数据历史记录功能。
而且建模技术与 ERD 有很大不同。
但是您最终会使用 sql 代码来生成所有数据库对象,包括 3NF 视图和:

  • 由触发器处理的插入/更新
  • 查询历史上的任何点/范围
  • ,应用程序开发人员将看不到底层的 6NF 锚定模型。

该技术是开源的,目前是无与伦比的。

如果您有 AM 问题,您可能想在该标签上提问

The best framework you can currently get is Anchor Modeling.
It might look quite complex because of it's generic structure and built-in capability to historize data.
Also modeling technique is quite different than ERD.
But you end-up with sql code to generate all db objects including 3NF views and:

  • insert/update handled by triggers
  • query any point/range in history
  • you application developers will not see underlying 6NF anchor model.

The technology is open sourced and at the moment is unbeatable.

If you would have AM question you may want to ask on that tag .

ら栖息 2024-07-13 16:08:17

Kimball 是更简单的数据仓库方法。

我们使用 Informatica 来移动数据,但默认情况下它不会执行诸如索引之类的 DW 操作。
我喜欢Wherescape RED 的想法,它作为一个DW 工具并使用MS SQL 的链接服务器来消除对ETL 工具的需要。

Kimball is the simpler method for data warehousing.

We use Informatica for moving data around, but it doesn't do DW things like indexing by default.
I like the idea of Wherescape RED, as a DW tool and using MS SQL's Linked Servers to obviate the need for an ETL tool.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文