关系数据库与维度数据库,有什么区别?

发布于 2024-08-31 20:51:41 字数 970 浏览 1 评论 0原文

我正在尝试了解 OLAP 和数据仓库,但我对关系建模和维度建模之间的区别感到困惑。维度建模基本上是关系建模,但允许冗余/非标准化数据吗?

例如,假设我有(产品、城市、销量)的历史销售数据。我知道以下将是一个关系的观点:

Product | City | # Sales
Apples, San Francisco, 400
Apples, Boston, 700
Apples, Seattle, 600
Oranges, San Francisco, 550
Oranges, Boston, 500
Oranges, Seattle, 600

虽然以下是一个更维度的观点:

Product | San Francisco | Boston | Seattle
Apples, 400, 700, 600
Oranges, 550, 500, 600

但似乎这两种观点仍然会在相同的星型模式中实现:

Fact table: Product ID, Region ID, # Sales
Product dimension: Product ID, Product Name
City dimension: City ID, City Name

并且直到你开始向差异开始出现的每个维度添加一些额外的细节。例如,如果您还想跟踪区域,关系数据库往往会拥有一个单独的区域表,以便使所有内容保持规范化:

City dimension: City ID, City Name, Region ID
Region dimension: Region ID, Region Name, Region Manager, # Regional Stores

而维度数据库将允许非规范化以将区域数据保留在城市维度内,为了更容易地分割数据:

City dimension: City ID, City Name, Region Name, Region Manager, # Regional Stores

这是正确的吗?

I'm trying to learn about OLAP and data warehousing, and I'm confused about the difference between relational and dimensional modeling. Is dimensional modeling basically relational modeling, but allowing for redundant/un-normalized data?

For example, let's say I have historical sales data on (product, city, # sales). I understand that the following would be a relational point-of-view:

Product | City | # Sales
Apples, San Francisco, 400
Apples, Boston, 700
Apples, Seattle, 600
Oranges, San Francisco, 550
Oranges, Boston, 500
Oranges, Seattle, 600

While the following is a more dimensional point-of-view:

Product | San Francisco | Boston | Seattle
Apples, 400, 700, 600
Oranges, 550, 500, 600

But it seems like both points of view would nonetheless be implemented in an identical star schema:

Fact table: Product ID, Region ID, # Sales
Product dimension: Product ID, Product Name
City dimension: City ID, City Name

And it's not until you start adding some additional details to each dimension that the differences start popping up. For instance, if you wanted to track regions as well, a relational database would tend to have a separate region table, in order to keep everything normalized:

City dimension: City ID, City Name, Region ID
Region dimension: Region ID, Region Name, Region Manager, # Regional Stores

While a dimensional database would allow for denormalization to keep the region data inside the city dimension, in order to make it easier to slice the data:

City dimension: City ID, City Name, Region Name, Region Manager, # Regional Stores

Is this correct?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

音栖息无 2024-09-07 20:51:41

星型模式实际上位于数据的关系模型和数据的维度模型的交叉点。这实际上是一种从维度模型开始并将其映射到 SQL 表的方法,该表有点类似于从关系模型开始时获得的 SQL 表。

我说有点相似是因为许多关系设计方法会导致标准化设计,或者至少是接近标准化的设计。星型模式将与完全标准化有很大的不同。

每次偏离完全标准化都会带来随之而来的数据更新异常。 (我将插入、更新和删除操作的异常归为一类)。这些异常与您开始使用的数据模型没有任何关系。

关于 OLTP 与 OLAP 的评论与此相关。在这两种情况下,更新异常将对性能和/或编程难度产生不同的影响。

除了 SQL 数据库中的星型模式之外,还有维度数据库产品,它们以该产品独有的物理形式存储数据。对于这些产品,您不会看到星型模式,而是会看到维度模型的直接实现,以及可能是产品特有的界面。其中一些界面允许完全通过点击操作进行 OLAP 操作。

顺便说一句,我曾经构建了一个星型模式,作为支持基于事务的应用程序的 OLTP 数据库和 Cognos PowerPlay 内的数据立方体之间的中间步骤。使用标准 ETL 技术,从 OLTP 数据库到星型模式,然后从星型模式到数据立方体的组合传输实际上优于从 OLTP 数据库到数据立方体的直接传输。这是一个意想不到的结果。

希望这有帮助。

A star schema really lies at the intersection of the relational model of data and the dimensional model of data. It's really a way of starting with a dimensional model, and mapping it into SQL tables that somewhat resemble the SQL tables you get if you start from a relational model.

I say somewhat resemble because many relational design methodologies result in a normalized design, or at least a nearly normalized design. A star schema will have significant departures from full normalization.

Every departure from full normalization carries with it a consequent data update anomaly. (I'm including anomlaies on insert, update and delete operations under one umbrella). Those anomalies don't have anything to do with what data model you started with.

The comment on OLTP versus OLAP is relevant here. Update anomalies will have different impacts on performance and/or programming difficulty in those two situations.

In addition to a star schema in an SQL databaase, there are dimensional database products out there that store data in a physical form that is unique to that product. With those products, you don't see a star schema so much as you see a direct implementation of the dimensional model, and an interface that might be peculiar to the product. Some of those interfaces allow OLAP operations to be completely point-and-click.

Just as a digression from your question, I once built a star schema as an intermediate step between an OLTP database that supported a transaction based application and a datacube inside Cognos PowerPlay. Using standard ETL techniques, the combined transfer from the OLTP database to the star schema and then from the star schema to the data cube actually outperformed the direct transfer from the OLTP database to the datacube. This was an unexpected result.

Hope this helps.

郁金香雨 2024-09-07 20:51:41

简而言之,OLTP 规范化数据库是从最优化的“事务”角度设计的。数据库经过标准化以最佳地适应事务系统。当我说事务系统的优化时,我的意思是……达到数据库结构的设计状态,其中所有事务操作(如删除、插入、更新和选择)都得到平衡,以便在任何时间点对所有这些操作给予同等或最佳的重要性。 .因为它们在交易系统中具有同等价值。

标准化系统提供的内容是数据更新可能的最小更新、新条目可能的最小插入、类别删除的单处删除等(例如新产品类别)...所有这一切都是可能的,我们分支一个创建主表.....但这是以“选择”操作延迟为代价的..但正如我所说,它的(标准化)不是所有操作的最有效模型..它的“最佳”......话虽如此,我们得到了其他方法提高数据获取速度..如索引等

另一方面,维度模型(主要用于数据仓库设计)..意味着仅重视一种操作,即数据选择...如数据仓库中的操作房屋..数据更新/插入定期发生..并且是一次性成本。

因此,如果尝试调整标准化数据结构,以便在任何时间点只有选择才是最重要的操作……我们最终将得到非标准化(我会说部分非标准化)……维星结构。

  • 所有外键都放在一个地方事实
    -没有维度到维度连接(即主表到主表连接)..雪花代表相同的维度

    • 理想设计的事实仅包含数字..度量或外键
    • 维度用于携带描述和不可聚合的信息
    • 数据冗余被忽略......但在极少数情况下,如果维度本身增长太多。雪花设计被视为选项......但这仍然是可以避免的

有关详细信息,请阅读有关此主题的详细书籍。

In simple words OLTP normalized database are designed with most optimal "transactional" point of view. Databases are normalized to work optimally to a transactional system. When I say optimization of transactional system i mean ..getting to a design state of database structure where all transactional operations like delete,insert,update and select are balanced to give equal or optimum importance to all of them at any point of time...as they are equally valued in a transactional system.

And that what a normalized system offer ..minimal updates possible for a data update,minimal insert possible for new entry,one place delete for category deletion etc (e.g. new product category )...all this is possible a we branch a create master tables .....but this comes at the cost of "select" operation delay ..but as I said its(normalization) not most efficient model for all operations ..its "Optimal"...having said we get other methods to enhance data fetching speed..like indexing etc

On the other hand Dimensional model (mostly used for data-ware house design)..meant for giving importance to only one kind of operations thats Selection of data...as in data-ware houses ..data update/insertion happens periodically ..and its a one time cost.

So if one try to tweak normalized data structure so that only selection is the most important operation at any point in time ...we will end up getting a denormalized (I would say partially denormalized)..dimensional star structure.

  • all the foreign keys a one place Fact
    -no dimension to dimension join (i.e. master to master table join)..snowflake represent same dimension

    • ideally designed facts carry only numbers ..measures or foreign keys
    • dimension are used to carry description and non aggregatable info
    • redundancy of data is ignored ...but in rare cases if Dimensions itself grow too much .snowflake design is seen as option..but that still is avoidable

For details please go through detailed books on this topic.

懷念過去 2024-09-07 20:51:41

我最近刚刚阅读了维度数据建模和关系数据建模之间的区别,因为我们在存储企业数据仓库 (EDW) 的企业中主要使用关系模型。

根据 Steve Hoberman 在他的《数据建模变得简单》一书中所说,这两种模型之间的区别是:

  • 关系数据模型捕获业务部分如何工作的业务解决方案,也称为业务流程
  • 维度数据模型捕获业务细节需要回答有关其表现如何的问题

可以说,关系模型也可以用作回答业务问题的基础,但只是在战术层面上。 “由于信用冻结,客户 x 有多少订单处于未履行状态?”但区别在于报告问题何时需要表格的“原生粒度”以及报告问题何时可以用汇总数据来回答。

在上面的两个示例中,它们实际上都是维度数据建模的示例,因为这两个表都没有以其“本机粒度”存储销售订单,因此没有捕获创建销售订单的业务流程。这两个表之间的唯一区别在于,在第二个表中,城市维度已转置到事实表中。

I have just recently read up on the difference between Dimensional and Relational Data Modeling since we primarily use Relational models at my business where we store an Enterprise Data Warehouse (EDW).

According to Steve Hoberman in his book "Data Modeling Made Simple" the distinction between the 2 types of models is this:

  • Relational Data Models captures the business solution for how part of the business works, a.k.a business process
  • Dimensional Data Models capture the details the business needs to answer questions about how well it is doing

It can be argued that a relational model can also be used as a foundation upon which to answer business questions, but at a tactical level. "How many orders are in an unfulfilled state for customer x due to credit hold?" But the distinction is that of where the reporting question needs the 'native grain' of the table and when the reporting question can be answered with summarized data.

In your above 2 examples they are actually both examples of Dimensional data modeling since neither of the 2 tables are storing the Sales Order at its 'native grain', and therefore does not capture the business process of creating a sales order. The only difference between the 2 tables is that in the 2nd table the city dimension has been transposed into the fact table.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文