当前位置：文江博客话题详情

Oracle数据仓库设计——事实表充当维度？

发布于 2024-08-08 18:49:05 字数 1092 浏览 8 评论 0原文

谢谢：这里的两个答案都非常有帮助，但我只能选择一个。我非常感谢您的建议！

我们的数据仓库将更多地用于工作流程报告，而不是传统的分析报告。我们的用户关心“当前图片”远远超过历史。（尽管历史也很重要。）我们是一个政府实体，没有成本或相关计算。大多数情况下只是计算给定地点内具有相关历史的人数。

我们正在使用 Oracle，并且我发现只要有可能就使用星型连接具有明显的优势，并且希望将所有内容重新架构为尽可能类似于星型模式，这对我们的业务用途来说是合理的。在这个 DW 中，速度至关重要，并且许多测试已经向我证明了星型模式方法。

我们的“person”表是关键 - 它包含超过 400 万条记录并且将成为查询中最常用的来源。它可以在具有多个维度（如年龄、性别）的星形中心看到、隶属关系、地点等）。这是一个非常长的表，特别是当我将它连接到地址和联系信息时。

然而，当我们开始查看历史时，它更像是一个维度表。例如，有两个不同的历史表，它们都有一个指向 person 表的 person 键。其中一个拥有超过 2000 万条记录，另一个拥有近 5000 万条记录，并且每天都在增长。

该表是事实表还是维度表？一个人可以兼任两者吗？如果是这样，这会是一个很大的性能问题吗？查询更多维度而不是事实是否很常见？如果使用人员表作为维度的不同事实表实际上只有 60,000 条记录（小得多），会发生什么情况。

我认为我的问题是我们的数据及其使用不符合星型模式的常用示例。

澄清：下面添加了一些好的想法，但也许我遗漏了太多内容，无法真正很好地解释。以下是更多信息：

我们处理选民数据库。除了按不同群体划分的选民计数外，我们没有任何衡量标准：按政党、按年龄、按地点划分的选民计数；按选票类型和选举、选票状态和选举等进行选民计数。我们有“投票历史记录”日志以及活动审核日志（地址、政党等变更）。我们有关于哪些选民是选举工作人员的信息以及所有相关信息。我想稍后我会讨论外围的东西。

现在我专注于我们的两个主要“业务流程”：选民登记（即选民）和选举投票率。首先，选民是一个事实。在第二个维度中，选民是一个维度，还有政党、选举和选票类型。（以防万一有人担心 - 不，我们不知道人们如何投票。只是他们知道。哈哈）

我希望这能澄清一些事情。

原文

THANKS: Both answers here are very helpful, but I could only pick one. I really appreciate the advice!

our datawarehouse will be used more for workflow reports than traditional analytical reports. Our users care about "current picture" far more than history. (though history matters, too.) We are a government entity that does not have costs or related calculations. Mostly just counts of people within given locations and with related history.

We are using Oracle, and I have found distinct advantage in using the star join whenever possible and would like to rearchitect everything to as closely resemble the star schema as is reasonable for our business uses. Speed in this DW is vital, and a number of tests have already proven the star schema approach to me.

Our "person" table is key - it contains over 4 million records and will be the most frequently used source in queries. It can be seen at the center of a star with multiple dimensions (like age, gender, affiliation, location, etc.). It is a very LONG table, particularly when I join it to the address and contact information.

However, it is more like a dimension table when we start looking at history. For example, there are two different history tables that have a person key pointing to the person table. One has over 20 million records and the other has almost 50 million and grows daily.

Is this table a fact table or a dimension table? Can one work as both? If so, is that going to be a big performance problem? Is it common to query more off of a dimension than a fact? What happens if a DIFFERENT fact table that uses the person table as a dimension is actually only 60,000 records (much smaller.).

I think my problem is that our data and use of it does not fit with the commonly use examples of star schemas.

CLARIFICATION: Some good thoughts have been added below, but perhaps I left too much out to really explain well. Here's some more info:

We handle a voter database. We don't have any measures except voter counts by various groups: voter counts by party, by age, by location; voter counts by ballot type and election, by ballot status and election, etc. We do have a "voting history" log as well as an activity audit log (change of address, party, etc.). We have information on which voters are election workers and all that related information. I figure I'll get to the peripheral stuff later.

For now I'm focusing on our two major "business processes": voter registration(which IS a voter.) and election turnout. In the first, voter is a fact. In the second, voter is a dimension, along with party, election, and type of ballot. (and in case anyone is worried - no we don't know HOW people vote. Just that they do. LOL )

I hope that clarifies things a bit.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

雨夜星沙 2024-08-15 18:49:05

如果可能的话，我的建议是重构这些表，使它们更符合真正的星型模式。尽管 5000 万条记录听起来很多（考虑事务系统时），但我们有多个事实表，其中行数多达 5 亿。假设您的硬件是针对此类工作指定的，那么将表组合成一个大型事实表时应该不会有任何问题（假设它们都在同一主题区域内）。

话虽如此，请确保在选择高度非规范化结构时考虑应考虑的其他因素。星型模式对于报告数据来说是一种很好的设计，因为它减少了必要的连接，但是，在更新表和磁盘空间时，您通常会为此付出很大的代价。当您说您正在考虑将此架构用于更多工作流应用程序，而不是主要用于分析时，那么我会确保考虑到更新。是否需要实时或近实时更新？如果是这样，您可能不想考虑成为明星。

最后，是的，在某些情况下，我们只查询维度表，通常当应用程序需要特定的项目列表（即产品、客户等）时，这是一种有效的用途，但是，更好的解决方案可能会利用 ODS 而不是我们的星型模式。

我发现，尽管我试图让我的模式看起来像英蒙或金博尔教科书上的东西，但如果没有一些现实世界的定制，它几乎永远不会起作用。

编辑
我确实对 ODS 的提及更加具体。

操作数据存储（或“ODS”）是一种数据库，旨在集成多个来源的数据，以便更轻松地进行分析和报告。由于数据来自多个来源，因此集成通常涉及清理、解决冗余以及根据业务规则检查完整性。 ODS 通常被设计为包含具有有限历史记录的低级或原子（不可分割）数据（例如交易和价格），这些数据是“实时”或“接近实时”捕获的，而不是存储在 ODS 中的大量数据。数据仓库通常不太频繁。

根据该概念的创始人 Bill Inmon 的说法，ODS 是“面向主题的、集成的、易变的、当前值的、仅详细的数据集合，用于支持组织对最新、可操作的需求”。、综合的、集体的信息。”

ODS 与 Inmon 的企业数据仓库定义不同，其历史记录有限，而且更新比 EDW 更频繁。在实践中，ODS 往往更能反映源结构，以加快实施速度并提供更真实的生产数据表示。

http://en.wikipedia.org/wiki/Operational_data_store

If possible, my suggestion would be to refactor these tables so they are more in alignment with a true star schema. Although 50 million records sounds like a lot (when thinking about a transactional system) we have multiple fact tables with as many as 500 million rows. Assuming that your hardware was speced for this type of work, you should not have any issues with combining your tables into one large fact table (assuming they are all within the same subject area).

With that said, make sure that you account for the other factors that should be considered when choosing a highly denormalized structure. The star schema is a great design for reporting on data because of the reduction in the necessary joins, however, you often pay a large price for this while updating tables and in disk space. When you say that you are considering using this schema for more of a workflow application, rather than mostly analytics, then I would make sure to account for the updates. Are updates needed in real time or near real time? If so, again you may not want to consider a star.

Finally, yes in some cases we query only our dimension tables, often when an application needs a specific list of items (i.e. products, customers, etc), this is a valid use, however, a better solution would likely leverage an ODS rather than our star schema.

What I have found is as much as I try to make my schema look like something right out of an Inmon or Kimball textbook, it almost never works without some real world custimization.

Edit
I sure have been more specific with the reference to the ODS.

An operational data store (or "ODS") is a database designed to integrate data from multiple sources to make analysis and reporting easier. Because the data originates from multiple sources, the integration often involves cleaning, resolving redundancy and checking against business rules for integrity. An ODS is usually designed to contain low level or atomic (indivisible) data (such as transactions and prices) with limited history that is captured "real time" or "near real time" as opposed to the much greater volumes of data stored in the Data warehouse generally on a less frequent basis.

According to Bill Inmon, the originator of the concept, an ODS is "a subject-oriented, integrated, volatile, current-valued, detailed-only collection of data in support of an organization's need for up-to-the-second, operational, integrated, collective information."

ODS differ from Inmon's definition of enterprise data warehouse by having a limited history, and more frequent update than an EDW. In practice ODS tend to be more reflective of source structures in order to speed implementations and provide a truer representation of production data.

http://en.wikipedia.org/wiki/Operational_data_store

回复收藏 0 原文

近箐 2024-08-15 18:49:05

大“人”（客户）维度在电信、银行、保险等领域很常见。Kimball 在 CRM 第 (6) 章下有一个名为“大变化的客户维度”的部分。它展示了如何创建“迷你尺寸”。经常更改或经常分析的属性（列）被分解为单独的小维度表。这些小维度通过事实表连接，因此事实表对于每个表都有一个单独的 FK。

在我看来，你的例子与此很接近。

作为一般规则，维度表是很少变化的对象（人员、账户、时间、产品、商店）的查找表，而事实表捕获这些对象之间交互的活动（历史）。事实表包含您想要聚合的度量（总销售额、工作小时数、生产的零件数量等）。

澄清后：
我想说的是，投票者实际上是一个一致的维度——对于所有数据集市（业务流程）来说都是通用的。其他一致的维度包括：日期、政党、选举、投票站。迷你维度是人口统计和地理区域。
事实表将是：RegistrationEvent（谁何时何地注册）和 ElectionEvent（谁何时何地在哪次选举中投票，使用什么）。
Dimension Voter 和事实 RegistrationEvent 是从捕获选民注册和其他更改的操作系统加载的。
这是简化的，但我希望它能抓住基本的想法。

回复收藏 0 原文

雨巷深深 2024-08-15 18:49:05

好的 - 这不是一个完整的“答案”，但已经很接近了。

请注意这篇描述 Kimball 讲座的博客文章：
http://database-geek .com/2005/03/28/a-day-with-ralph-kimball-part-2/

我挣扎的原因是这是一个“堕落”的维度。我的选民身份和相关信息与我的“注册”事实表是一对一的。所以看来 Kimball 甚至可以将其放入事实表中。

所以现在我只是研究当一个事实表被另一个事实表使用时会发生什么。

编辑：另外，我发现谷歌搜索“怪物维度”一词非常有帮助。这很像一个缓慢变化的客户维度。只要我愿意雪花，我就可以实现我所需要的——查询选民时的星型连接，并且不会导致使用选民作为各种事实表的维度出现问题。

编辑：
这是我的最终结论：正如上面所建议的，重点是促进业务流程，而不是适应教科书图表。

我们的业务是这样的，绝对没有理由拆分选民表（有一个用于“注册”的事实表和一个用于“选民”的维度） - 当使用该表查询时，我们将需要所有属性以及所有属性标志和文本信息。我不想将属性单独分解为“事实”（就像金博尔在书中为客户和订单展示的那样），因为这些属性在附加到事实时和附加到维度时意味着不同的东西。此外，选民在其他多个地方也被用作一个属性，其中一些确实适合传统的明星。

我的主要目的是速度。所以我选择了一种修改后的格式 - 很像雪花 - 其中选民是多个表的中心，当我对所有内容进行正确索引时，Oracle 可以使用星型连接。然后，我使用选民作为所有其他“明星”的维度。在每种情况下，我都会进行设置，以便大多数（如果不是全部）表都可以使用星型连接进行连接，即使它不是“教科书”。

再次感谢您的帮助！