数据库仓库设计:事实表和维度表

发布于 2024-09-03 22:22:51 字数 1110 浏览 12 评论 0原文

我正在使用 RDBMS 构建一个穷人的数据仓库。我已确定要记录的关键“属性”为:

  • 性别(真/假)
  • 人口统计分类(A、B、C 等)
  • 出生地点
  • 出生日期
  • 体重(每天记录): 正在记录的事实

我的要求是能够运行“OLAP”查询,使我能够:

  • “切片和切块”
  • “向上/向下钻取”数据,并且
  • 通常能够从不同的角度查看数据

在阅读完该主题领域后,普遍的共识是似乎最好使用维度表而不是标准化表来实现。

假设这个断言是正确的(即解决方案最好使用事实和维度表来实现),我想在这些表的设计中寻求一些帮助。

“自然”(或明显)维度包括:

  • 日期维度
  • 地理位置

具有层次属性。然而,我正在努力解决如何对以下字段进行建模:

  • 性别(真/假)
  • 人口统计分类(A、B、C 等)

我在这些字段上挣扎的原因是:

  1. 它们没有明显的分层属性来帮助聚合(AFAIA) - 这表明它们应该在事实表中
  2. 它们大多是静态的或很少变化 - 这表明它们应该在维度表中。

也许我上面使用的启发式太粗糙了?

我将给出一些关于我想对数据仓库进行的分析类型的示例 - 希望这能进一步澄清问题。

我想按性别和人口统计分类汇总和分析数据 - 例如回答以下问题:

  • 不同人口统计分类中男性和女性的体重如何比较?
  • 哪个人口统计分类(男性和女性)显示本季度体重增加最多。

等等。

任何人都可以澄清性别和人口统计分类是否是事实表的一部分,或者它们是否(正如我怀疑的)维度表。?

另外假设它们是维度表,有人可以详细说明表结构(即字段)吗?

“显而易见”的模式:

CREATE TABLE sex_type (is_male int);
CREATE TABLE demographic_category (id int, name varchar(4));

可能不是正确的。

I am building a poor man's data warehouse using a RDBMS. I have identified the key 'attributes' to be recorded as:

  • sex (true/false)
  • demographic classification (A, B, C etc)
  • place of birth
  • date of birth
  • weight (recorded daily): The fact that is being recorded

My requirements are to be able to run 'OLAP' queries that allow me to:

  • 'slice and dice'
  • 'drill up/down' the data and
  • generally, be able to view the data from different perspectives

After reading up on this topic area, the general consensus seems to be that this is best implemented using dimension tables rather than normalized tables.

Assuming that this assertion is true (i.e. the solution is best implemented using fact and dimension tables), I would like to seek some help in the design of these tables.

'Natural' (or obvious) dimensions are:

  • Date dimension
  • Geographical location

Which have hierarchical attributes. However, I am struggling with how to model the following fields:

  • sex (true/false)
  • demographic classification (A, B, C etc)

The reason I am struggling with these fields is that:

  1. They have no obvious hierarchical attributes which will aid aggregation (AFAIA) - which suggest they should be in a fact table
  2. They are mostly static or very rarely change - which suggests they should be in a dimension table.

Maybe the heuristic I am using above is too crude?

I will give some examples on the type of analysis I would like to carryout on the data warehouse - hopefully that will clarify things further.

I would like to aggregate and analyze the data by sex and demographic classification - e.g. answer questions like:

  • How does male and female weights compare across different demographic classifications?
  • Which demographic classification (male AND female), show the most increase in weight this quarter.

etc.

Can anyone clarify whether sex and demographic classification are part of the fact table, or whether they are (as I suspect) dimension tables.?

Also assuming they are dimension tables, could someone elaborate on the table structures (i.e. the fields)?

The 'obvious' schema:

CREATE TABLE sex_type (is_male int);
CREATE TABLE demographic_category (id int, name varchar(4));

may not be the correct one.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

苍白女子 2024-09-10 22:22:56

通常,所有数值数量和度量都是事实表中的列。那么其他一切都是维度属性。它们属于哪个维度是相当务实的,取决于数据。

除了您已经收到的建议之外,我没有看到任何提及退化维度的内容。在这些情况下,需要将每个事实都不同的发票号或序列号时间戳等内容存储在事实中,否则维度表将与事实表变为 1-1。

如果研究正在进行,您的案例中的一个关键设计决策可能是分析与年龄相关的数据。因为人的年龄会随着时间的推移而变化,在某个时候他们会进入另一个年龄段。根据研究开始时分组是否固定,这可能会决定您想要如何聚合。我不一定是说您应该有一个群体维度并通过它来了解年龄,但您可能需要在 ETL 期间确定正确的年龄/人口统计维度。但这取决于最终用途(或适应与事实表链接的二维角色 - 初始人口统计数据,永远不会改变,当前人口统计数据将随着时间而变化)。

类似的事情也适用于地理。虽然您显然可以通过分析当前地理位置随时间的变化来跟踪一个人的地理位置,但维度 DW 的要点是将所有相关维度直接链接到事实(通常可以通过标准化模型通过网络导出的内容)。实体关系模型 - 这些在 ETL 时被锁定)。这种冗余使得对传统 RDBMS 中的维度模型的分析更快。

请注意,其中很多内容不适用于像 Teradata 这样的大规模并行 DW,它们在星型模式下表现不佳 - 它们喜欢将所有数据标准化并链接到同一个主索引,因为它们是用于分布数据的主索引处理单元。

Generally, all numeric quantities and measures are columns in the fact table(s). Then everything else is a dimensional attribute. Which dimension they belong in is rather pragmatic and depends on the data.

In addition to the suggestions you have already received, I saw no mention of degenerate dimensions. In these cases, things like an invoice number or sequence number timestamp which is different for every fact needs to be stored in the fact, otherwise the dimension table will become 1-1 with the fact table.

A key design decision in your case is probably the analysis of data related to age if the study is ongoing. Because people's ages change with time, they will move to another age group at some point. Depending on whether the groups are fixed at the beginning of a study or not, this may determine how you want to aggregate. I'm not necessarily saying you should have a group dimension and get to age through that, but that you may need to determine the correct age/demographic dimension during the ETL. But this depends on the end use (or accommodate both with two dimension roles linked from the fact table - initial demographics, which never changes, and current demographics which will change with time).

A similar thing could apply with geography. Although you can obviously track a person's geography by analysing current geography changes over time, the point of a dimensional DW is to have all the relevant dimensions linked straight to the fact (things which you might normally derive in a normalized model through the network of an Entity-Relationship model - these get locked in at the time of ETL). This redundancy makes analysis quicker on the dimensional model in traditional RDBMSes.

Note that a lot of this does not apply in massively parallel DW like Teradata, which don't perform well with star schemas - they like all the data normalized and linked up to the same primary index because they the primary index to distribute the data over the processing units.

も让我眼熟你 2024-09-10 22:22:56

您打算使用什么 OLAP/表示层工具?它们通常具有自己的功能来支持多维数据集、层次结构、聚合等的构建。

规范形式通常是灵活高效的数据仓库的最可靠基础,尽管 Mart 有时会进行非规范化以支持一组特定的报告要求。在没有任何其他信息的情况下,我建议您确保您的数据库至少采用 Boyce-Codd / 第五范式。

What OLAP / presentation tier tool are you intending to use? These often have features of their own to support building of cubes, hierarchies, aggregations, etc.

Normal Form is usually the most sound basis for a flexible and efficient Data Warehouse, although Marts are sometimes denormalized to support a specific set of reporting requirements. In the absence of any other information I suggest you aim to ensure your database is in at least Boyce-Codd / 5th Normal Form.

以可爱出名 2024-09-10 22:22:55

星型模式搜索在 SQL 中相当于维恩图的交点。正如您的示例查询清楚地显示的那样,SEX_TYPE 和 DEMOGRAPHIC_CATEGORY 是您要搜索的集合,因此必须是维度。

至于表结构,我认为您对 SEX_TYPE 的设计是错误的。对于初学者来说,基于

where sex_type.name = 'FEMALE'

where sex_type.is_male = 1

设计查询更容易、更直观。 此外,在现实世界中,性别不是布尔值。大多数应用程序也应该收集“未知”和“跨性别”信息,对于健康/医疗应用程序来说确实如此,而这正是您似乎正在做的事情。此外,如果你有女同事,这也可以避免一些不愉快的办公室争吵。

编辑

“我正在考虑如何处理
新的性别类型和人口统计案例
尚未包含在的类别
数据库”

查询优化器可以使用这些元数据来导出最有效的搜索路径。当有大量数据和临时查询需要处理时,这一点尤其重要。处理新的维度值总是很困难,除非您的源系统向您提供通知,这实际上取决于您的设置。

Star schema searches are the SQL equivalent of the intersection points of Venn Diagrams. As your sample queries clearly show, SEX_TYPE and DEMOGRAPHIC_CATEGORY are sets you want to search by and hence must be dimensions.

As for the table structures, I think your design for SEX_TYPE is misguided. For starters it is easier, more intuitive, to design queries on the basis of

where sex_type.name = 'FEMALE'

than

where sex_type.is_male = 1

Besides, in the real world sex is not a boolean. Most applications should gather UNKNOWN and TRANSGENDER as well, and that's certainly true for health/medical apps which is what you seem to be doing. Furthermore, it will avoid some unpleasant office arguments if you have any female co-workers.

Edit

"I am thinking of how to deal with
cases of new sex_types and demographic
categories not already in the
database"

There was a vogue for not having foreign keys in Data Warehouses. But they provide useful metadata which a query optimizer can use to derive the most efficient search path. This is particularly important when there is a lot of data and ad hoc queries to process. Dealing with new dimension values is always going to be hard, unless your source systems provide you with notification. This really depends on your set-up.

眼眸里的那抹悲凉 2024-09-10 22:22:54

不知道为什么您觉得使用 RDBMS 是穷人的解决方案,但希望这可能有所帮助。

weight_model_01. png

表dimGeography 和dimDemographic 是所谓的迷你维度;它们允许基于人口统计和地理进行切片,而无需加入 dimUser,并且还可以在测量时捕获用户当前的人口统计和地理。

顺便说一句,在 DW 世界中,详细信息 - Gender = 'female'、AgeGroup = '30-35'、EducationLevel = 'university' 等

Not sure why you feel that using RDBMS is poor man's solution, but hope this may help.

weight_model_01.png

Tables dimGeography and dimDemographic are so-called mini-dimensions; they allow for slicing based on demographic and geography without having to join dimUser, and also to capture user's current demographic and geography at the time of measurement.

And by the way, when in DW world, verbose -- Gender = 'female', AgeGroup = '30-35', EducationLevel = 'university', etc.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文