当前位置：文江博客话题详情

数据库仓库设计：事实表和维度表

发布于 2024-09-03 22:22:51 字数 1110 浏览 12 评论 0原文

我正在使用 RDBMS 构建一个穷人的数据仓库。我已确定要记录的关键“属性”为：

性别（真/假）
人口统计分类（A、B、C 等）
出生地点
出生日期
体重（每天记录）：正在记录的事实

我的要求是能够运行“OLAP”查询，使我能够：

“切片和切块”
“向上/向下钻取”数据，并且
通常能够从不同的角度查看数据

在阅读完该主题领域后，普遍的共识是似乎最好使用维度表而不是标准化表来实现。

假设这个断言是正确的（即解决方案最好使用事实和维度表来实现），我想在这些表的设计中寻求一些帮助。

“自然”（或明显）维度包括：

日期维度
地理位置

具有层次属性。然而，我正在努力解决如何对以下字段进行建模：

性别（真/假）
人口统计分类（A、B、C 等）

我在这些字段上挣扎的原因是：

它们没有明显的分层属性来帮助聚合(AFAIA) - 这表明它们应该在事实表中
它们大多是静态的或很少变化 - 这表明它们应该在维度表中。

也许我上面使用的启发式太粗糙了？

我将给出一些关于我想对数据仓库进行的分析类型的示例 - 希望这能进一步澄清问题。

我想按性别和人口统计分类汇总和分析数据 - 例如回答以下问题：

不同人口统计分类中男性和女性的体重如何比较？
哪个人口统计分类（男性和女性）显示本季度体重增加最多。

等等。

任何人都可以澄清性别和人口统计分类是否是事实表的一部分，或者它们是否（正如我怀疑的）维度表。？

另外假设它们是维度表，有人可以详细说明表结构（即字段）吗？

“显而易见”的模式：

CREATE TABLE sex_type (is_male int);
CREATE TABLE demographic_category (id int, name varchar(4));

可能不是正确的。

原文

I am building a poor man's data warehouse using a RDBMS. I have identified the key 'attributes' to be recorded as:

sex (true/false)
demographic classification (A, B, C etc)
place of birth
date of birth
weight (recorded daily): The fact that is being recorded

My requirements are to be able to run 'OLAP' queries that allow me to:

'slice and dice'
'drill up/down' the data and
generally, be able to view the data from different perspectives

After reading up on this topic area, the general consensus seems to be that this is best implemented using dimension tables rather than normalized tables.

Assuming that this assertion is true (i.e. the solution is best implemented using fact and dimension tables), I would like to seek some help in the design of these tables.

'Natural' (or obvious) dimensions are:

Date dimension
Geographical location

Which have hierarchical attributes. However, I am struggling with how to model the following fields:

sex (true/false)
demographic classification (A, B, C etc)

The reason I am struggling with these fields is that:

They have no obvious hierarchical attributes which will aid aggregation (AFAIA) - which suggest they should be in a fact table
They are mostly static or very rarely change - which suggests they should be in a dimension table.

Maybe the heuristic I am using above is too crude?

I will give some examples on the type of analysis I would like to carryout on the data warehouse - hopefully that will clarify things further.

I would like to aggregate and analyze the data by sex and demographic classification - e.g. answer questions like:

How does male and female weights compare across different demographic classifications?
Which demographic classification (male AND female), show the most increase in weight this quarter.

etc.

Can anyone clarify whether sex and demographic classification are part of the fact table, or whether they are (as I suspect) dimension tables.?

Also assuming they are dimension tables, could someone elaborate on the table structures (i.e. the fields)?

The 'obvious' schema:

CREATE TABLE sex_type (is_male int);
CREATE TABLE demographic_category (id int, name varchar(4));

may not be the correct one.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

苍白女子 2024-09-10 22:22:56

通常，所有数值数量和度量都是事实表中的列。那么其他一切都是维度属性。它们属于哪个维度是相当务实的，取决于数据。

除了您已经收到的建议之外，我没有看到任何提及退化维度的内容。在这些情况下，需要将每个事实都不同的发票号或序列号时间戳等内容存储在事实中，否则维度表将与事实表变为 1-1。

如果研究正在进行，您的案例中的一个关键设计决策可能是分析与年龄相关的数据。因为人的年龄会随着时间的推移而变化，在某个时候他们会进入另一个年龄段。根据研究开始时分组是否固定，这可能会决定您想要如何聚合。我不一定是说您应该有一个群体维度并通过它来了解年龄，但您可能需要在 ETL 期间确定正确的年龄/人口统计维度。但这取决于最终用途（或适应与事实表链接的二维角色 - 初始人口统计数据，永远不会改变，当前人口统计数据将随着时间而变化）。

类似的事情也适用于地理。虽然您显然可以通过分析当前地理位置随时间的变化来跟踪一个人的地理位置，但维度 DW 的要点是将所有相关维度直接链接到事实（通常可以通过标准化模型通过网络导出的内容）。实体关系模型 - 这些在 ETL 时被锁定）。这种冗余使得对传统 RDBMS 中的维度模型的分析更快。

请注意，其中很多内容不适用于像 Teradata 这样的大规模并行 DW，它们在星型模式下表现不佳 - 它们喜欢将所有数据标准化并链接到同一个主索引，因为它们是用于分布数据的主索引处理单元。

回复收藏 0 原文

も让我眼熟你 2024-09-10 22:22:56

您打算使用什么 OLAP/表示层工具？它们通常具有自己的功能来支持多维数据集、层次结构、聚合等的构建。

规范形式通常是灵活高效的数据仓库的最可靠基础，尽管 Mart 有时会进行非规范化以支持一组特定的报告要求。在没有任何其他信息的情况下，我建议您确保您的数据库至少采用 Boyce-Codd / 第五范式。

回复收藏 0 原文

以可爱出名 2024-09-10 22:22:55

星型模式搜索在 SQL 中相当于维恩图的交点。正如您的示例查询清楚地显示的那样，SEX_TYPE 和 DEMOGRAPHIC_CATEGORY 是您要搜索的集合，因此必须是维度。

至于表结构，我认为您对 SEX_TYPE 的设计是错误的。对于初学者来说，基于

where sex_type.name = 'FEMALE'

比

where sex_type.is_male = 1

设计查询更容易、更直观。此外，在现实世界中，性别不是布尔值。大多数应用程序也应该收集“未知”和“跨性别”信息，对于健康/医疗应用程序来说确实如此，而这正是您似乎正在做的事情。此外，如果你有女同事，这也可以避免一些不愉快的办公室争吵。

编辑

“我正在考虑如何处理
新的性别类型和人口统计案例
尚未包含在的类别
数据库”

查询优化器可以使用这些元数据来导出最有效的搜索路径。当有大量数据和临时查询需要处理时，这一点尤其重要。处理新的维度值总是很困难，除非您的源系统向您提供通知，这实际上取决于您的设置。

Star schema searches are the SQL equivalent of the intersection points of Venn Diagrams. As your sample queries clearly show, SEX_TYPE and DEMOGRAPHIC_CATEGORY are sets you want to search by and hence must be dimensions.

As for the table structures, I think your design for SEX_TYPE is misguided. For starters it is easier, more intuitive, to design queries on the basis of

where sex_type.name = 'FEMALE'

than

where sex_type.is_male = 1

Besides, in the real world sex is not a boolean. Most applications should gather UNKNOWN and TRANSGENDER as well, and that's certainly true for health/medical apps which is what you seem to be doing. Furthermore, it will avoid some unpleasant office arguments if you have any female co-workers.

Edit

"I am thinking of how to deal with
cases of new sex_types and demographic
categories not already in the
database"

There was a vogue for not having foreign keys in Data Warehouses. But they provide useful metadata which a query optimizer can use to derive the most efficient search path. This is particularly important when there is a lot of data and ad hoc queries to process. Dealing with new dimension values is always going to be hard, unless your source systems provide you with notification. This really depends on your set-up.

回复收藏 0 原文