当前位置：文江博客话题详情

什么是 Dim，什么是 Fact？

发布于 2024-09-08 06:24:37 字数 4691 浏览 11 评论 0原文

我有一个应用程序，我知道它可以构成一个很棒的立方体，并且比标准的平面 Reporting Services 报告更有用。我们即将与顾问一起涉足 BI 领域，但我想在开始之前先尝试一下，主要是为了让我知道我们要做什么。

该应用程序跟踪全国疗养院的调查。它们可以是年度调查、投诉调查或几种其他类型的调查，它们有与给定标签相关的处罚，并有与之相关的文档。

我想做的是想出一种方法，让我们能够利用我们拥有的数据 - 六月佛罗里达州有多少个标签？有多少工厂按时交付了文件？与去年相比，今年第一季度发生了多少次年度（惊喜）调查？

我包含这些模式是希望有人不仅能够告诉我什么是模糊的、什么是事实，还可以告诉我什么数据去了哪里。我想这将是一个很好的开始。

任何事情都会非常有帮助。我正在尝试建立一个小型数据集市，同时正在研究 Kimball 的数据仓库生命周期工具包。

谢谢！ M@

实体表 - 我们所有设施的列表：主键是表示建筑物的五个字母代码

CREATE TABLE [dbo].[Entity](
 [entID] [varchar](10) NOT NULL,
 [entShortName] [varchar](150) NULL,
 [entNumericID] [int] NOT NULL,
 [orgID] [int] NOT NULL,
 [regionID] [int] NOT NULL,
 [portID] [int] NOT NULL,
 [busTypeID] [int] NOT NULL,
 [adpID] [varchar](50) NULL,
 [eHealthDataID] [varchar](50) NULL,
 [updateDate] [datetime] NULL CONSTRAINT [DF_Entity_updateDate]  DEFAULT (getdate()),
 [powProID] [int] NULL,
 [regionReportingID] [int] NULL,
 [regionPresEmail] [varchar](300) NULL,
 [regionClinDirEmail] [varchar](300) NULL,
 CONSTRAINT [PK_EntityNEW] PRIMARY KEY CLUSTERED 
(
 [entID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 75) ON [PRIMARY]
) ON [PRIMARY]

调查主要

CREATE TABLE [dbo].[surveyMain](
 [surveyID] [int] IDENTITY(1,1) NOT NULL,
 [surveyDateFac]  AS (([facility]+'-')+CONVERT([varchar],[surveyDate],(101))),
 [surveyDate] [datetime] NOT NULL,
 [surveyType] [int] NOT NULL,
 [surveyBy] [int] NULL,
 [facility] [varchar](10) NOT NULL,
 [originalSurvey] [int] NULL,
 [exitDate] [datetime] NULL,
 [dpnaDate]  AS (dateadd(month,(3),[exitDate])),
 [clearedTags] [varchar](1) NULL,
 [substantiated] [varchar](1) NULL,
 [firstRevisit] [int] NULL,
 [secondRevisit] [int] NULL,
 [thirdRevisit] [int] NULL,
 [fourthRevisit] [int] NULL,
 [updated] [datetime] NULL CONSTRAINT [DF_surveyMain_updated]  DEFAULT (getdate()),
 CONSTRAINT [PK_tagSurvey] PRIMARY KEY CLUSTERED 
(
 [surveyID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

调查类型：

CREATE TABLE [dbo].[surveyTypes](
 [surveyTypeID] [int] IDENTITY(1,1) NOT NULL,
 [surveyTypeDesc] [varchar](100) NOT NULL,
 CONSTRAINT [PK_surveyTypes] PRIMARY KEY CLUSTERED 
(
 [surveyTypeID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

调查文件

CREATE TABLE [dbo].[surveyFiles](
 [surveyFileID] [int] IDENTITY(1,1) NOT NULL,
 [surveyID] [int] NOT NULL,
 [surveyFilesTypeID] [int] NOT NULL,
 [documentDate] [datetime] NOT NULL,
 [responseDate] [datetime] NULL,
 [receiptDate] [datetime] NULL,
 [dateCertain] [datetime] NULL,
 [fileName] [varchar](250) NULL,
 [fileUpload] [image] NULL,
 [fileDesc] [varchar](100) NULL,
 [updated] [datetime] NOT NULL CONSTRAINT [DF_surveyFiles_updated]  DEFAULT (getdate()),
 CONSTRAINT [PK_surveyFiles] PRIMARY KEY CLUSTERED 
(
 [surveyFileID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 75) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

调查罚款

CREATE TABLE [dbo].[surveyFines](
 [surveyFinesID] [int] IDENTITY(1,1) NOT NULL,
 [surveyID] [int] NULL,
 [surveyFinesTypeID] [int] NULL,
 [dateRecommended] [datetime] NULL,
 [dateImposed] [datetime] NULL,
 [totalFineAmt] [varchar](100) NULL,
 [wasImposed] [varchar](3) NULL,
 [dateCleared] [datetime] NULL,
 [comments] [varchar](500) NULL,
 [updated] [datetime] NOT NULL CONSTRAINT [DF_surveyFines_updated]  DEFAULT (getdate()),
 CONSTRAINT [PK_surveyFines] PRIMARY KEY CLUSTERED 
(
 [surveyFinesID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 75) ON [PRIMARY]
) ON [PRIMARY]

调查标签

CREATE TABLE [dbo].[surveyTags](
 [seq] [int] IDENTITY(1,1) NOT NULL,
 [surveyID] [int] NOT NULL,
 [tagDescID] [int] NOT NULL,
 [tagStatus] [int] NULL,
 [scopesev] [varchar](5) NOT NULL,
 [comments] [varchar](1000) NULL,
 [clearedDate] [datetime] NULL,
 [updated] [datetime] NULL CONSTRAINT [DF_surveyTags_updated]  DEFAULT (getdate()),
 CONSTRAINT [PK_tagMain] PRIMARY KEY CLUSTERED 
(
 [seq] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

原文

I have an application that I know would make a great cube and would be useful for more than the standard flat Reporting Services report. We're about to jump into BI stuff with a consultant, but I'd like to give it a shot before we do, mostly so I know something of what we're going to do.

The application tracks surveys in nursing homes across the country. They can be annual, complaint, or several other type of survey, they have penalties associated with tags given, and have documentation associated with them.

What I'd like to do is come up with a way that will allow us to leverage the data we have - how many tags in florida for the month of June? How many facilities were on time delivering their documentation? How many annual(surprise) surveys happened in the 1st quarter of this year compared to last year?

I'm including the schemas in hopes that someone will be able to tell me not only what is dim and what is fact, but what data goes where. I figure that'll be a great start.

Anything would be really helpful. I'm trying to get a small data mart set up while I'm pouring through the Data Warehouse Lifecycle Toolkit by Kimball.

Thanks!
M@

The Entity table - a list of all of our facilities: Primary key is a five letter code denoting the building

CREATE TABLE [dbo].[Entity](
 [entID] [varchar](10) NOT NULL,
 [entShortName] [varchar](150) NULL,
 [entNumericID] [int] NOT NULL,
 [orgID] [int] NOT NULL,
 [regionID] [int] NOT NULL,
 [portID] [int] NOT NULL,
 [busTypeID] [int] NOT NULL,
 [adpID] [varchar](50) NULL,
 [eHealthDataID] [varchar](50) NULL,
 [updateDate] [datetime] NULL CONSTRAINT [DF_Entity_updateDate]  DEFAULT (getdate()),
 [powProID] [int] NULL,
 [regionReportingID] [int] NULL,
 [regionPresEmail] [varchar](300) NULL,
 [regionClinDirEmail] [varchar](300) NULL,
 CONSTRAINT [PK_EntityNEW] PRIMARY KEY CLUSTERED 
(
 [entID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 75) ON [PRIMARY]
) ON [PRIMARY]

Survey Main

CREATE TABLE [dbo].[surveyMain](
 [surveyID] [int] IDENTITY(1,1) NOT NULL,
 [surveyDateFac]  AS (([facility]+'-')+CONVERT([varchar],[surveyDate],(101))),
 [surveyDate] [datetime] NOT NULL,
 [surveyType] [int] NOT NULL,
 [surveyBy] [int] NULL,
 [facility] [varchar](10) NOT NULL,
 [originalSurvey] [int] NULL,
 [exitDate] [datetime] NULL,
 [dpnaDate]  AS (dateadd(month,(3),[exitDate])),
 [clearedTags] [varchar](1) NULL,
 [substantiated] [varchar](1) NULL,
 [firstRevisit] [int] NULL,
 [secondRevisit] [int] NULL,
 [thirdRevisit] [int] NULL,
 [fourthRevisit] [int] NULL,
 [updated] [datetime] NULL CONSTRAINT [DF_surveyMain_updated]  DEFAULT (getdate()),
 CONSTRAINT [PK_tagSurvey] PRIMARY KEY CLUSTERED 
(
 [surveyID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

Survey Types:

CREATE TABLE [dbo].[surveyTypes](
 [surveyTypeID] [int] IDENTITY(1,1) NOT NULL,
 [surveyTypeDesc] [varchar](100) NOT NULL,
 CONSTRAINT [PK_surveyTypes] PRIMARY KEY CLUSTERED 
(
 [surveyTypeID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

Survey Files

CREATE TABLE [dbo].[surveyFiles](
 [surveyFileID] [int] IDENTITY(1,1) NOT NULL,
 [surveyID] [int] NOT NULL,
 [surveyFilesTypeID] [int] NOT NULL,
 [documentDate] [datetime] NOT NULL,
 [responseDate] [datetime] NULL,
 [receiptDate] [datetime] NULL,
 [dateCertain] [datetime] NULL,
 [fileName] [varchar](250) NULL,
 [fileUpload] [image] NULL,
 [fileDesc] [varchar](100) NULL,
 [updated] [datetime] NOT NULL CONSTRAINT [DF_surveyFiles_updated]  DEFAULT (getdate()),
 CONSTRAINT [PK_surveyFiles] PRIMARY KEY CLUSTERED 
(
 [surveyFileID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 75) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

Survey Fines

CREATE TABLE [dbo].[surveyFines](
 [surveyFinesID] [int] IDENTITY(1,1) NOT NULL,
 [surveyID] [int] NULL,
 [surveyFinesTypeID] [int] NULL,
 [dateRecommended] [datetime] NULL,
 [dateImposed] [datetime] NULL,
 [totalFineAmt] [varchar](100) NULL,
 [wasImposed] [varchar](3) NULL,
 [dateCleared] [datetime] NULL,
 [comments] [varchar](500) NULL,
 [updated] [datetime] NOT NULL CONSTRAINT [DF_surveyFines_updated]  DEFAULT (getdate()),
 CONSTRAINT [PK_surveyFines] PRIMARY KEY CLUSTERED 
(
 [surveyFinesID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 75) ON [PRIMARY]
) ON [PRIMARY]

Survey Tags

CREATE TABLE [dbo].[surveyTags](
 [seq] [int] IDENTITY(1,1) NOT NULL,
 [surveyID] [int] NOT NULL,
 [tagDescID] [int] NOT NULL,
 [tagStatus] [int] NULL,
 [scopesev] [varchar](5) NOT NULL,
 [comments] [varchar](1000) NULL,
 [clearedDate] [datetime] NULL,
 [updated] [datetime] NULL CONSTRAINT [DF_surveyTags_updated]  DEFAULT (getdate()),
 CONSTRAINT [PK_tagMain] PRIMARY KEY CLUSTERED 
(
 [seq] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

伤痕我心 2024-09-15 06:24:37

我想做的是想出一种方法，让我们能够利用我们拥有的数据 - 佛罗里达州 6 月份有多少个标签？有多少工厂按时交付了文件？与去年相比，今年第一季度发生了多少次年度（惊喜）调查？

维度是一个测量范围。测量范围可以是连续的（如日期），也可以是离散的（如设施）。在您的问题中，维度分别是设施和日期、日期/时间和日期。

您可以回答“六月佛罗里达州有多少标签？”这个问题的唯一方法。是将标签与设施关联起来，将标签与日期关联起来。

您可以回答“有多少工厂准时交付文件？”这一问题的唯一方法。是将文件交付与融资以及融资到期日期关联起来。

您应该对您希望数据仓库回答的其余问题或查询遵循相同的分析过程。

事实是一个实体或对象。标签就是事实。文件交付是事实。事实一旦加载，在数据仓库中几乎总是不可变的。

至于你的模式，我必须更多地研究它才能给出具体的建议，但总的来说，你想使用 星型架构。星星的中心是你的事实、实体和物体。构成星形点的表格是维度表。

你需要做的第一件事就是区分你的事实和你的维度。您的任何实体表都不应包含日期、位置代码或您确定为维度的任何其他内容。但是，您的事实表将包含日期表、位置表或其他维度表的外键。

您可能还需要汇总表。汇总表包含与事实表相同的列，并添加了跨不同维度的一个或多个总和。例如，问题“六月佛罗里达州有多少个标签？”如果您已经拥有 2010 年 6 月（或每一天）佛罗里达州（或更准确地说，佛罗里达州的每个设施）的标签总和，则可以更快地回答。

您求和的时间段取决于您期望的查询的混合。在您的数据仓库中，一天可能太短。换句话说，在 SQL 中进行汇总与选择汇总行一样快。

您需要一个 日历表。日历表会提出这样的问题：“与去年（第一季度）相比，今年第一季度发生了多少次年度（意外）调查？”更容易查询。

What I'd like to do is come up with a way that will allow us to leverage the data we have - how many tags in Florida for the month of June? How many facilities were on time delivering their documentation? How many annual(surprise) surveys happened in the 1st quarter of this year compared to last year?

A dimension is a measurement range. The measurement range can be continuous, like dates, or discrete, like facilities. In your questions, the dimensions are facility and date, date/time, and date, respectively.

The only way you can answer the question "How many tags in Florida for the month of June?" is to associate tags with facilities and tags with dates.

The only way you can answer the question "How many facilities were on time delivering their documentation?" is to associate documentation delivery with facility and date due with facility.

You should follow this same analytical process with the rest of the questions or queries you expect the data warehouse to answer.

A fact is an entity or an object. A tag is a fact. Documentation delivery is a fact. Facts are almost always immutable in a data warehouse once they're loaded.

As to your schema, I'd have to study it more to give specific recommendations, but in general, you want to use a star schema. The center of the star(s) are your facts, entities, and objects. The tables that make up the points of the star are your dimension tables.

The first thing you need to do is separate your facts and your dimensions. None of your entity tables should contain dates, location codes, or whatever else you determine is a dimension. However, your fact tables will contain foreign keys to date tables, location tables, or other dimension tables.

You'll probably also need summary tables. Summary tables contain the same columns as your fact tables, with the addition of one or more sums across different dimensions. As an example, the question "How many tags in Florida for the month of June?" can be answered much quicker if you already have the sum of the tags for Florida (or, more properly, each facility in Florida) for the month (or each of the days) of June, 2010.

The period that you sum for depends on the mixture of queries that you expect. In your data warehouse, day might be too short a period. In other words, it's just as quick to do the summary in SQL as it is to select the summary row.

You'll need a calendar table too. A calendar table makes questions like, "How many annual(surprise) surveys happened in the 1st quarter of this year compared to (the 1st quarter of) last year?" much easier to query.

回复收藏 0 原文

玩套路吗 2024-09-15 06:24:37

对于支持论坛来说，这是一项艰巨的任务，因此我将只关注问题的一部分。
似乎一项调查可以由多次访问组成，因此我建议 factSurveyVisit 包含一次访问事件。 SurveyID 列在此模型中充当退化维度，对于同一调查的所有访问都是通用的。 SurveyVisitSequenceID 是一个独特的自动增量（整数），用于简化文档和标签的两个桥接表与事实表的链接。

您还可以将调查升级为全维度dimSurvey以添加一些注释等；使用 SurveyID 作为链接。

我没有在这里解决罚款问题，为此我建议 factFine 表，该表有自己的链接到 dimDate、dimTime、dimFacility 等，以便可以快速完成有关罚款 ($$) 的报告，而无需加入大多数与访问相关的表格。还应该有一个连接表，将 factFine 与 factSurveyVisit 连接起来，前提是罚款与每次访问相关，而不是与已完成的调查相关。

编辑

刚刚注意到您的标签表已清除date_cleared，所以诚然，我不了解此业务中的标签。在模型中，dimTag 只是可用标签的列表。可能还有一个链接 dimFacility 和 dimTag 的 factFacilityStatus 表，用于跟踪每个设施的标签状态。

回复收藏 0 原文

蓝天 2024-09-15 06:24:37

看起来您的每个调查都有多个罚款、文件和标签。

我期望有 4 个事实表 - 每个事实表中的事实看起来主要是日期时间数据（尽管这些通常被建模为日期和/或时间维度的角色 - 我在这里做了一些注释，但标志通常会消失位于维度中）：

SurveyMain

SurveyFine （wasImpose 位于与此事实相关的维度中，totalFineAmt 是此表中的事实）

SurveyFile

SurveyTag

他们都将共享一个调查维度，我将继续在以下位置共享一个实体/设施维度：每一个。您可以像雪花一样穿过调查维度，但这破坏了星型模型最有利的一点，即允许您直接获取所有数据，而不是通过桥接表。

您可以选择将调查类型放入其自己的维度（或者可能是垃圾维度）或通过调查维度（而不是通过雪花）访问它。这是维度建模的典型情况 - 您不需要遵循您的实体 - 您只需要避免太多维度和太少维度陷阱并观察维度的基数 - 特别是如果您不小心包含了一些退化维度，例如发票号码随每个事实而变化，因此需要存储在事实表中。

实际上，有时通过在 3NF 中执行典型的连接（创建典型的平面报告视图）然后简单地采用这些平面行并将它们变成星形来创建星形模型会更容易。（这就是实体关系模型与维度模型真正的相关性有多小）。因此，您可以在当前标准化键上将 SurveyMain 加入 SurveyTypes 和 SurveyFine 并查看所有列。这将是 SurveyFine 事实表的基础。我确定的其他事实表也是如此。共享的东西将是共享维度的候选者。实体是一致维度的良好候选者（即它将在这些调查模型和与您的企业相关的其他模型（例如人力资源模型或会计模型）之间共享）。

It looks like you have multiple Fines, Files and Tags for each survey.

I would expect 4 fact tables - with the facts in each looking like they are largely datetime data (although these are often modelled as roles of a date and/or time dimension - I've made a couple notes here, but flags are generally going to be in dimensions):

SurveyMain

SurveyFine (wasImposed is in a dimension linked to this fact, totalFineAmt is a fact in this table)

SurveyFile

SurveyTag

They would all share a Survey dimension, and I would go ahead and share an Entity/Facility dimension in each one. You could snowflake through the Survey dimension, but that defeats the most beneficial point of star models allowing you to get to all data directly instead of going though bridge tables.

You have an option of putting the survey type in it's own dimension (or a junk dimension, perhaps) or having it accessed through the Survey dimension (not through a snowflake). That's typical with dimensional modeling - you don't need to follow your entities - you just need to avoid the too many dimensions and too few dimensions trap and watch the cardinality of your dimensions - especially if you've accidentally included some degenerate dimension like an invoice number which changes with every fact and so needs to be stored in the fact table.

Actually, it's sometimes easier to do your star models by doing the typical joins in your 3NF which create typical flat reporting views and then simply taking those flat rows and turning them into stars. (That's how little relevance the entity-relationship model really has to the dimensional model). So you might join SurveyMain to SurveyTypes and SurveyFine on your current normalized keys and look at all the columns. This would be the basis for the SurveyFine fact table. Ditto for the other fact tables I identified. The shared stuff would be a candidate for shared dimensions. Entity is a good candidate for a conformed dimension (i.e. it's going to be shared between these survey models and other models related to your enterprise - like HR models or accounting models).

回复收藏 0 原文