为数据仓库设置 Dim 和 Fact 表

发布于 2024-07-12 00:37:45 字数 391 浏览 8 评论 0原文

我的任务是为客户创建一个数据仓库。 所涉及的表格并没有真正遵循传统的示例(产品/订单),因此我需要一些帮助来入门。 客户端本质上是案件的处理中心(类似于法律案件)。 每天,新病例都会输入数据库的“病例”表下。 每列都包含一些与案例相关的信息。 在处理案例时,会使用与案例相关的事件填充其他一对多表。 这些事件表有很多,示例表可能是:(case-open、case-dept1、case-dept2、case-dept3 等)。 每个表都有一个 caseid,映射回“cases”表。 还涉及一些查找表。

目前,报告需求涉及暴露各个阶段的瓶颈,并且对于流程的某些区域,粒度是小时级别。

我可能在这里要求太多,但我正在寻找一些关于如何设置我的 Dim 和 Fact 表的方向或您可能有的任何其他建议。

I'm tasked with creating a datawarehouse for a client. The tables involved don't really follow the traditional examples out there (product/orders), so I need some help getting started. The client is essentially a processing center for cases (similar to a legal case). Each day, new cases are entered into the DB under the "cases" table. Each column contains some bit of info related to the case. As the case is being processed, additional one-to-many tables are populated with events related to the case. There are quite a few of these event tables, example tables might be: (case-open, case-dept1, case-dept2, case-dept3, etc.). Each of these tables has a caseid which maps back to the "cases" table. There are also a few lookup tables involved as well.

Currently, the reporting needs relate to exposing bottlenecks in the various stages and the granularity is at the hour level for certain areas of the process.

I may be asking too much here, but I'm looking for some direction as to how I should setup my Dim and Fact tables or any other suggestions you might have.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

羅雙樹 2024-07-19 00:37:45

事实表是案例事件,它是“无事实的”,因为它没有数值。 维度可以是时间、事件类型、案例,也可能是其他一些维度,具体取决于系统中的其他数据。

您需要将事件表合并到一个事实表中,并标有“事件类型”维度。 吞吐量/瓶颈报告正在计算给定情况下事件类型的特定组合的事件时间之间的差异。

报告应该计算事件-事件时间,并可能将它们放入直方图中。 您还可以标记某些类型的事件组合并将标签应用于感兴趣的事件。 然后可以针对这些事件记录时间,这样就可以使用 OLAP 工具对时间进行切片操作。

如果您想对生命周期进程中的某些阶段进行基准测试,您将有一个包含案例类型、事件类型 1、事件类型 2、基准时间的表。

通过一些操作,您也许能够使用数据挖掘工具包甚至简单的回归分析来发现案例属性和事件-事件时间 (YMMV) 之间的相关性。

The fact table is the case event and it is 'factless' in that it has no numerical value. The dimensions would be time, event type, case and maybe some others depending on what other data is in the system.

You need to consolidate the event tables into a single fact table, labelled with an 'event type' dimension. The throughput/bottleneck reports are calculating differences between event times for specific combinations of event types on a given case.

The reports should calculate the event-event times and possibly bin them into a histogram. You could also label certain types of event combinations and apply the label to the events of interest. These events could then have the time recorded against them, which would allow slice-and-dice operations on the times with an OLAP tool.

If you want to benchmark certain stages in the life-cycle progression you would have a table that goes case type, event type1, event type 2, benchmark time.

With a bit of massaging, you might be able to use a data mining toolkit or even a simple regression analysis to spot correlations between case attributes and event-event times (YMMV).

那小子欠揍 2024-07-19 00:37:45

我建议你看看 Kimball 的书,特别是这本书,其中应该有一些让您思考在您的问题领域中的应用的示例。

无论如何,您需要确定维度模型是否合适。 很可能将 3NF 数据库“企业数据仓库”视为具有不同的索引或摘要等。

如果没有看到您当前的架构,真的很难说。 听起来你最终会得到几个明星模型,并通过一些一致的尺寸将它们绑在一起。 因此,您可能会将外壳尺寸作为您的一致尺寸之一。 来自每个其他表的事实实际上是链接到一致维度和适合事实的任何其他维度的事实表,因此,例如,如果案例开放中有一个员工 ID,则它将链接到员工一致维度,来自案例开放事实表。 这个一致的维度可能会与您的多个辅助事实表链接多次。

Kimball 的建模方法相当简单,可以像菜谱一样遵循。 您需要首先识别所有事实,将它们分组到事实表中,识别每个事实表上的各个维度,然后根据需要将它们分组到维度表中,并识别每个维度的类型。

I suggest you check out Kimball's books, particularly this one, which should have some examples to get you thinking about applications to your problem domain.

In any case, you need to decide if a dimensional model is even appropriate. It is quite possible to treat a 3NF database 'enterprise data warehouse' with different indexes or summaries, or whatever.

Without seeing your current schema, it's REALLY hard to say. Sounds like you will end up with several star models with some conformed dimensions tying them together. So you might have a case dimension as one of your conformed dimensions. The facts from each other table would be in fact tables which link both to the conformed dimension and any other dimensions appropriate to the facts, so for instance, if there is an employee id in case-open, that would link to an employee conformed dimension, from the case-open-fact table. This conformed dimension might be linked several times from several of your subsidiary fact tables.

Kimball's modeling method is fairly straightforward, and can be followed like a recipe. You need to start by identifying all your facts, grouping them into fact tables, identifying individual dimensions on each fact table and then grouping them as appropriate into dimension tables, and identifying the type of each dimension.

苦行僧 2024-07-19 00:37:45

与开发的任何其他方面一样,您必须从最终需求(如果愿意,可以称为“用户故事”)向后处理问题。 对于仓库来说,最保守的方法是简单地表示事务数据库的副本。 从这里开始,在需求的指导下,可以进行某些优化来增强某些数据访问模式的性能。 然而,我认为重要的是,将这些视为优化,而不是假设数据仓库自动必须是每个事实的每个可能维度的复杂爆炸。 我的经验是,对于大多数用途,直接表示对于 90% 以上的分析查询来说就足够了,甚至是理想的。 对于其余部分,首先考虑索引、索引视图、附加统计信息或可以在不影响结构的情况下进行的其他优化。 然后,如果需要聚合或其他冗余结构来提高性能,请考虑将它们分离到“数据集市”(至少在概念上),该数据集市提供原始事实及其冗余之间的分离。 最后,如果需求太不稳定并且聚合要求太高才能以这种方式有效运行,那么您可能会考虑数据的批发爆炸,即星型模式。 不过,再次将其限制为尽可能小的数据横截面。

Like any other facet of development, you must approach the problem from the end requirements ("user stories" if you will) backwards. The most conservative approach for a warehouse is to simply represent a copy of the transaction database. From there, guided by the requirements, certain optimizations can be made to enhance the performance of certain data access patterns. I believe it is important, however, to see these as optimizations and not assume that a data warehouse automatically must be a complex explosion of every possible dimension over every fact. My experience is that for most purposes, a straight representation is adequate or even ideal for 90+% of analytical queries. For the remainder, first consider indexes, indexed views, additional statistics, or other optimizations that can be made without affecting the structures. Then if aggregation or other redundant structures are needed to improve performance, consider separating these into a "data mart" (at least conceptually) which provides a separation between primitive facts and redundancies thereof. Finally, if the requirements are too fluid and the aggregation demands to heavy to efficiently function this way, then you might consider wholesale explosions of data i.e. star schema. Again though, limit this to the smallest cross section of the data as possible.

旧人九事 2024-07-19 00:37:45

这就是我基本上想到的。 感谢 NXC

事实事件

EventID
时间键
CaseID

暗淡事件

EventID
EventDesc

昏暗时间

TimeKey

昏暗区域

RegionID
区域描述

案例

CaseID
区域ID

Here's what I came up with essentially. Thx NXC

Fact Events

EventID
TimeKey
CaseID

Dim Events

EventID
EventDesc

Dim Time

TimeKey

Dim Regions

RegionID
RegionDesc

Cases

CaseID
RegionID

才能让你更想念 2024-07-19 00:37:45

这可能是在考虑问题之前就选择解决方案的情况。 并非所有数据仓库都适合星型模式模型。 我没有看到您在这里汇总任何数据。 到目前为止,我们有一个无事实的事实表和至少一个快速变化的维度(案例)。

看看到目前为止我所看到的,我认为这个数据库中的中心实体应该是这样的。 试图将事件固定在中间似乎并不正确。 尝试以不同的方式看待它。 也许,案件、事件、案件事件要开始了。

This may be a case of choosing a solution before you've considered the problem. Not all datawarehouses fit into the star schema model. I don't see that you are aggregating any data here. So far we have a factless fact table and at least one rapidly changing dimension (cases).

Looking at what I see so far I think the central entity in this database should be the case. Trying to stick the event at the middle doesn't seem right. Try looking at it a different way. Perhaps, case, events, and case events to start.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文