当所有维度值都具有 100% 重要性时处理多对多维度
我至少会尽力保持简洁。
假设我们正在跟踪一段时间内的账户余额。因此,我们的事实表将包含以下列:
帐户余额事实表
- (FK)AccountID
- (FK)DateID
- ...
- 余额
- ...
显然您有一个帐户维度表 strong> 和日期维度表。所以现在我们可以轻松地过滤帐户或日期(或日期范围等)。
但更重要的是...帐户可以属于组——在给定日期可以属于任意数量的组。组只是逻辑抽象,除了报告目的之外没有任何实际意义。帐户属于 0、1 或 17 组不会以任何方式影响其余额。例如,帐户 ID 1 可能位于组 38、76、104 和 159 中。帐户 2 可能位于组 1 中(其组描述为“未分组”。帐户 3 可能位于 17 个组中(真实示例)
。额外的好处是,我们的用户完全不懂 SQL,他们没有关系数据库的经验,并且过去一直在复杂的 Excel 解决方案中完成所有工作,现在我们正在构建他们的维度模型。可以使用 PowerPivot 进行切片和过滤,尽管这些帐户组可能会将原本极其简单的星型模式变成足够复杂的内容,以至于用户会犹豫并返回到他们当前的意大利面条解决方案。
所以让我们看看我们的选择...
布尔法 布尔方法不可行。我们有大约 570,000 个不同的帐户,但更重要的是,有 26,000 个不同的组。对于最终用户来说,这也是一个难以过滤的问题,因为他们是非技术人员,并且依靠非常简单的工具来完成此操作。
多列法 理论上这可行,但是,我们确实有一些属于 17 个组的帐户。同样,这些组实际上只是逻辑组 - 它们没有任何意义,但出于报告目的,业务需要它们。让最终用户从 17 个不同的列中筛选出组不会很好地获得用户接受,并且可能会导致用户拒绝使用该解决方案(我不会责怪他们)。
桥台 这个计数是有效的,但我们确实有 26,000 个不同的组。我不认为这对用户友好。
由于我不喜欢我的选择,我只能假设除了雪花之外还有更好的方法......除非雪花是唯一的方法。如果有人可以伸出援手并解释他们的理由,我们将不胜感激。
更新:为了澄清起见,我认为这里每个人都可以联系到的一个例子是想象您可以在简历上列出关键字技能。它们都与同一个人有关,但您可以拥有任意数量的技能。这些技能不会影响简历上的任何单独衡量标准 - 即“C++”并不比“C#”更有价值 - 您不能将所有简历/技能组合放入事实表中,否则您会结束重复计算(或远远超过两倍;))。
我认为我在这里能做的最好的事情就是为团体创建一个支腿桌子。我不喜欢它,但我认为这是我唯一真正的选择。
所以现在我们有...
帐户余额事实表
- (FK)AccountID
- (FK)DateID
- ...
- 余额
- ...
帐户维度
- (PK)AccountID
- 帐户名称
- .. ( FK
- )账户组键
账户组支腿
- (PK)AccountGroupID
- (PK)AccountID)
- 账户组名称
I'll at least try to keep this succinct.
Let's suppose we're tracking the balances of accounts over time. So our fact table will have columns such as...
Account Balance Fact Table
- (FK)AccountID
- (FK)DateID
- ...
- Balance
- ...
Obviously you have an Account Dimension Table and a Date Dimension Table. So now we can easily filter on Accounts or Dates (or date ranges, etc.).
But here's the kicker... Accounts can belong to Groups -- any number of Groups at a given Date. Groups are simply logical abstractions, and they have no tangible meaning aside from reporting purposes. An Account being in 0, 1, or 17 groups doesn't affect its Balance in any way. For example, AccountID 1 may be in Groups 38, 76, 104, and 159. Account 2 may be in Group 1 (which has a Group Description of "Ungrouped". Account 3 may be in seventeen groups (real example).
As an added bonus, our users are completely non-technical. They don't know SQL, they have no experience with relational databases, and have historically done all of their work in a convoluted Excel solution. Right now we're building a dimensional model they can slice and filter with PowerPivot, though these Account Groups are threatening to turn an otherwise ruthlessly simple Star Schema into something complex enough that the users will balk and return to their current spaghetti solution.
So let's look at our options...
Boolean Method
The Boolean method is not feasible. We have about 570,000 different accounts, but more importantly, 26,000 different groups. This would also be a devil for end-users to filter, since they're non-technical and are relying on very simple tools to get this done.
Multiple Column Method
In theory this could work, however, we do have some accounts that belong to 17 groups. Again, the groups are really just logical groups -- they have no meaning, but they are required by the business for reporting purposes. Having end-users filter out groups from 17 different columns isn't going to go over well in user-acceptance, and would likely result in users refusing to use the solution (and I don't blame them).
Bridge Table
This count work, but we do have 26,000 different groups. I'm not finding this to be user-friendly.
Since I'm not liking my options, I can only assume there's a better way other than snowflaking... unless snowflaking IS the only way. If someone could lend a hand and explain their rationale it'd be appreciated.
UPDATE: For clarification, an example I think everyone here can relate to is imagine you can list keyword skills on a resume. They all relate to the same person, but you can have any number of skills. The skills don't effect any of individual measures on a resume -- i.e. 'C++' isn't more valuable than 'C#' -- you can't put all the resume/skill combinations in the fact table or you'd end up double counting (or well more than double ;) ).
I think the best I'm going to be able to do here is to create an outrigger table for groups. I'm not a fan of it, but I think it's the only real option I have.
So now we have...
Account Balance Fact Table
- (FK)AccountID
- (FK)DateID
- ...
- Balance
- ...
Account Dimension
- (PK)AccountID
- Account Name
- ...
- (FK)Account Group Key
Account Group Outrigger
- (PK)AccountGroupID
- (PK)AccountID)
- Account Group Name
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我想说你必须从界面开始。在理想的世界中,用户希望如何进行过滤?
我想我最终会选择一个桥或无事实的事实表或类似的东西。也许是事实表上的代理键和从该事实表到组成员身份的多对链接表。
这绝对是困难的 - 并且界面和用例必须可行,所以我会从那里开始。也许有些东西会改变他们做报告的方式——比如组中的等价类或者他们划分帐户空间的某种方式。也许这些组有一个层次结构或组织,这使得它更易于管理,并且可以为更简单的设计提供信息。
I would say you've got to start from the interface. How would users like to do their filtering in an ideal world?
I think I would end up going for a bridge or factless fact table or something like that. Perhaps a surrogate key on the fact table and a many-many link table from that to group membership.
It's definitely tough - and the interface and usage cases has to be made workable, so I'd start from there. Perhaps something will shake out of how they do this reporting - like equivalence classes in the groups or some way they partition the account space. Maybe there is a hierarchy or organization to the groups which make it more manageable and may inform a simpler design.
如果我正确理解你的问题,这应该没问题:
If I correctly understood your question this should be okay: