合并来自不同来源的事实? 还是单独加载?
我们有两个不同来源的数据:一些来自客户,一些来自不同的供应商。 目前,我们将这些数据物理“合并”到一个庞大的表中,该表有近百列、数万行,并且两个维度没有正式分离。 因此,我们实际上不能充分利用这个表。
我将把这个烂摊子重新设计成一个合适的、但很小的星型模式。
两个维度是显而易见的。 例如,其中之一就是时间。
客户提供的数据提供了许多事实值。 每个供应商可能(或可能不)提供适合相同维度的附加事实值。
该事实数据都具有相同的粒度。 之所以称之为“稀疏”,是因为我们并不经常从所有供应商那里获得信息。
这是我的困境。
这是一个事实表(带有一些空值)是从不同的来源填充的吗?
或者这是n+1 事实表——一个由客户填充,其他由每个供应商填充?
每种设计都有优点和缺点。 我需要一些关于“合并”或“单独加载”之间选择的第二意见。
客户提供收入、成本、数量、重量以及他们了解的有关交易结束的其他信息。
供应商一提供一些有关某些交易的额外详细信息——重量、成本、持续时间。 其他交易对第一供应商没有任何价值。
供应商二提供一些有关某些交易的额外详细信息——数量、持续时间、长度、外币汇率。 其他交易对于供应商二来说没有任何价值。
有些交易会有两个供应商。 少数交易不会有任何供应商。
一张表有空值? 三张桌子?
We've got data with two different origins: some comes from a customer, some comes from different vendors. Currently, we physically "merge" this data into a massive table with almost a hundred columns, tens of thousands of rows and no formal separation of the two dimensions. Consequently, we can't actually use this table for much.
I'm going to redesign this mess into a proper, but small, star schema.
The two dimensions are obvious. One of them, for example, is time.
The customer-supplied data provides a number of fact values. Each vendor may (or may not) provide additional fact values that fit the same dimensions.
This fact data all has the same granularity. It can be called "sparse" because we don't often get information from all vendors.
Here's my dilemma.
Is this one fact table -- with some nulls -- populated from different sources?
Or is this n+1 fact tables -- one populated from the customer, the others populated from each vendor?
There are pros and cons to each design. I need some second opinions on the choice between "merge" or "load separately".
Customer supplies revenue, cost, counts, weights, and other things they know about their end of a transaction.
Vendor one supplies some additional details about some of the transactions -- weights, costs, durations. The other transactions will have no value from vendor one.
Vendor two supplies some additional details about some of the transactions -- volumes, durations, lengths, foreign currency rates. The other transactions will have no value for vendor two.
Some transactions will have both vendors. A few transactions will have neither vendor.
One table with nulls? Three tables?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
我会选择单一事实表。 这种方法的优点是它将所有艰苦的工作留在加载时而不是查询时。
I'd go for the single fact table. The highlight pro of this approach is that it leaves all the hard work at load time rather than at query time.
根据您的描述,听起来单个事实表是可行的方法。
听起来事实表的粒度是时间 x 交易 x 客户(?)。
我之前的问题实际上是想找出某些供应商数据是否适合其自己的维度。 我将把它留给你来决定。 但听起来确实不像。
空事实可能会在聚合期间引发警告(取决于平台),但用可能误导性的零填充它们的替代方案会更糟糕。
From what you describe, it sounds like a single fact table is the way to go.
It sounds like the fact table would have a grain of time x transaction x customer(?).
My prior question was really trying to find out if some of the vendor data was a candidate for its own dimension. I'll leave it to you to determine that. but it doesn't really sound like it.
Null facts can throw warnings during aggregations (depending on the platform) but the alternative of populating them with possibly misleading zeros is worse.
我相信,由于两个来源共享相同的粒度,因此答案是您应该有一个事实表。 考虑一下您希望最终用户如何与信息交互。 如果这是有意义的,并且业务报告将受益于这些数据的共存,那么这就是您的答案。 但请尽量避免事实表中出现空值。 如果您可以输入零(并且零对于数据有意义,即温度),那么就这样做。 它将为您的用户节省一些混乱,并且正如 TrickyNixon 指出的那样会导致聚合问题。
实际上,您在“棕地”应用程序上处于一个很好的位置。 您可以看看当今存在的东西,并利用经验来创建更好的设计。 这是选择最好的颗粒的最重要时刻,希望这些颗粒在 DW 的生命周期内不会发生变化。
I believe that since both sources share the same grain the answer is that you should have one fact table. Think about how you want your end-users to interact with the information. If it makes sense and the business reports will benefit from those data being co-located then that is your answer. Try though to avoid nulls in your fact tables. If you can enter a zero (and the zero makes sense for the data, i.e., think temperature) then do that. It will save your users some confusion and as TrickyNixon pointed out will cause aggregation issues.
Actually you're at a great point here on the 'brownfield' application. You can look at what exists today and leverage experience to create a better design. This is the most important time to select the best grain that will hopefully not change for the life of the DW.