Data Vault 和维度建模之间的区别?
在对数据仓库进行建模时,是否有任何理由我们应该选择 Data Vault 而不是 维度建模?这两者之间的主要区别是什么?
When modeling a data warehouse, is there any reason we should favor Data Vault over Dimensional modelling? What are the major differences between these two?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
@Danny Shaw,这也是我的经验(尽管我在这个领域相对较新 - 来自 ETL,所以很好奇其他人对我的帖子的意见)。
我认为重要的是,尊重客户的需求随着他们的“成熟”而变化,并且不同的模型可能在不同的时间更适合。
我的感觉是,Data Vault 提供了操作灵活性,而现有的讨论(Kimball/Inmon)更多地围绕“业务灵活性”(由于缺乏更好的术语)。
Data Vault 允许您在粒度对象方面保持接近源的状态。这使得模型“可审计”且可扩展。它有助于提高 SOURCE 规范的灵活性。
因此,它是一个有用的中间工具,例如迁移项目,可以作为一个基础,为更多需要新旧集成视图的面向业务的 DWH/数据集市提供支持。然而,我的经验是,如果您开始直接从该模型填充数据集市,您最终会得到大量联接,尤其是递归,这仅仅是因为您远离业务概念。对于某些数据库来说并非完全糟糕,因此选择部分受到软件的影响(例如,Teradata 比 Oracle 更喜欢加入)。然而,总的来说,我的感觉是,如果您需要目标(业务)方面的灵活性,您最终会陷入 inmon-kimball 讨论,并且在该方面考虑维度建模而不是数据仓库并不是一个坏的开始。
因此,您的评估输入的一部分还应该是:业务概念的标准化程度如何?整个公司是否使用相同的 KPI 和数据概念?如果情况并非如此,那么对我来说,靠近数据仓库中某处的源(尤其是有很多源时)似乎是一个安全的选择。如果更成熟,请为报告需求的更大灵活性做好准备 - 并将数据模型的性能转移到报告方面。
这并不是说业务不能发展——而是说它必须作为一个整体来发展。我认为这是一个更“成熟”的客户,知道可以用他们的数据做什么,对他们的业务有一个非常集成和标准化的视图,在报告方面有越来越复杂的要求。因此,如果您需要为数据集市提供灵活性而进行建模,并且您拥有强大的 ETL 工具集,那么您不妨直接设置数据模型以使其更像业务。
总而言之,我认为,随着 BI 环境变得更加“成熟”,企业已经了解了可以利用数据做什么,而这方面的需求也变得更加复杂。 Data Vault 不会是这方面的出路。
但是,如果您正在进行迁移(尤其是长达数年的并行阶段),或者在一个年轻的组织中,并非所有部门都以同样的眼光看待他们的业务,但(对您有利)报告要求相当可监督,那么可以选择预先使用数据仓库,并尝试看看是否可以直接从中提供数据集市 - 可能会在两者之间添加 Kimball 维度的味道。
@Danny Shaw this is also my experience (though I am relatively new in this field - coming from ETL, so curious to input from others on my post).
I believe that it is important, to respect that your Clients' demands are evolving with their 'maturity', and that different models may fit better at different times.
My feeling is that Data Vault delivers operational flexibility, whereas existing discussion (Kimball/Inmon) revolves more around 'business flexibility' (for lack of better terminology).
Data Vault allows you to stay close to the source in terms of its granular objects. This makes the model 'auditable' and scalable. It helps with flexibility on SOURCE specifications.
Therefore it is a useful in-between in e.g. migration projects, serving as a base from where to feed more business-oriented DWH/Datamarts that require an integrated view of both old and new. My experience however is that if you start populating Datamarts directly from this model you end up with lots of joins and especially also recursions simply because you are far from the business concepts. Not entirely bad on certain databases so the choice is partly influenced by the software (e.g. Teradata likes joining much more so than Oracle). However generally my feeling is that if you need flexibility in TARGET (business) side, you end up in the inmon-kimball discussion and it would not be a bad start to consider dimensional modeling instead of data vault on that side.
So part of the input in your evaluation should also be: how standardised are business concepts? Is the whole company using the same KPIs and Data concepts? If this is not the case, staying close to the source (especially if there are many) somewhere in your data warehouse seems like a safe bet to me. If more mature, prepare for more flexibility in reporting demands - and shift the performance of your datamodel to the reporting side.
This is not to say that business cannot be evolving - just that it has to be evolving as a whole. I consider this a more 'mature' customer, that knows what it can do with their data, has a very integrated and standardised view on their business, with more and more complex requirements in terms of reporting. So if you need to model for flexibility in feeding datamarts, AND you have a strong ETL toolset, you might as well directly set up your datamodel to resemble business a bit more.
To summarise, I would argue that as the BI environment becomes more 'mature', business has learnt what it can do with the data and the demands on that side become more complex. Data Vault would not be the way to go on that side.
However if you are in a migration (especially with years-long parallel phases), or in a younger organisation where not all departments look at their business through the same eyes, but (in your advantage) the reports requirements are rather overseeable, it would be an option to use data vault up front and try to see if you can feed your datamarts directly from that - possibly adding a taste of Kimball's dimensions somewhere in between.
选择任何方法通常需要平衡经验和意见与系统的需求和要求。每种建模方法在与不同情况相关时都有一定的优势,因此在确定采用哪种方法时,您必须评估模型将与之交互的环境。
频繁且统一地添加数据的高度事务性系统通常适合维度建模方法。用于描述它的常见示例通常集中于零售和金融组织,因为随着时间的推移增加的销售或货币交易数量适合事实和维度概念。
Favouring any approach is usually a matter of balancing experience and opinion with the needs and requirements for the system. Each modelling approach has certain advantages when related to different situations, so you must evaluate the environment your model will interact with when figuring out which approach to take.
Highly transactional systems that add data frequently and uniformly usually suit a dimensional modelling approach. Common examples used to describe it normally focus on Retail and Financial organisations, as the number of sales or monetary transactions being added over time suits the Fact and Dimension concepts.
为什么你觉得你需要其中任何一个?它们大多是用于销售书籍和培训课程的大量行话设计模式。数百万人发现,即使没有它们,他们也能过得很好。设计数据仓库真正需要的是任何数据库所需的良好分析和建模技能。
如果您正在寻求有关构建数据仓库的有用建议,请查看 Bill Inmon 的书籍。如果这是您的第一个商业智能项目,那么请向具有该领域经验的人寻求帮助,这样您就可以避免一些常见的陷阱。
Why do you feel you need either of them? They are mostly jargon-heavy design patterns used to sell books and training courses. Millions of people find they can get on just fine without them. What you really need to design a data warehouse is the same good analysis and modelling skills you need for any database.
If you are seeking useful advice on building a data warehouse then check out Bill Inmon's books. If this is your first Business Intelligence project then get some help from someone with experience in the field so that you can avoid some of the common pitfalls.
我认为维度建模仍然是分析和分析的最佳实践。报告并作为业务用户最容易理解的可见模型。
Data Vault 更适合大型企业数据仓库,也是 Bill Inmon 推荐的,但不太适合分析和存储。报告,为此您可能仍然需要维度建模来创建“虚拟”数据集市。重点关注一些博客,例如 Martijn Evers、Hennie de Nooijer 或 Ronald Damhof 的博客。
Data Vault 更加灵活,更容易添加新来源,更易于审核,并且始终保留所有数据,因此您将能够始终重新创建 DM。
因此,结论可能是,理想的情况是使用 Data Vault 作为企业数据仓库,使用维度建模作为数据集市。
Dimensional modelling is in my opinion still the best practise for analysis & reporting and as a visible model best understand by business users.
Data Vault is more suitable for large Enterprise Data Warehousing, also recommended by Bill Inmon, but not that suitable for analysis & reporting, for that you still might need dimensional modelling for creating you "virtual" Data marts. Take a peak at some blogs like those from Martijn Evers, Hennie de Nooijer or Ronald Damhof.
Data Vault is more flexible, easier to add new sources, more audit able and keeps all data all the time so you will be able to always recreate you DM's.
So a conclusion might be that the ideal situation is to use Data Vault for your Enterprise Data Warehouse and Dimensional Modelling for you Datamarts.
我认为两者的结合最适合大多数大型组织。
Vault 对于中型企业 ODS 来说是一个不错的选择,其中较少的结构将提高灵活性和性能。然后可以从 Vault Db 中提取数据,以提供支持报告和分析的上下文特定维度数据集市。
在这种情况下,Vault Db 还可以用于支持更多大数据类型的挖掘和分析,这些挖掘和分析需要对数据关系有更成熟的理解。
I think a combination of the two would best serve most large organizations.
Vault would be a good choice for an intermediate enterprise ODS where less structure would facilitate flexibility and performance. Data can then be pulled from the Vault Db to feed context specific dimensional data marts that support reporting and analysis.
In that scenario the vault Db can also be used to support more big-data types of mining and analysis that require a more mature understanding of data relationships.