仓库增量负载;更新维度表
我试图弄清楚如何在数据进入我们的系统时增量加载我的事实和维度表。
有没有比以下更简单的方法:
- dim_id = select id from dim_table where dim_table.value='dim value';
- 如果行数 == 0 ->插入到dim_table ...
- 插入到fact(dim,measure)values(dim_id,23131)
如果我有10个维度加载会变得相当麻烦
I'm trying to figure out how to incrementally load my fact and dimension tables as data come into our system.
Is there any simpler way than:
- dim_id = select id from dim_table where dim_table.value='dim value';
- if rowcount == 0 -> insert into dim_table ...
- insert into fact(dim, measure) values(dim_id, 23131)
if i got 10 dimensions the loading gets rather cumbersome
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您真的需要它们是增量的吗?不能使用UUID吗?
我不明白为什么你需要一个 dim_table。
如果您正在使用星型架构,则可以通过以下方式使其发挥作用。
当您填写维度表时,会为您的维度生成一个 uuid。
当您填写事实表时,请查找维度表,以便获取与该值关联的 uuid。
编辑:我认为这可能对您有帮助 http://dev .mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
Do you really need them to be incremental ? Can't you use a UUID ?
I do not understand why you need a dim_table.
If you are doing a Star Schema, this is how you can make it work
When you fill up your dimensions tables generate a uuid for your dimensions.
When you fille up your fact table, lookup into your dimension tables so you can get the uuid associated with this value.
Edit : I think that might help you http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
在标准星型模式中,维度在事实之前加载。
因此,如果存在批量运行的 ETL 流程,最简单的解决方案是尝试更频繁地运行批量。如果您确实需要实时 DW 加载,则必须以某种方式延迟事实,以确保首先加载维度。
当谈到现成的解决方案时,您可能会考虑流事件处理。有多种商业工具,但 EsperTech 是开源的。该系统可以设置为连续运行的 ETL 转换。想象一个持续运行的 Kettle(Talend、SSIS、..)转换。
In a standard star schema, dimensions are loaded before facts.
So, if there is an ETL process in place which runs in batches, the simplest solution is to try to run the batch more often. If you do need a real-time DW loading, the facts have to be delayed somehow, to make sure that dimensions are loaded first.
When it comes to off-the shelf solutions, you may look into stream-event processing. There are several commercial tools out there, but EsperTech is open source. The system can be set-up as an ETL transformation that runs continuously. Think of an Kettle (Talend, SSIS, ..) transformation which keeps running.