更新仓库事实表中的事实表(如果它代表流程)
我正在设计数据仓库结构和ETL流程。
我已经确定了“购买”事实表,它将保存用户所做的所有购买。
问题是,用户必须在这个过程中执行多项操作:
- 订阅时事通讯(无需注册,可选步骤)
- 注册他的帐户
- 购买……
在仪表板上显示订阅、注册、购买的用户数量非常重要一些东西和转化率。
我创建了以下事实表(那里还有其他内容,我跳过它们是为了更清楚):
subscription_date
account_id
purchase_date
我的想法是,如果有人订阅但没有注册,则仅填写 subscription_date ,其余的将被填写为空。如果他订阅并注册 - 则填写 subscription_date 和 account_id。如果他完成了整个过程,则所有三个字段都将被填写。
问题是,用户可以在第一天订阅,在第二天注册并在第三天购买。 我正在使用 AWSglue 来创建 ETL 流程。一切工作正常 - 它从多个数据库表(订阅、帐户、购买)中获取记录,将它们组合在一起并插入到购买表中。但我不知道如何处理更新。
发生的情况是,如果用户在第一天订阅,ETL 作业将运行并插入此记录。如果用户在第 2 天注册并在第 3 天进行购买,则该行不会更新。那里也没有添加新记录。
我可以想到解决此问题的两种解决方案:
将事实表拆分为三个不同的事实表,并在填充仪表板时将它们连接起来(我正在使用 Quicksight)。那么就不需要更新了。如果在源数据库中插入帐户的新购买,则只会使用正确的 account_id 插入。
尝试添加到 AWS Glue 作业,检查此购买的 account_id 是否已存在并更新记录
从事实表中删除已更新的所有记录,并使用所有数据重新插入它们。
I'm in the process of designing data warehouse structure and ETL process.
I've identified 'purchase' fact table, which would hold all the purchases done by the users.
Problem is, that there are multiple things in the process that user has to do:
- subscribe to newsletter (without registration, optional step)
- register his account
- purchase sth
It's important to show on the dashboard number of users who subscribed, who registered, who purchased something and conversion rates.
I've created following fact table (there are other things there, I skipped them to be more clear):
subscription_date
account_id
purchase_date
My idea was, that if someone subscribed, but didn't register, only subscription_date will be filled-in, rest of those will be null. If he subscribed and registered - subscription_date and account_id is filled in. If he did whole process, all three fields will be filled.
The problem is, that user can subscribe on day 1, register on day 2 and purchase on day 3.
I'm using AWS glue to create ETL process. Everything works fine - it's fetching records from multiple database tables (subscription, account, purchase), combine them together and insert into purchase table. I do not know how to handle updates though.
What happens is if user subscribes on day 1, ETL job runs and inserts this record. If on day 2 user registers and on day 3 makes a purchase, this row is not updated. No new records are added there either.
I can think of two solutions for this problem:
Split fact table into three different fact tables and join them when populating the dashboard (I'm using quicksight for that). Then no updates would be needed. If new purchase for account is inserted in source database, it will be just inserted with proper account_id.
Somehow try add to AWS Glue Job check if this account_id for purchase already exists and update the record
Delete all records from fact table that are updated and re-insert them with all data.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
最简单的解决方案是使用 Glue 将数据放入临时表中(仅插入),然后使用 SQL/存储过程将该数据更新插入到目标表中。
顺便说一句,事实表中的维度键永远不应该为空。因此,如果客户尚未注册,那么事实表的注册日期列应引用日期维度中的“未知日期”行
The easiest solution would be to use Glue to land your data in staging tables (insert only) and then use SQL/Stored Procedures to upsert that data to the target tables.
BTW dimensional keys in your fact tables should never be null. So if a customer has not yet registered then the fact table registered date column should reference the “unknown date” row in your date dimension