Duplicate dimension rows in a data warehouse
We're starting to load a data warehouse with data from event logs. We have a normal star schema where a row in the fact table represents one event. Our dimension tables are a typical combination of user_agent, ip, referal, page, etc. One dimension table looks like this:
create table referal_dim(
    id integer,
    domain varchar(255),
    subdomain varchar(255),
    page_name varchar(4096),
    query_string varchar(4096),
    path varchar(4096)
);
We autogenerate the id, which is eventually used to join against the fact table. My question is: what's the best way to identify duplicate records in our bulk load process? We upload all the records for a log file into temp tables before doing the actual insert into the persistent store. However, the id is just auto-incremented, so two identical dim records from two days would have different ids. Would it be appropriate to create a hash of the value columns and compare on that? Comparing on each value column seems like it would be slow. Are there any best practices for a situation like this?
Comments (1)
An auto-increment integer for a surrogate PK is OK, but (according to Mr. Kimball) a dimension table should also have a natural key. So a hash NaturalKey column would be in order. A Status column marking each row as "current" or "expired" may also be useful to allow for SCD type 2.
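To make the hashed natural key idea concrete, here is a minimal sketch, not a definitive implementation. It assumes SQLite (via Python's standard sqlite3 module) as a stand-in for the real warehouse, SHA-256 as the hash, and a unique NaturalKey column added to referal_dim; the helper names make_db, natural_key, load_batch and lookup_id are hypothetical. INSERT OR IGNORE is SQLite syntax; other databases would need their own equivalent, such as MERGE or an anti-join against the staging table.

```python
import hashlib
import sqlite3

# Value columns of referal_dim that together identify a dimension row.
VALUE_COLUMNS = ("domain", "subdomain", "page_name", "query_string", "path")


def make_db():
    """Create an in-memory referal_dim with a unique hashed NaturalKey (assumed column)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE referal_dim (
            id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
            NaturalKey CHAR(64) NOT NULL UNIQUE,   -- SHA-256 hex of the value columns
            domain VARCHAR(255),
            subdomain VARCHAR(255),
            page_name VARCHAR(4096),
            query_string VARCHAR(4096),
            path VARCHAR(4096)
        )
        """
    )
    return conn


def natural_key(row):
    """Hash the value columns of one row (a dict) into a fixed-width key.

    Each field is length-prefixed so that ("ab", "c") and ("a", "bc") hash
    differently, and NULL (None) stays distinct from the empty string.
    """
    h = hashlib.sha256()
    for col in VALUE_COLUMNS:
        value = row[col]
        if value is None:
            h.update(b"N;")
        else:
            data = value.encode("utf-8")
            h.update(b"S%d;" % len(data))
            h.update(data)
    return h.hexdigest()


def load_batch(conn, rows):
    """Insert a batch; rows whose NaturalKey already exists are skipped."""
    staged = [(natural_key(r), *(r[c] for c in VALUE_COLUMNS)) for r in rows]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO referal_dim "
            "(NaturalKey, domain, subdomain, page_name, query_string, path) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            staged,
        )


def lookup_id(conn, row):
    """Return the surrogate id for a row, for use when loading the fact table."""
    found = conn.execute(
        "SELECT id FROM referal_dim WHERE NaturalKey = ?", (natural_key(row),)
    ).fetchone()
    return found[0] if found else None
```

With this layout, duplicate detection is a single indexed comparison on a 64-character column instead of a comparison across five wide varchar columns, and an identical referrer seen on two different days resolves to the same surrogate id.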