ETL Operation - Return Primary Key

Posted 2024-08-26 19:44:02, 697 words, 18 views, 0 comments

I am using Talend to populate a data warehouse. My job writes customer data to a dimension table and transaction data to the fact table. The surrogate key (p_key) on the fact table is auto-incrementing. When I insert a new customer, I need my fact table to reflect the ID of the related customer.

As I mentioned, my p_key is auto-incrementing, so I can't just insert an arbitrary value for it.

Any thoughts on how I can insert a row into my dimension table and still retrieve its primary key to reference in my fact record?

More info:

What if the incoming data isn't normalized? For instance, I have a CSV with the following data:

order #   date        total customer# first_name last_name
111       1/2/2010    500    101      John        Smith     
222       1/3/2010    600    101      John        Smith

Obviously, I want the customer info to appear in the dimension table, and the transaction data in the fact table:

dimension
101  john smith

fact
111       1/2/2010
222       1/3/2010

As you mentioned, the key of the dimension table will be auto-incrementing. The fact table needs to reference this key. How do you design the ETL job so that the surrogate key is returned after an insert?

Also, if the customer data is deduped (as above), how do you handle the keys?

Comments (1)

凉墨 2024-09-02 19:44:02

I may have misunderstood your problem, however:

  1. A fact table may or may not have an auto-incrementing PK; usually, the PK of a fact table is a composite of several FKs referencing dimension tables.

  2. A dimension table should have an auto-incrementing PK.

  3. A new customer should "land" into the customer dimension table before the transaction fact reaches the DW (or at least the fact table).

  4. A dimension table should have a BusinessKey which uniquely identifies a customer -- like email, full name + pin, or similar.

  5. An incoming transaction row should have the customer BusinessKey field too -- that's how we identify the customer.

  6. Use the BusinessKey to look up the customer PrimaryKey in the customer dimension table before inserting the transaction into the fact table.
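Points 4 to 6 can be sketched in a few lines outside of Talend. This is only an illustration, assuming Python with an in-memory SQLite database; the table layout and the helper `get_or_create_customer_key` are made up for the example and are not Talend components. It looks up the surrogate key by business key and, if the customer is new, inserts the row and reads the generated key back:

```python
import sqlite3

# Hypothetical schema for illustration; column names follow the answer's naming.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dimCustomer (
        CustomerKey         INTEGER PRIMARY KEY AUTOINCREMENT,
        CustomerBusinessKey TEXT NOT NULL UNIQUE,
        CustomerFirstName   TEXT,
        CustomerLastName    TEXT
    )
    """
)


def get_or_create_customer_key(conn, business_key, first_name, last_name):
    """Return the surrogate key for business_key, inserting the row if needed."""
    row = conn.execute(
        "SELECT CustomerKey FROM dimCustomer WHERE CustomerBusinessKey = ?",
        (business_key,),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO dimCustomer"
        " (CustomerBusinessKey, CustomerFirstName, CustomerLastName)"
        " VALUES (?, ?, ?)",
        (business_key, first_name, last_name),
    )
    # lastrowid is the auto-generated key of the row that was just inserted.
    return cur.lastrowid


# The same customer arriving twice resolves to the same surrogate key.
first = get_or_create_customer_key(conn, "john_smith_101", "John", "Smith")
second = get_or_create_customer_key(conn, "john_smith_101", "John", "Smith")
```

Most databases offer an equivalent way to read back a generated key, but the exact mechanism differs per engine, so treat `lastrowid` here as specific to this sketch.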

EDIT

If your new customer data is bundled with the transaction, find a way to extract customer data and route it to the DW ahead of the transaction.
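As a rough sketch of that extraction step, assuming the file from the question is comma-separated and that the business key is built from name plus customer number (both are assumptions, as are the column names and the helper `split_rows`), the bundled rows can be split into a deduplicated customer stream and a transaction stream:

```python
import csv
import io

# Comma-separated version of the sample data from the question (assumed layout).
RAW = """order_no,date,total,customer_no,first_name,last_name
111,1/2/2010,500,101,John,Smith
222,1/3/2010,600,101,John,Smith
"""


def split_rows(text):
    """Split denormalized rows into unique customers and plain transactions."""
    customers = {}      # keyed by business key, so repeats collapse to one entry
    transactions = []
    for row in csv.DictReader(io.StringIO(text)):
        business_key = "_".join(
            [row["first_name"], row["last_name"], row["customer_no"]]
        ).lower()
        customers[business_key] = {
            "first_name": row["first_name"],
            "last_name": row["last_name"],
        }
        transactions.append(
            {
                "order_no": int(row["order_no"]),
                "date": row["date"],
                "total": int(row["total"]),
                "business_key": business_key,
            }
        )
    return customers, transactions


customers, transactions = split_rows(RAW)
```

The customer stream goes to the dimension load first; the transaction stream keeps only the business key, which is resolved to the surrogate key later.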

UPDATE:

Load dimCustomer first, decide on BusinessKey -- so the dimension would look like:

CustomerKey = 12345 (auto-incremented)
CustomerBusinessKey = john_smith_101 (must uniquely identify this John Smith)
CustomerFirstName = John
CustomerLastName = Smith

During the dimension loading process, you have to segregate incoming rows into two streams: existing and new customers. Rows from the "existing customer" stream update the dim table (type 1 SCD), while rows from the "new customer" stream are inserted. There should be no duplicates in the stream of rows that are being inserted; you can accomplish this by inserting them into a staging table and removing duplicates there, just before the final insert into the dimension table. You can also extract duplicates and route them back into the loading process to update customer records; they may contain newer data -- like updated phone numbers or similar.
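A minimal sketch of that two-stream load, assuming Python with SQLite and a hypothetical staging table `stgCustomer` (the table names, the sample rows, and the "last staged row wins" rule are all assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dimCustomer (
        CustomerKey         INTEGER PRIMARY KEY AUTOINCREMENT,
        CustomerBusinessKey TEXT NOT NULL UNIQUE,
        CustomerFirstName   TEXT,
        CustomerLastName    TEXT
    );
    CREATE TABLE stgCustomer (
        CustomerBusinessKey TEXT NOT NULL,
        CustomerFirstName   TEXT,
        CustomerLastName    TEXT
    );
    -- An existing customer whose last name is out of date.
    INSERT INTO dimCustomer (CustomerBusinessKey, CustomerFirstName, CustomerLastName)
    VALUES ('jane_doe_100', 'Jane', 'Do');
    """
)

incoming = [
    ("john_smith_101", "John", "Smith"),  # new customer
    ("john_smith_101", "John", "Smith"),  # duplicate of the new customer
    ("jane_doe_100", "Jane", "Doe"),      # existing customer, corrected name
]
conn.executemany("INSERT INTO stgCustomer VALUES (?, ?, ?)", incoming)

# "Existing customer" stream: type 1 SCD, overwrite with the latest staged row.
conn.execute(
    """
    UPDATE dimCustomer
    SET CustomerFirstName = (
            SELECT s.CustomerFirstName FROM stgCustomer s
            WHERE s.CustomerBusinessKey = dimCustomer.CustomerBusinessKey
            ORDER BY s.rowid DESC LIMIT 1),
        CustomerLastName = (
            SELECT s.CustomerLastName FROM stgCustomer s
            WHERE s.CustomerBusinessKey = dimCustomer.CustomerBusinessKey
            ORDER BY s.rowid DESC LIMIT 1)
    WHERE CustomerBusinessKey IN (SELECT CustomerBusinessKey FROM stgCustomer)
    """
)

# "New customer" stream: one row per business key, only keys not yet in the dim.
conn.execute(
    """
    INSERT INTO dimCustomer (CustomerBusinessKey, CustomerFirstName, CustomerLastName)
    SELECT s.CustomerBusinessKey, s.CustomerFirstName, s.CustomerLastName
    FROM stgCustomer s
    WHERE s.rowid IN (SELECT MAX(rowid) FROM stgCustomer GROUP BY CustomerBusinessKey)
      AND s.CustomerBusinessKey NOT IN (SELECT CustomerBusinessKey FROM dimCustomer)
    """
)
conn.commit()

dim_rows = conn.execute(
    "SELECT CustomerBusinessKey, CustomerFirstName, CustomerLastName"
    " FROM dimCustomer ORDER BY CustomerKey"
).fetchall()
```

Running the update before the insert keeps the two streams separate: rows inserted in the second step are never touched by the first.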

Once the customer is in, load facts.

The fact table should look something like:

DateKey     (PK)
CustomerKey
OrderNumber (PK)
Total

I have used a composite primary key of the DateKey and the OrderNumber, allowing for the order number sequence to reset from time to time.

During the loading process, modify the fact record to look something like:

DateKey CustomerBusinessKey OrderNumber Total
20100201  john_smith_101       111       500
20100301  john_smith_101       222       600

At this point we need to replace the CustomerBusinessKey with the CustomerKey from the dimension table using a lookup. So, after the lookup the stream would look like:

DateKey CustomerKey OrderNumber Total
20100201 12345       111         500
20100301 12345       222         600

This can now be inserted into the fact table.
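The lookup and insert might look roughly like this in Python with SQLite. The fact table name `factOrder` is an assumption (the answer does not name it), and the dimension row is seeded with the key 12345 only to match the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dimCustomer (
        CustomerKey         INTEGER PRIMARY KEY AUTOINCREMENT,
        CustomerBusinessKey TEXT NOT NULL UNIQUE
    );
    CREATE TABLE factOrder (
        DateKey     INTEGER NOT NULL,
        CustomerKey INTEGER NOT NULL,
        OrderNumber INTEGER NOT NULL,
        Total       INTEGER NOT NULL,
        PRIMARY KEY (DateKey, OrderNumber)
    );
    INSERT INTO dimCustomer (CustomerKey, CustomerBusinessKey)
    VALUES (12345, 'john_smith_101');
    """
)

# Incoming fact stream, still carrying the business key.
stream = [
    (20100201, "john_smith_101", 111, 500),
    (20100301, "john_smith_101", 222, 600),
]

# Read the dimension once and keep the business key -> surrogate key map in memory.
key_by_business_key = dict(
    conn.execute("SELECT CustomerBusinessKey, CustomerKey FROM dimCustomer")
)

# Replace the business key with the surrogate key. An unknown business key
# raises KeyError here, which means the dimension was not loaded first.
resolved = [
    (date_key, key_by_business_key[business_key], order_no, total)
    for date_key, business_key, order_no, total in stream
]

conn.executemany(
    "INSERT INTO factOrder (DateKey, CustomerKey, OrderNumber, Total)"
    " VALUES (?, ?, ?, ?)",
    resolved,
)
conn.commit()

fact_rows = conn.execute(
    "SELECT DateKey, CustomerKey, OrderNumber, Total FROM factOrder ORDER BY DateKey"
).fetchall()
```

Caching the whole map in memory works for a small dimension; for a large one, a join against the staging data in the database is probably the better choice.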

I have also cheated a bit -- did not look up a date key from the dimDate, and did not look for existing rows in the fact table. When loading the fact table, you can look for an existing (DateKey, OrderNumber) combination before loading, or you can leave it up to the primary key to protect against duplicates -- your choice. In any case, make sure that an attempt to re-load the same data into the fact table fails.
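To illustrate the "leave it up to the primary key" option, here is a small sketch, again assuming Python with SQLite and a hypothetical `factOrder` table. The helper `load_facts` is made up for the example; it loads a batch in one transaction and reports failure if any row already exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE factOrder (
        DateKey     INTEGER NOT NULL,
        CustomerKey INTEGER NOT NULL,
        OrderNumber INTEGER NOT NULL,
        Total       INTEGER NOT NULL,
        PRIMARY KEY (DateKey, OrderNumber)
    )
    """
)

INSERT_SQL = (
    "INSERT INTO factOrder (DateKey, CustomerKey, OrderNumber, Total)"
    " VALUES (?, ?, ?, ?)"
)
rows = [(20100201, 12345, 111, 500), (20100301, 12345, 222, 600)]


def load_facts(conn, rows):
    """Insert the batch in one transaction; return False if a duplicate is hit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(INSERT_SQL, rows)
        return True
    except sqlite3.IntegrityError:
        # The composite primary key rejected a (DateKey, OrderNumber) repeat.
        return False


first_load = load_facts(conn, rows)
second_load = load_facts(conn, rows)  # same data again, should be rejected
row_count = conn.execute("SELECT COUNT(*) FROM factOrder").fetchone()[0]
```

Because the whole batch is rolled back on a conflict, a failed re-load leaves the fact table exactly as it was.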
