II 型尺寸连接

发布于 2024-07-28 22:39:35 字数 1051 浏览 8 评论 0原文

我在 OLTP 中有以下表查找表

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL,
    TransactionStateName VarChar (100)
)

当它进入我的 OLAP 时,我按如下方式更改结构:

CREATE TABLE TransactionState
(
    TransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime NULL
)

我的问题是关于 TransactionStateId 列。 随着时间的推移,我的 OLAP 中可能会有重复的 TransactionStateId 值,但通过 StartDateTime 和 EndDateTime 的组合,它们将是唯一的。

我见过 Type-2 维度的示例,其中添加了 OriginalTransactionStateId,并将传入的 TransactionStateId 映射到它,加上新的 TransactionStateId IDENTITY 字段成为 PK 并用于连接。

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL,
    OriginalTransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime NULL
)

我应该选择单身汉#2 还是单身汉#3?

I have the following table lookup table in OLTP

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL,
    TransactionStateName VarChar (100)
)

When this comes into my OLAP, I change the structure as follows:

CREATE TABLE TransactionState
(
    TransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime NULL
)

My question is regarding the TransactionStateId column. Over time, I may have duplicate TransactionStateId values in my OLAP, but with the combination of StartDateTime and EndDateTime, they would be unique.

I have seen samples of Type-2 Dimensions where an OriginalTransactionStateId is added and the incoming TransactionStateId is mapped to it, plus a new TransactionStateId IDENTITY field becomes the PK and is used for the joins.

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL,
    OriginalTransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime NULL
)

Should I go with bachellorete #2 or bachellorete #3?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

第几種人 2024-08-04 22:39:35

通过这句话:

通过组合 StartDateTimeEndDateTime,它们将是唯一的。

你的意思是它们永远不会重叠或者它们满足数据库UNIQUE约束?

如果是前者,那么您可以在连接中使用 StartDateTime,但请注意,它可能效率较低,因为它将使用 "<=" 条件而不是 >“=”。

如果是后者,那就用假身份吧。

数据库通常不允许对此查询使用有效的算法:

SELECT  *
FROM    TransactionState
WHERE   @value BETWEEN StartDateTime AND EndDateTime

,除非您对 SPATIAL 数据执行神秘的技巧。

这就是为什么您必须在 JOIN: 中使用此条件

SELECT  *
FROM    factTable
CROSS APPLY
        (
        SELECT  TOP 1 *
        FROM    TransactionState
        WHERE   StartDateTime <= factDateTime
        ORDER BY
                StartDateTime DESC
        )

,这将剥夺优化器使用 HASH JOIN 的可能性,而 HASH JOIN 在许多情况下对于此类查询来说是最有效的案例。

有关此方法的更多详细信息,请参阅本文:

重写查询以使其可以使用 HASH JOIN 会带来 600% 倍的性能提升,但这只有在您的日期时间精度达到一天或更低的情况下才有可能实现(或者哈希表会变得非常大)。

由于您的时间组件已删除 StartDateTimeEndDateTime,因此您可以创建如下 CTE

WITH    cal AS
        (
        SELECT CAST('2009-01-01' AS DATE) AS cdate
        UNION ALL
        SELECT DATEADD(day, 1, cdate)
        FROM   cal
        WHERE  cdate <= '2009-03-01'
        ),
        state AS
        (
        SELECT  cdate, ts.*
        FROM    cal
        CROSS APPLY
                (
                SELECT  TOP 1 *
                FROM    TransactionState
                WHERE   StartDateTime <= cdate
                ORDER BY
                        StartDateTime DESC
                ) ts
        WHERE   ts.EndDateTime >= cdate
        )
SELECT  *
FROM    factTable
JOIN    state
ON      cdate = DATE(factDate)

如果您的日期范围跨度超过 100 日期,调整 CTE 上的 MAXRECURSION 选项。

By this phrase:

With the combination of StartDateTime and EndDateTime, they would be unique.

you mean that they never overlap or that they satisfy the database UNIQUE constraint?

If the former, then you can use the StartDateTime in joins, but note that it may be inefficient, since it will use a "<=" condition instead of "=".

If the latter, then just use a fake identity.

Databases in general do not allow an efficient algorithm for this query:

SELECT  *
FROM    TransactionState
WHERE   @value BETWEEN StartDateTime AND EndDateTime

, unless you do arcane tricks with SPATIAL data.

That's why you'll have to use this condition in a JOIN:

SELECT  *
FROM    factTable
CROSS APPLY
        (
        SELECT  TOP 1 *
        FROM    TransactionState
        WHERE   StartDateTime <= factDateTime
        ORDER BY
                StartDateTime DESC
        )

, which will deprive the optimizer of possibility to use HASH JOIN, which is most efficient for such queries in many cases.

See this article for more details on this approach:

Rewriting the query so that it can use HASH JOIN resulted in 600% times performance gain, though it's only possible if your datetimes have accuracy of a day or lower (or a hash table will grow very large).

Since your time component is stripped of your StartDateTime and EndDateTime, you can create a CTE like this:

WITH    cal AS
        (
        SELECT CAST('2009-01-01' AS DATE) AS cdate
        UNION ALL
        SELECT DATEADD(day, 1, cdate)
        FROM   cal
        WHERE  cdate <= '2009-03-01'
        ),
        state AS
        (
        SELECT  cdate, ts.*
        FROM    cal
        CROSS APPLY
                (
                SELECT  TOP 1 *
                FROM    TransactionState
                WHERE   StartDateTime <= cdate
                ORDER BY
                        StartDateTime DESC
                ) ts
        WHERE   ts.EndDateTime >= cdate
        )
SELECT  *
FROM    factTable
JOIN    state
ON      cdate = DATE(factDate)

If your date ranges span more than 100 dates, adjust MAXRECURSION option on CTE.

半衬遮猫 2024-08-04 22:39:35

请注意,IDENTITY(1,1) 是该列中自动生成值的声明。 这与 PRIMARY KEY 不同,后者是使列成为主键聚集索引的声明。 这两个声明意味着不同的事情,如果您不说PRIMARY KEY,则会产生性能影响。

Please be aware that IDENTITY(1,1) is a declaration for auto-generating values in that column. This is different than PRIMARY KEY, which is a declaration that makes a column into a primary key clustered index. These two declarations mean different things and there are performance implications if you don't say PRIMARY KEY.

情释 2024-08-04 22:39:35

您还可以使用 SSIS 加载 DW。 在缓慢变化的维度 (SCD ) 变换,可以设置如何对待每个属性。 如果选择历史属性,则类型 2 SCD 将应用于整行,并且转换会处理细节。 如果您喜欢 start_dateend_datecurrent/expired 列,您还可以进行配置。

这里要区分的是主键和业务(自然)键之间的区别。 主键唯一标识表中的一行。 业务键唯一标识业务对象/实体,并且可以在维度表中重复。 每次应用SCD 2时,都会插入一个新行,具有新的主键,但业务键相同; 然后,旧行被标记为已过期,而新行被标记为当前行,或者适当填充开始日期和结束日期字段。

DW不应该暴露主键,因此从OLTP传入的数据包含业务键,而主键的分配由DW控制; IDENTITY int 有利于维度表中的 PK。

最酷的是 SSIS 中的 SCD 转换可以解决这个问题。

You could also use SSIS to load the DW. In the slowly changing dimension (SCD) transformation, you can set how to treat each attribute. If a historical attribute is selected, the type 2 SCD is applied to the whole row, and the transformation takes care of details. You also get to configure if you prefer start_date, end_date or a current/expired column.

The thing to differentiate here is difference between the primary key and a the business (natural) key. Primary key uniquely identifies a row in the table. Business key uniquely identifies a business object/entity and it can be repeated in a dimension table. Each time a SCD 2 is applied, a new row is inserted, with a new primary key, but the same business key; the old row is then marked as expired, while the new one is marked as current -- or start date and end date fields are populated appropriately.

The DW should not expose primary keys, so incoming data from OLTP contains business keys, while assignment of primary keys is under control of the DW; IDENTITY int is good for PKs in dimension tables.

The cool thing is that SCD transformation in SSIS takes care of this.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文