为什么 NULL 值在事实表中映射为 0?

发布于 2024-12-18 16:21:17 字数 49 浏览 2 评论 0原文

在事实表(维度建模数据仓库)的度量字段中,NULL 值通常映射为 0 的原因是什么?

What is the reason that in measure fields in fact tables (dimensionally modeled data warehouses) NULL values are usually mapped as 0?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

萌︼了一个春 2024-12-25 16:21:17

尽管您已经接受了另一个答案,但我想说使用 NULL 实际上是更好的选择,原因有几个。

第一个原因是,当 NULL 存在时,聚合会返回“正确”答案(即用户倾向于期望的答案),但当使用零时,会给出“错误”答案。考虑这两个查询中 AVG() 的结果:

-- with zero; gives 1.5
select SUM(measure), AVG(measure)
from
(
select 1.0 as 'measure'
union all
select 2.0
union all
select 3.0
union all
select 0
) dt

-- with null; gives 2
select SUM(measure), AVG(measure)
from
(
select 1.0 as 'measure'
union all
select 2.0
union all
select 3.0
union all
select null
) dt

如果我们假设此处的度量是“制造商品的天数”,并且 NULL 表示仍在生产的商品,那么零会给出错误的答案。同样的推理也适用于 MIN() 和 MAX()。

第二个问题是,如果零是默认值,那么如何区分默认值零和实际值零?例如,考虑“欧元运费”的衡量标准,其中 NULL 表示客户自己领取订单,因此没有运费,零表示订单免费运送给客户。你不能用零来代替NULL而不完全改变数据的含义。您显然可以争辩说,从其他维度(例如运输方式)来看,区别应该很明显,但这增加了报告和理解数据的复杂性。

Although you've already accepted another answer, I would say that using NULL is actually a better choice, for a couple of reasons.

The first reason is that aggregates return the 'correct' answer (i.e. the one that users tend to expect) when NULL is present but give the 'wrong' answer when you use zero. Consider the results from AVG() in these two queries:

-- with zero; gives 1.5
select SUM(measure), AVG(measure)
from
(
select 1.0 as 'measure'
union all
select 2.0
union all
select 3.0
union all
select 0
) dt

-- with null; gives 2
select SUM(measure), AVG(measure)
from
(
select 1.0 as 'measure'
union all
select 2.0
union all
select 3.0
union all
select null
) dt

If we assume that the measure here is "number of days to manufacture item" and NULL represents an item that is still being produced then zero gives the wrong answer. The same reasoning applies to MIN() and MAX() too.

The second issue is that if zero is a default value, then how do you distinguish between zero as a default and zero as a real value? For example, consider a measure of "shipping charges in EUR" where NULL means that the customer picked up the order himself so there were no shipping charges and zero means the order was shipped to the customer for free. You can't use zero to replace NULL without completely changing the meaning of the data. You can obviously argue that the distinction should be clear from other dimensions (e.g. shipping method) but that adds more complexity to reports and understanding the data.

昵称有卵用 2024-12-25 16:21:17

这取决于您要建模的内容,但一般来说,这是为了避免执行聚合时出现复杂情况。在许多情况下,出于这些目的,将 NULL 视为 0 是有意义的。

例如,客户在给定时间段内有 NULL 订单。或者销售收入NULL的销售人员(真丢脸!)。

It depends upon what you're modeling, but in general it's to avoid complications with performing aggregates. And in many scenarios it makes sense to treat NULL as 0 for those purposes.

For example, a customer with NULL orders for a given period of time. Or a sales person with NULL sales revenue (shame on him!).

复古式 2024-12-25 16:21:17

主要原因是数据库对的处理方式与空白不同,即使它们在人眼中看起来像空白或零。

这是链接Ralph Kimball关于同一主题的旧设计技巧。

这篇博文讨论如何避免度量中出现空值并给出了一些建议。

The main reason is that the database treats nulls differently from blanks or zeros, even though they look like blanks or zeros to the human eye.

Here is a link to an old design tip by Ralph Kimball on the same topic.

This blogpost talks about avoiding nulls in measures and gives a couple of suggestions.

只为一人 2024-12-25 16:21:17

如果您打算对事实列求平均值,则应使用 NULL 而不是 0。这是我唯一一次相信 NULL 在 dwh 事实或维度中是可以的,

如果事实值未知/延迟到达,那么保留 NULL 是最好的。

诸如 MIN、MAX 之类的聚合函数在 NULLS 上工作,只是忽略它们

(据记录,Ralph Kimball 的一位助手在他的课程中说过这一点)

with goodf as
(
select 1  x
union all
select null 
union all
select 4
)
select sum(x) sumx,min(x) minx,max(x) maxx,avg(cast(x as float)) avgx 
from goodf


with badf as
(
select 1  x
union all
select 0 /* unknown */ 
union all
select 4
)
select sum(x) sumx,min(x) minx,max(x) maxx,avg(cast(x as float)) avgx 
from badf

在 badf 中,高于平均值的结果是不正确的,因为它使用未知值的零作为字面意思 0

NULL instead of 0 should be used if you intend to do an average on your fact column. This is the only time i believe NULLS are ok in a dwh fact or dimensions

if a fact value is unknown/late arriving, then leaving as NULL is best.

aggregate functions suchs as MIN,MAX work on NULLS simply ignoring them

(For the record one of Ralph Kimball's sidekicks said this in his course I intended)

with goodf as
(
select 1  x
union all
select null 
union all
select 4
)
select sum(x) sumx,min(x) minx,max(x) maxx,avg(cast(x as float)) avgx 
from goodf


with badf as
(
select 1  x
union all
select 0 /* unknown */ 
union all
select 4
)
select sum(x) sumx,min(x) minx,max(x) maxx,avg(cast(x as float)) avgx 
from badf

in badf above the average comes out incorrect as it uses the zero of the unknown value as literally 0

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文