处理数据仓库中的空值
我想询问您有关数据仓库和 SSIS/SSAS 时处理 null 或空数据值的最佳实践的意见。
我有几个事实表和维度表,它们在不同的行中包含空值。
细节:
1) 处理空日期/时间值的最佳方法是什么? 我是否应该在时间或日期维度中创建一个“默认”行,并在发现空值时将 SSIS 指向默认行?
2) 处理维度数据内的空值/空值的最佳方法是什么。 例如:“帐户”维度中的一些行在“帐户名称”列中具有空(非 NULL)值。 我应该将列内的这些空值或空值转换为特定的默认值吗?
3) 与上面的第 1 点类似 - 如果我最终得到的 Facttable 行在维度列之一中没有记录,我该怎么办? 如果发生这种情况,我是否需要每个维度的默认维度记录?
4) 关于如何在 Sql 服务器集成服务 (SSIS) 中处理这些操作有什么建议或提示吗? 使用最佳数据流配置或最佳转换对象会很有帮助。
谢谢 :-)
I'd like to ask your input on what the best practice is for handling null or empty data values when it pertains to data warehousing and SSIS/SSAS.
I have several fact and dimension tables that contain null values in different rows.
Specifics:
1) What is the best way to handle null date/times values? Should I make a 'default' row in my time or date dimensions and point SSIS to the default row when there is a null found?
2) What is the best way to handle nulls/empty values inside of dimension data. Ex: I have some rows in an 'Accounts' dimensions that have empty (not NULL) values in the Account Name column. Should I convert these empty or null values inside the column to a specific default value?
3) Similar to point 1 above - What should I do if I end up with a Facttable row that has no record in one of the dimension columns? Do I need default dimension records for each dimension in case this happens?
4) Any suggestion or tips in regards to how to handle these operation in Sql server integration services (SSIS)? Best data flow configurations or best transformation objects to use would be helpful.
Thanks :-)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
正如前面的答案所述,维度的空值可能有许多不同的含义,未知、不适用、未知等。如果能够在应用程序中区分它们,添加“伪”维度条目会有所帮助。
在任何情况下,我都会避免使用空事实外键或维度字段,即使有一个“未知”维度值也将帮助您的用户定义包含包罗万象的分组的查询,其中数据质量不是 100%(并且它从来没有)。
我一直在使用的一个非常简单的技巧是使用 T-sql 中的 int IDENTITY(1,1) 定义维度代理键(从 1 开始,每行增加 1)。 伪键(“不可用”、“未分配”、“不适用”)被定义为负整数,并由在 ETL 过程开始时运行的存储过程填充。
例如,创建的表为
以及填充该表的存储过程,
我规定每个维度至少有一个这样的伪行,在维度查找失败的情况下使用,并构建异常报告来跟踪维度的数量分配给这些行的事实。
As the previous answer states there can be many different meanings attached to Null values for a dimension, unknown, not applicable, unknown etc. If it is useful to be able to distinguish between them in your application adding "pseudo" dimension entries can help.
In any case I would avoid having either Null fact foreign keys or dimension fields, having even a single 'unknown' dimension value will help your users define queries that include a catch-all grouping where the data quality isn't 100% (and it never is).
One very simple trick I've been using for this and hasn't bitten me yet is to define my dimensions surrogate keys using int IDENTITY(1,1) in T-sql (start at 1 and increment by 1 per row). Pseudo keys ("Unavailable", "Unassigned", "Not applicable") are defined as negative ints and populated by a stored procedure ran at the beginning of the ETL process.
For example a table created as
And a stored procedure populating the table with
I have made it a rule to have at least one such pseudo row per dimension which is used in cases where the dimension lookup fails and to build exception reports to track the number of facts which are assigned to such rows.
NULL 或日期维度中具有适当含义的保留 ID。 请记住,NULL 实际上可以有许多不同的含义,它可能是未知的、不适用的、无效的等。
我更喜欢空字符串(并且不可为 NULL),但在我现在正在处理的项目中,将空字符串转换为 NULL 并允许它们进入数据库。 要讨论的一个潜在问题是空白中间声母(没有中间名,因此已知中间声母为空)与未知中间声母或类似语义不同。 为了钱,我们的模型允许 NULL - 事实上我对此有一个很大的问题,因为通常它们确实应该是 0,它们总是被用作 0 并且它们总是必须用 ISNULL() 包装。 但由于将空字符串转换为 NULL 的 ETL 策略,它们被设置为 NULL - 但这只是固定宽度传输文件格式的一个产物,它在某些源系统中使用空格而不是 0。
我们的事实表通常具有基于所有维度的 PK,因此这是不允许的 - 它将链接到虚拟或未知维度
在 SSIS 中,我制作了一个修剪组件,用于修剪所有字符串末尾的空格。 我们通常必须在 SSIS 中进行大量日期验证和转换,这在组件中是最好的。
Either NULL or a reserved id from your date dimension with appropriate meaning. Remember NULL really can have many different meanings, it could be unknown, inapplicable, invalid, etc.
I would prefer empty string (and not NULLable), but in the project I'm working on now converts empty string to NULL and allows them in the database. A potential problem to be discussed is that a blank middle initial (no middle name, so middle initial is known to be empty) is different from an unknown middle initial or similar semantics. For money, our model allows NULLs - I have a big problem with this in the facts, since typically, they really should be 0, they are always used as 0 and they always have to be wraped with ISNULL(). But because of the ETL policy of converting empty string to NULL, they were set to NULL - but this was just an artifact of the fixed-width transport file format which had spaces instead of 0 from some source systems.
Our fact tables usually have a PK based on all the dimensions, so this wouldn't be allowed - it would be linked to a dummy or unknown dimension
In SSIS I made a trim component which trims spaces from the ends of all strings. We typically had to do a lot of date validation and conversion in SSIS, which would have been best in a component.
感谢您的投入,
我在最新项目中所做的两件事是:
1)使用 Steve 关于未知/特殊维度值的负 ID 键的建议。 这非常有效,并且在 SSAS 多维数据集构建过程中没有出现任何问题。
2) 创建转换来检查值是否为空,如果是,则转换为 -1(维度中的未知记录),或者如果它是度量值,则转换为 0。表达式如下所示作为示例(我在派生列转换):
希望这对将来的其他人有帮助;-)
Thanks for the input,
Two things I have done on my latest project are:
1) Used Steve's suggestion about negative ID keys for Unknown/special dimension values. This has worked perfectly and no issues arose during the SSAS cube building process.
2) Created transformations to check if a value is null, and if so, convert to either -1 (Unknown record in dimension) OR if it's a measure value, convert to 0. The expressions are shown below as examples (I used these in Derived column transformations):
Hopefully this helps someone else in the future ;-)
我可以建议的另一个解决方案是,在 ETL 步骤期间定义一个传输表,在所有必要的转换之后导入的记录将临时存储在该传输表中。
我会在该传输表中添加一些额外的属性,以允许某人; 紧邻原始值属性,可以为 NULL 或其他不需要的值; 插入一个“编码”值,一方面标识问题,另一方面插入错误值所在的属性名称。
完成此操作后,我仍然可以决定如何在后续步骤中使用非规范化和传输的数据...可能会过滤掉错误值或在单独的错误维度中提及它们,以便将其包含在报告中,说明哪些值不正常以及它们如何变化/可能会影响聚合值。
例如,
另一个属性 =
IdOrder
、BirthDate
、OrderAmount
等。当然,如果记录可以有多个,那么您会遇到更多麻烦1 个错误(NULL)值,但在这种情况下,可以扩展“跟踪”属性的数量或“返回源”并找出问题发生的位置和原因(与开发部门一起)。
这在某种程度上是一个复杂的步骤,但是为了完整性和正确性,我认为这是不可避免的和必要的,因为否则人们可能会遇到聚合不良的信息。
也许这也会对某人有所帮助;)
Another solution i can suggest is that during the
ETL-step
a transfer table is defined into which imported records are temporarily stored AFTER all the necessary transformations.I would add a few extra attributes to that transfer table allowing someone; next to the original value-attributes that can be NULL or some other not-desired value; to insert a "coded" value identifying the problem on the one hand and the attribute-name in which the erroneous value occurred.
Having done that i could still decide how to use the denormalized and transferred data in a later step... possibly filtering out the erroneous values or mentioning them in a separate error-dimension for inclusion in reports stating which values were deviant and how they can/could possibly affect the aggregated values.
e.g.
and the other attribute =
IdOrder
,BirthDate
,OrderAmount
, etc.Of course you are in a lot more trouble if records can have MORE than 1 erroneous (NULL) value, but in that case one could either expand the number of "tracing" attributes or "return to source" and find out where and why the problem occured (together with development dep.)
It is somewhat an involved step, however for the sake of completeness and correctness i suppose it's inevitable and necessary because otherwise one might be confronted with badly aggregated information.
Maybe this too will help someone ;)