对开发人员的元数据要求
我的任务是提供我们的数据仓库开发人员可能需要的元数据要求列表。
这不是业务元数据(很好的描述等),而是变更管理(也称为影响评估)、数据沿袭等所需的数据。
我已经看过这篇文章元元数据数据 - Ralph Kimball 但由于我不是第一个这样做的人,所以我将其扔给 SO社区。
实际的问题是这样的: 数据仓库开发人员需要哪些元数据来设计、开发和管理 ETL 例程中的变更?
PS:我试图让答案与平台无关,但就上下文而言,这是一个带有 PL/SQL 和 Datastage 的 Oracle 数据库。
I'm tasked with providing a list of metadata requirements our data warehouse developers might need.
This is not the business metadata (nice descriptions etc), but rather data required for change management (also known as impact assesment), data lineage etc.
I've seen this article Meta Meta Data Data - Ralph Kimball but as I'm not the first person to do this I'm throwing it to the SO community.
The actual question is this:
What metadata do datawarehouse developers require to design, develop and manage change in ETL routines?
PS: I'm trying to keep the answer platform agnostic but for context this is an Oracle database with PL/SQL and Datastage.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
在我的工作场所,我们有自制的 ETL。我可以看到你扬起眉毛:)。我们拥有的最小元数据描述如下。订阅详细信息、审核、数据映射、运行顺序。
订阅详细信息再次分为两类:购买数据的供应商和使用数据的团队/应用程序。还存储 ftp/http 详细信息、访问凭据。幸运的是,我们被要求拥有绝对零的 SP,即主要异常“身份生成器”。
审核详细信息涉及数据日期、上次修改时间、运行它的用户、失败/成功计数。
数据映射表描述保存数据的表和列名称。我们曾经有一个额外的复合键描述符表。然而我们决定废除这一点。通过要求数据表所有者创建适当的分区策略来补偿性能损失。
Run_order 表 是我们的另一个表,它确定用户是否可以运行(是/否)以及运行发生的顺序。
元数据还与历史记录一起存储(基于日期)。因此,如果有人决定运行存档/历史订阅。跑步会继续进行。
上述用途:我们可以根据订阅的重要性对数据加载进行优先级排序。我们可以在通用级别(鸟瞰图)监控故障。我们可以编写可以创建动态 SQL 查询的通用代码(无需硬编码)。我们的加载和提取过程被迫使用数据映射表,因此任何用户都无法摆脱陈旧的信息。
到目前为止,根据我们的经验,这似乎是有效的。
At my workplace, we have home-brew ETL. I can see you raise a brow :). The minimal meta-data we have describes the following. Subscription details, Audit, Data-mapping, Run-order.
The subscription details again fall into two categories, vendor from whom the data was purchased and teams/applications using it. The ftp/http details, access credentials are also stored. Fortunately we were asked to have absolutely zero SPs, Major exception "identity generators".
Audit details involve, date of data, last modified time, user who ran it, failure/success count.
Data-mapping table describes the tables and column names which hold the data. We used to have an additional composite-key descriptor table. However we decided to do away with that. The performance loss was compensated by asking data table owners to create proper partitioning strategy.
Run_order table is another table we have which determines if the user can run (Y/N) and the order in which runs can occur.
The meta-data is also stored with an history (based on date). So if any one decides run an archived/historic subscription. The run would go ahead.
Uses of the above: We can prioritize the data loads based on importance of subscription. We can monitor failures at a generic level (birds-eye view). We can write generic code that can create dynamic sql queries (no hard-coding). Our load and extract processes are forced to use the data-mapping table, so no user can get away with stale information.
This seemed to work so far in our experience.
在最基本的级别上,您可以维护从源系统中提取的表名.字段名的列表。如果您从文件或其他非数据库源中提取数据,那么您可能无法像字段级别一样详细地列出依赖项。此类元数据只会告诉您 If(源字段/文件格式/表更改) 那么(ETL 过程中可能需要进行一些更改)。我说可能是因为有些变化可能很小,以至于 ETL 过程不会被破坏,但仍然应该执行测试和数据分析,以确保它不会使初始期间存在的任何假设失效。 ETL流程的设计。
虽然 ETL 开发人员可能非常熟悉源系统和 ETL 流程,但对于大型 ETL 项目,将源系统中的每个依赖项与 ETL 系统的特定流程/组件联系起来可能是明智的。例如,如果您的 ETL 过程由许多存储过程组成,那么您可能希望元数据与每个源系统字段/文件格式/表/等相关。到每个存储过程。这将是多对多关系,因为许多存储过程可能依赖于特定字段,并且单个存储过程可能依赖于许多源字段。这将是由 ETL 开发人员手动更新的内容,当他们创建或更新 ETL 系统时,他们有责任识别他们正在加载/清理/符合哪些字段,然后将该信息输入到跟踪这些依赖项的元数据中。
如果您的 ETL 系统由存储过程之外的其他东西或其某种组合提供支持,那么您将需要提出一些命名方案来引用这些组件。但概念是相同的,您将源字段/文件格式/等相关联。 ETL 系统的组件。
这将为您提供足够的信息来说明源系统中“Person.FirstName”以某种方式发生更改的情况,然后您可以编译一份报告,显示所有需要验证、可能更新和测试的存储过程以应对随着源系统的变化。
这意味着,如果知道 Person.FirstName 以某种方式发生了变化,无论是大小、数据类型和/或完全删除,都需要手动步骤通过某些数据库设计者收到更改通知,并采取响应措施。如果您想要一个真正复杂的系统,那么您需要对更改源系统的 DDL 进行触发器/审核,以便您可以自动记录更改并通知您的 ETL 架构师。
如果发生这样的更改,您会知道 sp_Person_Load、sp_Person_Clean、sp_Person_Transform 存储过程都与 Person.FirstName 字段进行一些处理,因为这些存储过程的作者在记录依赖项的元数据中指出了这一点。
您可以使其变得更复杂,其中 sp_Person_Clean 不依赖于 Person.Firstname,但实际上依赖于 sp_Person_Load。这样您就可以构建依赖链。这将使更改报告变得更加复杂,因为您必须将依赖项链接在一起才能确定源系统更改的影响。而且您还构建了一系列复杂的依赖项和潜在的循环引用,这可能会使维护元数据与维护 ETL 流程本身一样困难。如果 ETL 系统足够简单,ETL 架构师可以根据源系统中的字段/文件/表定义依赖关系,那么这样做可以使事情变得简单。
根据谁对您的数据仓库(目标系统)拥有权限,您可能需要跟踪这些依赖关系。 ETL 开发人员通常也是数据仓库开发人员。但是,如果其他人是数据仓库设计者,并且他们有权对数据仓库进行更改,那么您的 ETL 开发人员将需要有一个类似的系统(无论是自动的还是手动的)来跟踪更改并收到更改通知以及它们对 ETL 流程的影响。
事实上,当我考虑如何跟踪变化时,我会考虑权力的界限。如果 ETL 开发人员更改其 sp_Person_Clean 过程,则他不需要 emtadata 来告诉他 sp_Person_Transform 需要更新/测试。他已经非常直观地知道了这一点。另一方面,如果第三方/供应商系统发生变化,或者同一组织内的业务部门更改电子表格或文件格式,那么这些都不是 ETL 开发人员制定的。他与源系统的亲密程度不会像与自己的 ETL 系统和数据仓库一样密切。因此,他将从元数据中获益最多,这些元数据显示源系统的组件如何与 ETL 系统的组件相关。
决定要定义“组件”的粒度实际上取决于系统的设计方式以及您希望开发人员记录多少元数据。太细化会让您淹没在元数据中。太当然了,这就像说“我的房子在佛罗里达州”,这对 ETL 开发人员的工作没有真正的帮助。如果整个 ETL 过程都用单个 SQL 脚本进行编码,那么您只有一个组件。因此,需要提前设计系统,并了解您需要能够引用 ETL 过程的特定组件和/或步骤。
只有开发人员勤于更新,元数据才会有任何好处。有些系统/工具包可以自动更新此类元数据,但他们必须将您放入他们的工具包中,以便该工具可以分析依赖关系。我对这些系统是否非常可靠缺乏信心。很多时候,您必须做一些非常黑客的事情才能以所需的格式将数据从源系统获取到目标,并且我可以想象依赖性分析器无法理解依赖性。例如,如果您使用由字符串组成的动态 SQL,那么工具包无法真正弄清楚依赖项是什么。
然而,我要说的是,我从未深入研究过这些工具来真正了解它们有多好。我总是发现,在 SQL 中通常很容易的事情在该工具中却非常麻烦,并且认为这比它的价值更麻烦。我不断告诉自己,在做出最终决定之前,我会选择像 Talend 这样的东西,并真正成为这方面的专家,但总有其他优先事项将我拉向其他方向。
At the most rudimentary level you could maintain a listing of tablename.fieldname's that are extracted from the source system. If you are extracting from files or other non-database sources then you may or may not be able to list the dependencies as granular as the field level. Such meta data would tell you only that If(source field/fileFormat/table changes) then (some change may be needed in the ETL process). I say may because some changes may be minor enough that the ETL process is not broken, but one should still perform testing and data profiling to ensure it doesn't invalidate any assumptions that were in place during the initial design of the ETL process.
While the ETL developer may be intimately familiar with the source system and ETL process, for large ETL projects it would probably be wise to tie each dependency in the source system to specific processes/components of the ETL system. For example, if your ETL process is composed of many stored procedures, then you would want the metadata to relate each source system field/fileFormat/table/etc. to each stored procedure. This would be a many-to-many relationship since many stored procedures might depend on a particular field, and a single stored procedure may depend on many source fields. This would be something manually updated by the ETL developers, as they create or update the ETL system, it would be their responsibility to recognize what fields they are loading/cleaning/conforming and subsequently input that information into the metadata that tracks these dependencies.
If your ETL system is powered by something besides stored procedures, or some combination there of, then you will need to come up with some naming scheme to reference these components. But the concept is the same, you relate source fields/file formats/etc. to the components of the ETL system.
This will give you enough information to say that where "Person.FirstName" changes in the source system in some way, then you can compile a report that shows all the stored procedures which will need to be verified, potentially updated, and tested to cope with the change in the source system.
This kind of implies that knowing that Person.FirstName changed in some way, either size, datatype, and/or removed entirely, requires a manual step of being notified of the change via some database designer, and taking action in response. If you want a really sophisticated system then you'd need to have triggers/audit on DDL that changes your source system so that you can automatically log and notify your ETL architects of the change.
Should such a change occur, you'd know that sp_Person_Load, sp_Person_Clean, sp_Person_Transform stored procedures all have some dealings with the Person.FirstName field, because the author of those stored procedures noted that in the metadata documenting dependencies.
You could make it more complicated, where sp_Person_Clean doesn't depend on Person.Firstname, but actually depends on sp_Person_Load. Such that you build a chain of dependencies. This would make reporting on changes more complicated because you would have to chain dependencies together to determine the impact of a source system change. And you are also building a complex string of dependencies, and potential circular references, that might make maintaining the metadata as difficult as maintaining the ETL process itself. If the ETL system is simple enough that the ETL architect can define dependencies in terms of the fields/files/tables from the source system, then do that to keep things simple.
Depending on who has authority over your data warehouse, the destination system, you may need to track these dependencies. Often the ETL developer is also the data warehouse developer. However, if someone else is the data warehouse designer, and they have the authority to make changes to the data warehouse, then your ETL developers will need to have a similar system, whether it be automated or manual, to track and be notified of changes and the impact they will have on the ETL process.
Really when I think about how changes should be tracked, I think about the boundaries of authority. If a ETL developer changes his sp_Person_Clean procedure, he doesn't need emtadata to tell him that sp_Person_Transform will need to be updated/tested. He already knows this very intuitively. On the other hand, if a third party/vendor system changes, or if a business department within the same organization changes a spreadsheet or file format, then those are things not enacted by the ETL developer. He won't have the same level of intimacy with the source systems as he does with his own ETL system and data warehouse. Thus he will benefit the most from metadata that shows how components of the source system relates to components of the ETL system.
Deciding how granular you want to define "components" will really depend on how the systems are designed, and how much meta data you want the developer to document. Too granular and you drown in the metadata. Too course and it is like saying "My house is in Florida", which doesn't really help the ETL developer in his work. If the entire ETL process is coded in a single SQL script, then you only have one component. So the system needs to be designed in advance knowing that you need to be able to refer to specific components and/or steps of the ETL process.
The metadata will only be any good if the developer is diligent in updating it. There are systems/toolkits that can automatically update this kind of metadata, but they would have to box you into their toolkit so that the tool can analyze the dependencies. I have little confidence that these systems are very reliable. There are often times you have to do really hackish things to get data from a source system into the destination in the desired format, and I can imagine a dependency analyzer couldn't understand the dependencies. For example, if you were using dynamic SQL formed from strings, then the toolkit can't really figure out what the dependencies are.
However, I will say I have never dove into great depth into the tools to really know how good they are. I always have gotten to the point that I found things that would normally be easy in SQL to be very cumbersome in the tool, and decided it was more trouble than it was worth. I keep telling myself I will pick something up like Talend and really become an expert at it before I make my final verdict, but there are always other priorities pulling me in other directions.