SSIS 处理“不良”问题的推荐实践数据
早在 2000 年,我就使用 SQL Server DTS 包进行了一些 ETL 开发。我开始使用设计器来创建包,但最终决定使用 SQLDMO(数据库管理对象)库来动态生成包。在较高级别上,生成的包将读取输入文件并将数据传输到临时表。任何不符合预期列数据类型的记录都将作为错误记录和一条错误消息记录被踢出,说明该记录验证失败的原因。这段代码一般只写一次。
为了实现此目的,首先将数据从平面 CSV 布局源文件 DTS 到“无类型”目标表中,其中表中的列与输入文件中的列匹配,但列全部定义为 varchar(255)。目的是首先将原始数据放入一个表中,然后可以针对非类型化表运行动态生成的 SQL,以识别我们知道不会转换为预期数据类型的数据。类型或长度错误的记录将自动被踢出到与输入文件布局相同的错误表中。然后,公共处理库将动态创建 DTS 包,将剩余记录传输到“类型化”表中,其中所有列现在都是预期的类型,而不仅仅是 varchar。所有这些表也是在运行时动态创建的,全部基于预期文件布局的定义。
在这个初始阶段之后还有更多的事情,但是将数据放入数据库并识别“坏”数据并记录每条记录坏的原因现在再次成为新系统的焦点,这次使用 SSIS 包。我正在考虑重新创建我在 2000 年所做的事情,使用 SQLMO 生成 SSIS 包并创建一个公共库来使用 .NET 处理这个过程,但我想获得关于其他人正在做的可能更好的建议。
我个人对工具的偏好倾向于使用可视化方法与设计者一起开发 SSIS 包。就我个人而言,我觉得通过使用 .NET 创建一个可重用的库,我拥有最大的控制权和最好的开发环境,该库抽象了整个流程并利用文件系统、SSIS 和 SQL Serve,同时拥有丰富的开发语言。
我发现将文件定义放在表中具有各种附带好处。例如,我决定每个文件定义都有一个标准标头,该标头将唯一标识应使用哪种文件格式来解析它。然后,我编写了一个编辑器,它可以在网格中显示文件的内容,并允许动态编辑或动态生成 DTS 包,并将数据传输到表中,以允许作为独立工具对数据进行查询。此外,我生成了将文件布局显示为文档的 Word 文档,并创建了可由标准工具使用的库来提取指定格式的数据,因此使用它的源系统将保证以正确的格式提供数据。
所以,最后,我的问题是:
- 微软建议如何处理和处理不良数据,以及如何在不重新发明轮子的情况下完成每个包?我是 SSIS 新手。
Way back in 2000, I did some ETL development with SQL Server DTS packages. I started out using the designer to create the packages but then eventually settled with using the SQLDMO (Database Mgt Objects) Library to dynamically generate packages. At a high level, the generated packages would read an input file and transfer the data to a staging table. Any records that did not conform to the expected column data type were kicked out as error records and a error message record as to why the record failed validation. This code was written generically once.
To accomplish this, the data was first DTSed from the flat CSV layout source file into a "untyped" dest table where the columns in the table matched those in the input file but the columns were all defined as varchar(255). The intent was to get first get the raw data into a table where dynamically generated SQLs could then be run against the untyped table to indentify data that we know would not convert to the expected data type. Records that were of the wrong type or length would automatically be kicked out to an error table of the same layout as the input file. The common processing library would then dynamically create a DTS package to transfer the remaining records into a "typed" table where all of the columns were now of the types that were expected, not just varchar. All of these tables were dynamically created at run time, too, all based on the definition of the expected file layout.
There was more too it after this initial phase, but getting the data into the database and identify "bad" data and recording the reason each record is bad is now once again the focus of a new system, this time using SSIS packages. I was thinking of kind of recreating what I did back in 2000 using SQLMO to generate SSIS packages and create a common library to handle this process using .NET but I wanted to get advise on what other people are doing that may be better.
My personal preference for tools leans away from developing SSIS Packages with the designer using the visual approach. Personally, I feel like I have the most control and the best development environment by using .NET to create a reusable library that abstracts the whole process and leverages the File System, SSIS and SQL Serve while having a rich development lanaguage.
I found that putting file definitions in a table to have all sorts of side benefits. For example, I decided that every file defintiion would have a standard header which would uniquely identify which file format should be used to parse it. I then wrote a editor which would display the contents of a file in a grid and allow editing or dynamically generate a DTS package on the fly a transfer the data to a table to allow queries against the data as a standalone tool. Furthermore, I generated Word documents showing the file layouts as documentation and created libraries that could be used by standard tools to EXTRACT data in the specified format so source systems using it would be guranteed to provide the data in the correct format.
So, finally, my questions:
- How does Microsoft recommend that bad data be handled and process and how can this be done without reinventing the wheel with every package? I'm an SSIS rookie.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
处理提取时的错误数据没有“灵丹妙药”,我怀疑是否只有一种实践可以做到这一点。
通常,以下是从外部源导入数据时可用的选项(基本上是“提取”):
只要你的实现能够支持这三种场景并且允许用户选择你就或多或少的涵盖了所有可能的应用。
There is no "silver bullet" in handling erroneous data on extraction and I doubt that there is only a single practice how to do it.
Usually following are the options are available while importing the data from external source (that basically "extraction" is):
As long as your implementation can support these three scenarios and allows the user to choose you are more or less covering all possible applications.