ETL 的字段特定错误
我正在 MS SQL Server 中创建 ETL 过程,并且希望出现特定于特定行的特定列的错误。例如,数据最初从 Excel 文件加载到一个表(我们称之为初始表)中,其中所有列都是 varchar(2000)
,然后我将数据暂存到另一个表(DataTypedTable) )包含更具体的数据类型(datetime
、int
等)或更严格约束的 varchar 长度。我需要能够为特定字段创建错误消息,例如: “1 月 13 日”不是提交日期的有效日期格式。请使用 MM/DD/YYYY 格式
这些错误消息需要以某种方式存储,以便稍后在该过程中自动化流程可以创建包含错误消息的报告,以便每条消息引用特定的行和字段(有人会需要返回并更正源系统中的数据并重新提交 Excel 文件)。因此,理想情况下,它将被插入到某种类型的失败表中,并包含失败行的主键、列名和错误消息。
问题所以我想知道这是否可以通过 SSIS 或 Talend 之类的开源工具来完成,如果可以,您的一般方法是什么?或者您会采取什么手工编码方法?
我想过使用 SQL 的几种方法(直到没有我在 SQL 过程中手动完成 ETL,但我想考虑其他方法。甚至可能是 C#。):
使用游标读取初始表,并为每个行将仅包含主键的空白记录插入到 DataTyped 表中,然后对每一列使用单个更新语句,这样,如果更新失败,我可以在错误消息表中插入特定于该列的非常具体的错误消息。
将所有数据按原样插入到 DataTyped 表中,但包含重复的列,例如 SubmissionDate
和 SubmissionDateOld
。初始插入后,*Old 列有数据,其余为空白,并且我对每个列进行一次更新,根据 SubmissionDateOld 设置 SubmissionDate。
除了建议一种方法之外,我还想知道您是否在您所做的工作中使用该方法或类似的方法。
I am creating a ETL process in MS SQL Server and I would like to have errors specific to a particular column of a particular row. For example, the data is initially loaded from excel files into a table(we'll call the Initial table) where all columns are varchar(2000)
and then I stage the data to another table(the DataTypedTable) that contains more specific data types (datetime
,int
, etc.) or more tightly constrained varchar lengths. I need to be able to create error messages for a specific field such as:
"Jan. 13th" is not a valid date format for the submission date. Please use a format of MM/DD/YYYY
These error messages would need to be stored in some way such that later in the process a automated process can create reports with the error messages such that each message references a specific row and field(someone will need to go back and correct the data in the source system and resubmit the excel file). So ideally it would be inserted into a Failures tables of some sort and contain the primary key of the failed row, the column name, and the error message.
Question: So I am wondering if this can be accomplished with SSIS, or some open source tool like Talend, and if so, what would be your general approach? Or what hand coded approach you would take?
Couple approaches I've thought of using SQL(up until no I have done ETL by hand in SQL procs, but I want to consider other approaches. Possible C# even.):
Use a cursor to read through the Initial table, and for each row insert a blank record with only the primary key into the DataTyped table, then use a single update statement for each column, such that if that update fails I can insert a very specific error message specific to that column in the error messages table.
Insert all the data as is into the DataTyped table, but have duplicate columns like SubmissionDate
and SubmissionDateOld
. After the initial insert the *Old columns have data, the rest are blank, and I have a single update for each column that sets the SubmissionDate based on the SubmissionDateOld.
In addition to suggesting an approach, I'd like to know if you are using that approach or something similar already in the work you do.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我使用的方法是将条件分割放入数据流中。然后,不满足条件(无效日期、必填字段中没有数据等)的记录将被发送到异常表,其中包括记录标识符、错误数据、失败原因。然后,您可以稍后根据此信息创建包含错误的电子表格或文本文件,并将其发送回提供该文件的小组。好的记录当然会转到 pther 路径并插入到表中。
I use the aproach where I put a conditional split into the data flow. The records which fail the conditions (invalid date, no data in a required field, etc.) are then sent to an exception table which includes the record identifier, the bad data, the reason it failed. You can then later on create a spreadsheet or text file of errors from this information to send back to the group providing the file. Good records of course go to the pther path and are inserted into the table.
在加载到暂存表(您称为初始表)之前进行一些清理/转换怎么样?将数据从 Excel 转储到制表符或逗号分隔的文件,然后使用您选择的一些编程语言来执行您注意到的数据清理。另外,每个数据负载有多大?您可以利用多线程或多进程应用程序来处理主要负载(例如一次加载几百万行)。在此过程中,您遇到的任何错误都可以加载到异常表中,其中包含标识符、错误和注释详细信息。该技术有助于在数据清理阶段更好地控制。
如果负载不是那么高,并且您希望在数据库 (SQL) 中完成大部分工作,那么您可能希望进行尽可能多的数据分析,并充分了解您可以预期的可能的数据变化。这样,您就可以使用适当的组件(Talend 或 SSIS)来进行转换或控制数据流。此外,通过使用正则表达式,您可以捕获任何偏离设定规则的实体。
How about some cleaning/transformation before loading into the staging (what you call initial tables) tables? Dump the data from Excel to a tab or comma separated file and then use some programming languages of your choice to do the data cleansing that you have noted. Also, how big is each data load? You can make use of multi-threaded or multi-process application to handle major loads (like loading few million rows at a time). During this process any error you encounter can be loaded into the exception table with identifier, error and comment details. This technique helps in having better control during data cleaning phase.
If the load is not that high and you want to do most of your work in database (SQL), then you may want to do as much data profiling as possible and have good understading of possible data variations that you can expect. With that you can use appropriate component (Talend or SSIS) to do the transformation or control the data flow. Also, by using regular expressions you can catch any entity that deviates from the set rule.