具有可变列号的 SSIS 平面文件
SSIS 做了两件与处理平面文件相关的事情,这些事情特别令人沮丧,似乎应该有办法解决它们,但我无法弄清楚。如果您定义一个包含 10 列的平面文件,并使用 CRLF 制表符分隔作为行结束标记,那么这对于每行正好有 10 列的文件来说非常适合。两个痛苦的场景是:
如果有人在任何地方提供了一个包含第 11 列的文件,那么如果 SSIS 简单地忽略它就好了,因为你还没有定义它。它应该只读取您定义的 10 列,然后跳到行标记的末尾,但它所做的是将任何其他数据与第 10 列中的数据连接起来,并将所有这些数据塞到第 10 列中。确实有点没用。我意识到发生这种情况是因为第 10 列的分隔符不像所有其他列那样不是制表符,而是 CRLF,因此它只是获取 CRLF 之前的所有内容,从而将多余的制表符替换为任何内容。在我看来,这并不明智。
如果有人提供只有 9 列的文件,则会发生更糟糕的情况。它将暂时忽略意外发现的 CRLF,并用下一行开头的列填充任何缺失的列!这里的“不聪明”是轻描淡写的说法。谁会希望这种事发生呢?此时文件的其余部分就是垃圾。
无论出于何种原因,文件宽度的变化似乎都不是不合理的(当然,只有行末尾的变化可以合理地处理(更少或额外的列),但看起来这根本处理不好,除非我 。
到目前为止,我们唯一的解决方案是将一行加载为一个巨大的列(column0),然后使用脚本任务使用它找到的多个分隔符来动态分割它,这很有效,只是它限制了行 宽度为 4000 个字符(一个 unicode 列的最大宽度)如果您需要导入更宽的行(例如用于文本导入的多个 4000 宽列),那么您需要如上所述定义多个列,但您会陷入要求困境。每行的列数有严格限制,
有什么办法可以解决这些限制吗?
SSIS does 2 things in relation to handling flat files which are particularly frustrating, and it seems there should be a way around them, but I can't figure it out. If you define a flat file with 10 columns, tab delimited with CRLF as the end of row marker this will work perfectly for files where there are exactly 10 columns in every row. The 2 painful scenarios are these:
If someone supplies a file with an 11th column anywhere, it would be nice if SSIS simply ignored it, since you haven't defined it. It should just read the 10 columns you have defined then skip to the end of row marker, but what is does instead is concatenate any additional data with the data in the 10th column and bung all that into the 10th column. Kind of useless really. I realise this happens because the delimiter for the 10th column is not tab like all the others, but CRLF, so it just grabs everything up to the CRLF, replacing extra tabs with nothing as it does so. This is not smart, in my opinion.
If someone supplies a file with only 9 columns something even worse happens. It will temporarily disregard the CRLF it has unexpectedly found and pad any missing columns with columns from the start of the next row! Not smart is an understatement here. Who would EVER want that to happen? The remainder of the file is garbage at that point.
It doesn't seem unreasonable to have variations in file width for whatever reason (of course only variations at the end of a row can reaonably be handled (x fewer or extra columns) but it looks like this is simply not handled well, unless I'm missing something.
So far our only solution to this is to load a row as one giant column (column0) and then use a script task to dynamically split it using however many delimiters it finds. This works well, except that it limits row widths to 4000 chars (the max width of one unicode column). If you need to import a wider row (say with multiple 4000 wide columns for text import) then you need to define multiple columns as above, but you are then stuck with requiring a strict number of columns per row.
Is there any way around these limitations?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
格伦,我感受到你的痛苦:)
SSIS 无法使列动态化,因为它需要存储每个列的元数据,并且由于我们正在使用可以包含任何类型数据的平面文件,因此它不能假设“列”中的 CRLF -that-is-not-that-last-column',确实是它应该读取的数据行的末尾。
与 SQL2000 中的 DTS 不同,您无法在运行时更改 SSIS 包的属性。
您可以做的是创建一个父包,它读取平面文件(脚本任务),并且仅读取平面文件的第一行以获取列数和列名称。该信息可以存储在变量中。
然后,父包以编程方式加载子包(再次脚本任务),并更新子包的源连接的元数据。这是您可以选择的地方
1. 添加/删除列以匹配平面文件。
2. 设置列的列分隔符,最后一列必须是 CRLF - 与 ROW 分隔符匹配
3. 重新初始化数据流任务中源组件的元数据 (ComponentMetadata.ReinitializeMetadata())(以识别源连接中最近的更改)。
4. 保存子 ssis 包。
有关以编程方式修改包的详细信息仅是现成可用的。
然后,您的父包仅执行子包(执行包任务),并且它将使用您的新映射执行。
Glenn, i feel your pain :)
SSIS cannot make the columns dynamic, as it needs to store metadata of each column as it come through, and since we're working with flat files which can contain any kind of data, it can't assume that the CRLF in a 'column-that-is-not-that-last-column', is indeed the end of the data line its supposed to read.
Unlike DTS in SQL2000, you can't change the properties of a SSIS package at runtime.
What you could do is create a parent package, that reads the flat file (script task), and only reads the first line of the flat file to get the number of columns, and the column names. This info can be stored in a variable.
Then, the parent package loads the child package (script task again) programmatically, and updates the metadata of the Source Connection of the child package. This is where you would either
1. Add / remove columns to match the flat file.
2. Set the column delimiter for the columns, the last column has to be the CRLF - matching the ROW delimiter
3. Reinitialise the metadata (ComponentMetadata.ReinitializeMetadata()) of the Source Compoenent in the Dataflow task (to recognize the recent changes in the Source Connection).
4. Save the child ssis package.
Details on programmatically modifying a package is readily available only.
Then, your parent package just executes the Child package (Execute Package Task), and it'll execute with your new mappings.