如何从 DTS/SSIS 导入中删除控制字符?
我工作的公司有很多系统,其中源数据来源于平面文件或 Excel 工作簿。然后使用 DTS 将这些文件导入到 SQL Server 数据库中。
这些文件经常包含各种控制字符(例如,我刚刚花了一个小时发现某些记录中有一个 \0
,显然 SAP 喜欢用它来填充内容),理想情况下我们希望在它们进入我们的数据库之前将其删除。
我没有大量使用 DTS/SSIS,但我认为它们中一定已经有一些东西,或者至少是我应该使用的消毒最佳实践?
任何有关如何在许多导入中以可重复的方式执行此操作的想法将不胜感激。
The company I work for has a lot of systems where the source data originates as a flat file or Excel workbook. These files are then imported using DTS into a SQL Server database.
Quite often these files contain various control characters (for instance I've just spent an hour finding that some records have a \0
in them which apparently SAP like to pad things with) which ideally we would like to strip out before they get into our database.
I've not used DTS/SSIS that heavily, but I thought that there must be something already in them, or at least a sanitisation best practice, that I should be using?
Any ideas on how to do this in a repeatable fashion across many imports would be appreciated.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
我只能根据我使用 SSIS 的经验来发言,所以如果这只是一点用处,我深表歉意。
在一般卫生方面,我还没有遇到任何适合这项工作的专用组件。我认为这是因为不同的场景对“坏数据”的定义可能有很大不同,但可以说,诸如删除不可打印字符之类的事情可能是常见的业务需求,因此这有点令人惊讶。
当然,您始终可以依靠派生列转换,或者在更复杂的情况下,依靠脚本组件,但这些很难使跨包的逻辑复制变得轻松。在某些情况下,我们使用 CozyRoc 组件来创建可以在包之间共享的脚本,但除此之外由于首先必须实际访问这些组件,因此每次需要该功能时仍然需要进行一定级别的设置。
因此,在这种情况下,我个人认为最好的选择是将常见的卫生习惯合并到 自定义数据流组件。我们为自己开发了一个来处理一些非常常见的数据验证逻辑,并设置组件编辑器,以便您可以选择复选框来启用您想要应用于给定列的任何检查。然后,只需拖放并连接数据流即可。至少对我们来说,花时间开发这个组件绝对是值得的。
我不知道 DTS 在这方面提供了什么(我认为它相当有限,或者至少不友好,但我可能是错的),所以这可能对您的旧包没有用处。我觉得凯德的解决方案在这种情况下可能是最没有压力的,假设它对于您的输入数据是可行的。
就一般 SSIS 见解而言,我找到了 Jamie Thomson 的 博客(以前是 此处)非常有帮助,因此您可能需要检查一下他之前是否讨论过这个特定主题。
无论如何,希望这至少有一点帮助,祝你好运找到解决方案。
附注:感谢您希望在将数据放入数据库之前对其进行清理。我们从一个永远不清理数据的特定供应商处获取数据,然后尝试将其放入 XML 中以通过 Web 服务将其发送给我们。不用说,XML 解析器不太喜欢不可打印的字符。
I can only speak coming from my experiences with SSIS, so I apologise if this is only marginally useful.
In terms of general sanitation, I haven't come across any sort of purpose-build component for this job. I assume that's due to the notion that different scenarios can have widely different definitions of "bad data", but arguably things like stripping out non-printable characters might be a common business requirement so this is somewhat surprising.
Naturally you can always fall back on the Derived Column Transformation or, in more complex cases, the Script Component, but these hardly make replication of the logic across packages painless. There are cases where we've used CozyRoc components to create scripts that we could share between packages, but in addition to having to actually have access to those components in the first place, it still involves a certain level of setup every time you need the functionality.
So, in cases like this, I personally feel like the best option is to consolidate your common sanitation routines into a Custom Data Flow Component. We developed one for ourselves to handle some very common data validation logic, and set the component editor up so that you could select checkboxes to enable whichever checks you wanted applied to a given column. Then it just becomes a matter of drag-and-drop and wiring up your data flow. At least for us, it was definitely worth taking the time to develop the component.
I don't know what DTS offers in this regard (I think it's fairly limited, or at least not friendly, but I could be wrong), so this is likely not useful for your legacy packages. I feel like Cade's solution would probably be most stress-free in that case, assuming it's feasible for your input data.
As far as general SSIS insight goes, I've found Jamie Thomson's blogs (previously here) very helpful, so you might want to check to see if he's discussed this particular topic before.
In any case, hopefully this was at least a little helpful, and good luck coming up with a solution.
On a side note: Kudos for wanting to sanitize your data before sticking it in your database. We get data from a particular vendor who doesn't ever clean their data, then tries to shove it in XML to send it to us via a web service. Needless to say, the XML parser isn't too fond of non-printable characters.
如果某些字符始终可以简单地替换/删除,那么我们通常使用自定义 C 程序来处理此预导入。这显然不适用于二进制文件或具有可变长度列长度指示符列或 EBCDIC 等的任何文件,但是当可以使用它时,无论导入工具如何,都可以应用它。
If certain characters can always simply be replaced/removed, then we typically handle this pre-import with a custom C program. This obviously doesn't work for binary files or anything with variable length column length indicator columns or EBCDIC, etc, but when it can be used, it can be applied irrespective of import tool.
根据您的需要,您还可以执行以下操作:
Depending on your needs, you could also do the following: