在SSIS中提取不同记录
我正在编写 SSIS 包以将数据从 *.csv 文件导入到 SQL 2008 DB。问题是其中一个文件包含 csv 文件中的重复记录,我只想从该源中提取不同的值。请看下图。
不幸的是,生成的文件不在我的控制之下,它属于第三方所有,我无法改变方式他们产生了。
我确实使用了 LookUp 组件。但它仅根据传入数据检查现有数据。它不检查传入数据中的重复记录。
I am writing the SSIS package to import the data from *.csv files to the SQL 2008 DB. The problem is that one of the file contains the duplicate records in the csv file and I want to extract only the distinct values from that source. Please see the image below.
Unfortunately, the generated files are not under my control and it is owned by the third party and I could not change the way they generated.
I did use the LookUp Component. But it only checks the existing data against the incoming data. It does not check the duplicate records in the incoming data.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
我相信排序组件提供了删除重复行的选项。
I believe the sort component gives an option to remove duplicate rows.
取决于您对重复项的重视程度。您是否需要记录重复的内容,或者只需删除它们就足够了?排序组件将消除排序字段上的重复项。但是,重复项可能在其他字段中具有不同的数据,那么您需要不同的策略。通常我将所有内容加载到暂存表并从那里进行清理。我将删除的重复数据发送到异常表(我们必须回答客户提出的很多问题,了解为什么事情与他们发送的内容不匹配),并且我经常使用一组业务规则(并使用执行 SQl 或数据流任务来执行规则),以确定如果一个区域中有重复项但另一区域没有重复项(例如,当我们只能存储 1 个时有两个营业地址)时选择哪一个。我还确保客户了解我们如何确定选择两者中的哪一个。
Depends on how serious you want to get about the duplicates. Do you need a record of what was duplicated or is it enough to just get rid of them? Sort component will get rid of dups on the sort field. However, the dups may have different data in the other fields and then you want a differnt strategy. Usually I load all to staging tables and clean up from there. I send the dupes removed to an exception table (we have to answer a lot of questions from our customers about why things don't match what they sent) and I often use a set of business rules (and use either an execute SQl or data flow tasks to enforce the rules) to determine which one to pick if there are duplicates in one area but not another (say two business addresses when we can only store 1). I also make sure the client is aware of how we determine which of the two to pick.
使用工具箱中的排序工具,然后单击它。您将获得所有可用的输入列。
检查列并更改排序类型方向,然后选中“删除具有重复排序值的行”。
Use SORT tool for that from Toolbox, then click on it. You will get all available input columns.
Check the column and change sortType direction and then check "remove rows with duplicate sort value".
按原样从 csv 文件中引入数据,然后在加载后对其进行重复数据删除。
调试起来也会更容易。
Bring in the data from the csv file the way it is, then dedup it after it's loaded.
It'll be easier to debug, too.
我使用了聚合组件和按 QualificationID 和 UnitID 进行分组。如果需要,您也可以使用排序组件。也许,我的信息可以帮助其他人。
I used Aggregate Component and Group By both QualificationID and UnitID. If you want, you can also use Sort Component too. Perhaps, my information might help others.