Version control in a large SSIS ETL project
We're about to transform data from one system to another using SSIS. We are four people who will be working on this continuously for two years, so we need some sort of version control system. We cannot use Team Foundation. We're currently configuring an SVN server, but digging into it I've seen some big risks.
It seems that a solution is stored in one huge XML file. This must be a huge problem in a combined code/drag-and-drop environment such as SSIS, as it will be impossible for SVN to merge the changes correctly, and whenever we get an error when committing we will have to look inside that huge XML file and correct the mistakes manually.
One way to solve this problem is to create many solution projects in SSIS. However, this is not really the setup we want, as we are creating one big monster which will take 2 days to execute, and we want to follow its progress as it executes. If we have to create several solutions, are there ways to link their execution and still have a visual overview of what's going on and how well the execution is doing?
Has anyone had similar problems and/or do you have any suggestions as to how to solve them?
Comments (3)
Just how many packages are you talking about? If it is hundreds of packages, then what is the specific problem you are trying to avoid? Here are a few things you might be trying to avoid based on your post:
Slow solution and project load time at startup in BIDS. I suppose this could be irritating from time to time. But if you keep BIDS open all day, that seems like a once-a-day cost.
Slow solution and project load time when you get latest solution definition from your version control system. Again, I suppose this could be irritating from time to time, but how frequently do you need to refresh the whole solution? If you break the solution into separate projects, then you only need to refresh a project. You would only need to refresh the whole solution if you want to get access to a new project within the solution.
What do you mean by "one huge XML file"? The solution file is an XML file that keeps track of the projects. Each project file is an XML file that keeps track of its SSIS packages. So if you have 1,000 SSIS packages evenly distributed across 10 projects in 1 solution, then each file would have no more than 100 objects to track. I can tell you from experience that I've had Reporting Services projects with more RDL files than this and it only took seconds to load the solution properly in BIDS. And as @revelator pointed out, the actual SSIS packages are their own individual XML files. Any version control system should track each of these as separate files and won't combine them into "one huge XML file". If you clarify what you mean by this point, then I think you will get better help on the question.
Whether you are running one package or 1,000 packages, you won't be doing this interactively from BIDS. You will probably deploy the packages to a server first and then have the server run them. If that's the case, then you will probably need to call the packages with a SQL Server Agent job. Whether you chain the packages by making each package call another package or by having the job call each package as a separate job step, you can still track where you are in the chain with logging. If you are calling the packages with jobs, then you can track it with job steps too. I run a data warehouse that has scores of packages and I primarily rely on separating processes into jobs that each contain one or more packages. I also chain jobs with start job commands so that I can more easily monitor the performance of logical groups of loads. Also, each package shows its execution time in the job history at the step level. Furthermore, I have custom logging in each stored procedure and package that shows how many seconds and rows an individual data load or stored procedure took so that I can troubleshoot performance bottlenecks.
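To make the chaining-plus-logging idea concrete, here is a minimal sketch in Python. It is not how SQL Server Agent works internally; it only imitates a job whose steps run one package each, stop at the first failure, and log how long each step took. The function names (`run_chain`, `dtexec_runner`) and the package paths in the usage note are made up for illustration. The only real tool assumed is the `dtexec` command-line utility, called with its `/F` (file) option and treated as successful when it exits with code 0.

```python
import subprocess
import time
from typing import Callable, Iterable, List, Tuple


def dtexec_runner(package_path: str) -> int:
    """Run one package file with dtexec and return its exit code (0 = success)."""
    return subprocess.run(["dtexec", "/F", package_path]).returncode


def run_chain(
    packages: Iterable[str],
    runner: Callable[[str], int] = dtexec_runner,
    log: Callable[[str], None] = print,
) -> List[Tuple[str, int, float]]:
    """Run packages in order, like job steps, stopping at the first failure.

    Returns one (package_path, exit_code, elapsed_seconds) tuple per step run.
    """
    results = []
    for path in packages:
        started = time.monotonic()
        exit_code = runner(path)
        elapsed = time.monotonic() - started
        # One log line per step shows where the chain currently is.
        log(f"{path}: exit code {exit_code}, {elapsed:.1f}s")
        results.append((path, exit_code, elapsed))
        if exit_code != 0:
            # Later steps are skipped, as with a job step set to quit on failure.
            break
    return results
```

On a server with `dtexec` on the path, something like `run_chain(["Extract.dtsx", "Transform.dtsx", "Load.dtsx"])` would run the three hypothetical packages in order. Passing a different `runner` makes it possible to try out the chaining logic without SSIS installed.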
Whatever you do, don't rely on running packages interactively as a way to track performance! You won't get optimal performance running ETL on your machine, let alone running it with a GUI. Run packages in jobs on servers, not desktops. Interactively running packages is just there to help build and troubleshoot individual packages, not to administer daily ETL.
If you are building generic packages that change their targets and sources based on parameters, then you probably need to build a control table in a database that tracks progress. If you are simply moving data from one large system to another as a one-time event, then you are probably going to divide the load into small sets of packages and have separate jobs for each so that you can more easily manage recovering from failures. If you intend to build something that runs regularly to move data, then how could 2 days of constant running for one process even make sense? It sounds like the underlying data will change on you within 2 days...
If you are concerned about which version control system to use for managing SSIS package projects, then I can say that just about any will do. I've used Visual SourceSafe and Perforce at different companies and both have the same basic features of checking in and checking out individual packages. I'm sure just about any version control system that integrates with Visual Studio will do this for you.
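A control table of the kind mentioned above could look roughly like the following sketch. It uses Python's built-in `sqlite3` module purely as a stand-in so the example is self-contained; in a real setup the table would live in the ETL database and be written to by the packages themselves. The table name `etl_control`, its columns, and the package names in the usage note are assumptions for illustration, not something taken from the question.

```python
import sqlite3
from typing import List, Sequence


def create_control_table(conn: sqlite3.Connection) -> None:
    """Create a table holding one row per package with its latest outcome."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS etl_control (
               package_name TEXT PRIMARY KEY,
               status       TEXT NOT NULL,
               rows_loaded  INTEGER,
               seconds      REAL
           )"""
    )


def record_result(conn: sqlite3.Connection, package_name: str,
                  status: str, rows_loaded: int, seconds: float) -> None:
    """Insert or overwrite the latest result for a package."""
    conn.execute(
        "INSERT OR REPLACE INTO etl_control VALUES (?, ?, ?, ?)",
        (package_name, status, rows_loaded, seconds),
    )
    conn.commit()


def pending_packages(conn: sqlite3.Connection,
                     all_packages: Sequence[str]) -> List[str]:
    """Return the packages that have not yet succeeded, in their original order."""
    done = {
        row[0]
        for row in conn.execute(
            "SELECT package_name FROM etl_control WHERE status = 'succeeded'"
        )
    }
    return [name for name in all_packages if name not in done]
```

After a failure, calling `pending_packages` with the full list of package names (for example `["LoadCustomers", "LoadOrders", "LoadItems"]`) gives the ones that still need to run, which is what makes restarting a long load partway through manageable.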
Hope you find something useful in the above and good luck with your project.
Version control makes it possible to have multiple people developing together and working on the same project. If I am working on something, a fellow ETL developer will not be able to check it out and make changes to it until I am finished with my changes and check them back in. This addresses the common situation where one developer's project artifacts and code changes accidentally clobber those of another developer.
http://blog.sqlauthority.com/2011/08/10/sql-server-who-needs-etl-version-control/
Most ETL projects I work on use SVN as the source control repository. The best method I have found is to break each project or solution down into smaller, distinct (and often independently runnable) packages. For example, say you had a process called ManufacturingImport; this could be your project. Within it you would have a Master package, which then calls other packages as required. This means that members of the team can work on distinct packages or pieces of work, rather than everyone trying to edit the same package and getting into troublesome merge situations.