源代码控制数据库数据导入策略
因此,我得到了一个项目,并让数据库团队出售了数据库的源代码控制(奇怪吧?)无论如何,数据库已经存在,它很大,并且应用程序非常依赖于数据。开发人员在编写存储过程等时最多需要三种不同风格的数据来进行处理。
显然我可以编写数据插入脚本。
但我的问题是,您使用什么工具或策略从源代码管理构建数据库并用多个大型数据集填充它?
So I've gotten a project and got the db team sold on source control for the db (weird right?) anyway, the db already exists, it is massive, and the application is very dependent on the data. The developers need up to three different flavors of the data to work against when writing SPROCs and so on.
Obviously I could script out data inserts.
But my question is what tools or strategies do you use to build a db from source control and populate it with multiple large sets of data?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
很高兴看到您将数据库置于源代码控制之下。
我们的数据库对象位于源代码管理中,但没有数据(某些查找值除外)。为了维护开发上的数据,我们通过恢复最新的产品备份来刷新它,然后针对任何数据库更改重新运行脚本。如果我们正在做的事情需要特殊数据(例如不在产品或测试登录中的新查找值),我们也有一个脚本,它是源代码控制的一部分,并且将同时运行。不过,您不想用脚本输出所有数据,因为通过脚本重新创建 1000 万条记录会非常耗时(如果您有 1000 万条记录,您当然不希望开发人员针对具有 10 条测试记录的数据库进行开发! )。恢复产品数据要快得多。
由于我们所有的部署都是通过源代码控制的脚本完成的,因此我们在让人们编写他们需要的脚本方面没有问题。希望你也不会。当我们第一次开始时(以及当开发人员可以自己部署到产品时),我们实际上必须经历几次并删除不在源代码控制中的任何对象。我们很快就学会了将所有数据库对象放入源代码管理中。
Good to see you put your database under source control.
We have our database objects in source control but not data (except for some lookup values). To maintian the data on dev, we refresh it by restoring the latest prod backup, then rerunning the scripts for any database changes. If what we were doing would require special data (say new lookup values that aren't on prod or test logins), we have a script for that as well which is part of source control and which would be run at the same time. You wouldn't want to script out all the data though as it would be very timeconsuming to recreate 10 million records through a script (And if you have 10 million records you certainly don't want developers developing against a database with ten test records!). Restoring prod data is much faster.
Since all our deployments are done only through source controlled scripts, we don't have issues getting people to script what they need. Hopefully you won't either. When we first started (and back when dev coudl do their own deployments to prod) we had to actually go through a few times and delete any objects that weren't in source control. We learned very quickly to put all db objects in source control.
通常,我们仅将
.sql
文件放入源代码控制中以(重新)构建架构。然后,我们将能够读取生产或集成数据库的脚本放入源代码控制中,以便提取并填充数据库中由先前
.sql
执行产生的一组相关数据。这个想法是通过一个足够强大的脚本来获取最新的数据,以便从数据库中读取它们,而数据库的版本并不总是与正在构建的版本相同。 (但实际上,差异从来没有那么大,并且数据可以轻松读取)。
Usually, we only put in source control the
.sql
files for (re-) building the schema.Then we put in source control the script able to read a production or integration database in order to extract and populate a relevant set of data in the database resulting from the previous
.sql
execution.The idea is to get most recent data with a script robust enough to read them from a database which is not always at the same version than the one being build. (in reality though, the difference is never that big, and the data can easily be read).
最好将其作为两个单独的主题来处理。一方面,您需要一个可靠且一致的数据库模式(表、索引、vie、过程、函数以及查找值和系统所需的任何不变的“静态”数据),并且您希望对其进行版本控制这样您就可以跟踪随时间(以及由谁)发生的更改,还可以控制更改何时应用到哪些数据库实例。这个问题之前的帖子已经很好地涵盖了这个主题。
另一方面,您将需要填充有数据的数据库,您可以根据这些数据来测试和开发新代码。定义和加载此类数据与定义和加载保存数据的结构不同。虽然通过源代码控制管理数据库定义是一个很容易解决的问题,但在过去的很多年里,我从未听说过有一个同样简单(嗯,相对简单)的解决方案来解决数据问题。问题的几个方面包括:
确保有足够的数据。每个表添加 10-20 行很容易,但如果您的实时数据库将包含数百万行或更多行,您就无法预测性能。
一个快速而简单的解决方案是获取最新生产数据库的副本,使用最近的更改进行更新,然后就可以开始了。如果开发环境没有 SAN 来托管您所支持的多 TB 生产数据的副本,这可能会很棘手
同样,SOX 和/或 HIPAA 审核员可能不需要额外的副本潜在的机密数据位于不太安全的开发服务器上(在不太安全的开发人员面前——毕竟我们是一群狡猾的人)。在将敏感数据提供给开发人员之前,您可能需要对其进行加扰或随机化……这意味着需要一个临时的“加扰器”过程来清理数据。 (也许为所有这些 TB 提供另一个 SAN?)
在某些情况下,某个部门或其他部门为您提供一组正确的、连贯的、协调的数据来进行开发是理想的——他们制作的东西涵盖所有可能的情况,并且他们可以用于自己的测试(知道什么进入,他们知道应该出现什么,并且可以检查它)。当然,创建这样一组数据的努力是巨大的,说服非 IT 团体提供这样的数据集在政治上可能是不可能的。但这是一个美好的梦想。
当然数据也会改变。当你在开发过程中对副本进行了一周、一个月、一个季度的工作后,最终不可避免地你会发现生产数据不再“看起来”像那样了——使用模式将会改变,平均重要的值将会发生变化,您的所有日期都将变得陈旧且无关紧要...无论如何,您将需要重新获取新数据。
这是一个我从未听说过的丑陋问题,没有简单的解决方案。一种可能有帮助的可能性是:我记得过去读过一些产品的文章,这些产品可用于用虚构的但统计相关的数据“填充”数据库。您指定诸如“此表中的 10,000 行,此 col 是身份主键,此tinyint 范围从 1-10 且分布均匀,此 varchar 范围从 6 到 30 个字符,可能有 2% 的重复项”之类的内容。这样的事情可能是无价的,但这完全取决于你所处的环境。
This is best handled as two separate subjects. On the one hand, you want a solid and consistant database schema (tables, indexes, vies, procedures, functions, and also lookup values and any non-changing "static" data required by your system), and you want version control over that so you can track what changes over time (and by who) and also can control when the changes get applied to which database instances. Prior posts to this question have covered this subject well.
On the other hand, you will need the database populated with data against which you can test and devlop new code. Defining and loading such data is not the same as defining and loading the structures that will hold it. While managing database definitions via source control can be a readily solved problem, over the past many years I have never heard of an equally simple (well, relatively simple) solution for addressing the data problem. Aspects of the problem include:
Make sure there's enough data. Adding 10-20 rows per table is easy, but you can't possible predict performance if your live databases will contain millions of rows or more.
A quick and easy solution is to get a copy of the lastest Production database, update it with the recent changes, and off you go. This can be tricky if the development environment doesn't have a SAN upon which to host a copy of the multi-TB of Production data you're supporting
Similarly, the SOX and/or HIPAA auditors might not want extra copies of potentially confidential data sitting on not-so-secure development servers (in front of not-so-secure developers--we are a shifty bunch, after all). You might need to scramble or randomize sensitive data before making it available to developers... which implies an interim "scrambler" process to sanitize the data. (Perhaps another SAN for all those TB?)
In some situations, it'd be ideal for some department or other to provide you with a correct, coherent, and coordinated set of data to do development against -- something they make up to cover all likely possible situations, and that they could use for testing on their side (knowing what goes in, they know what should be coming out, and can check for it). Of course the effort to create such a set of data is substantial, and convincing non-IT groups to provide such data sets may be politically impossible. But it's a nice dream.
And of course the data changes. After you've worked the copy over in development for a week, a month, a quarter, eventually and inevitably you will discover that the Production data doesn't "look" like that any more -- usage patterns will have changed, averages of significant values will drift, all your dates will be old and irrelevant... whatever, you'll need to get fresh data all over again.
It's an ugly problem with no simple solution that I've ever heard of. One possibility that could help: I recall reading articles in the past of products that can be used to "stuff" a database with made up yet statistically relevant data. You specify things like "10,000 rows in this table, this col is an identity primary key, this tinyint ranges from 1-10 with equal distribution, this varchar ranges from 6 to 30 characters with maybe 2% duplicates", and so forth. Something like this might be invaluable, but it all depends upon t he circumstances in which you find yourself.
看一下 Visual Studio Team System,数据库版本 - 无论是带有 GDR 2 下载的 2008 年,还是 2010 年。它可以处理模式版本控制,完全集成到源代码控制中,并且可以处理测试数据生成(如随机名称等)。
我个人喜欢它 - 我使用 Management Studio 进行开发,然后启动 Visual Studio 并将更改同步到项目,然后从项目同步到生产。
我将其用于我的发展。我不会编写生产数据的脚本 - 我的主数据库现在大约有 300GB,并且我有一个接近 5 亿行的表。我有一个开发服务器,有时会在需要时加载数据副本。开发人员针对小型测试数据或开发服务器(这里人不多)进行工作。
初始数据由存储过程或特殊上传/验证脚本维护,这些脚本作为流程的一部分运行并检查查找表等元素。
Have a look at Visual Studio Team System, database edition - either the 2008 with the GDR 2 download, or 2010. It can handle schema versioning with full integration into source control, and can handle test data generation (like random names etc.).
I personally like it - I dot development using Management Studio, then fire up Visual Studio and sync the changes down to the project, from where they get then synced up to production.
I use that for my development. I do not script out production data - my maiin database has about 300gb right now, and I have a table approaching half a billion rows. I have a development server a copy of the data is sometimes loaded when needed. Developers work against smal ltest data or the dev server (not many people here).
Initial data is being maintained by stored procedures or special upload / validation scripts that run as part of the process and check elements like lookup tables.
我过去使用过多种策略,但这里有一个有效的策略:
如果您正在处理正在生产的系统并且具有无法擦除的数据:
I have used several strategies in the past, but here is one that works:
If you are dealing with a system that is in production and has data that you can't wipe out:
对于保存配置类型数据(非事务性)的表,我们使用 Excel。我们将 VBA 脚本插入到电子表格中来处理保存事件,并让它在保存时输出 sql 插入语句。业务分析师和客户喜欢 Excel,因此这项技术对他们来说非常有用,可以为我们提供预定义的场景供我们测试。我们通常对输出 .sql 文件进行源代码控制,以便我们可以使用它来加载数据以进行自动构建,并且 Excel 文件会进入 Team SharePoint 网站。
For tables that hold configuration type data (not transactional) we use Excel. We insert a VBA script into the spreadsheet to handle the save event and have it spit out sql insert statements upon save. Business analysts and customers love Excel so this technique works great for them to give us predefined scenarios for us to test with. We usually Source Control the output .sql file so we can use it to load data for automated builds and the Excel file goes in the Team SharePoint site.