Getting lots of data into a SQL Server (Express) database at deployment time
For most database-backed projects I've worked on, there is a need to get "startup" or test data into the database before deploying the project. Examples of startup data: a table that lists all the countries in the world or a table that lists a bunch of colors that will be used to populate a color palette.
I've been using a system where I store all my startup data in an Excel spreadsheet (with one table per worksheet), then I have a utility script in SQL that (1) creates the database, (2) creates the schemas, (3) creates the tables (including primary and foreign keys), (4) connects to the spreadsheet as a linked server, and (5) inserts all the data into the tables.
I mostly like this system. I find it very easy to lay out columns in Excel, verify foreign key relationships using simple lookup functions, perform concatenation operations, copy in data from web tables or other spreadsheets, etc. One major disadvantage of this system is the need to sync up the columns in my worksheets any time I change a table definition.
I've been going through some tutorials to learn new .NET technologies or design patterns, and I've noticed that these typically involve using Visual Studio to create the database and add tables (rather than scripts), and the data is typically entered using the built-in designer. This has me wondering if maybe the way I'm doing it is not the most efficient or maintainable.
Questions
In general, do you find it preferable to build your whole database via scripts or a GUI designer, such as SSMSE or Visual Studio?
What method do you recommend for populating your database with startup or test data and why?
Clarification
Judging by the answers so far, I think I should clarify something. Assume that I have a significant amount of data (hundreds or thousands of rows) that needs to find its way into the database. This data could be sourced from various places, such as text files, spreadsheets, web tables, etc. I've received several suggestions to script this process using INSERT statements, but is this really viable when you're talking about a lot of data?
Which leads me to...
New questions
How would you write a SQL script to take the country data on this page and insert it into the database?
With Excel, I could just copy/paste the table into a worksheet and run my utility script, and I'd basically be done.
What if you later realized you needed a new column, CapitalCity?
With Excel, I could take that information from this page, paste it into Excel, and with a quick text-to-column manipulation, I'd have the data in the format I need.
I honestly didn't write this question to defend Excel as the best way or even a good way to get data into a database, but the answers so far don't seem to be addressing my main concern: how to get all this data into your database. Writing a script with hundreds of INSERT statements by hand would be extremely time-consuming and error-prone. Somehow, this script needs to be machine generated, but how?
Comments (5)
I think your current process is fine for seeding the database with initial data. It's simple, easy to maintain, and works for you. If you've got a good database design with adequate constraints then it doesn't really matter how you seed the initial data. You could use an intermediate tool to generate scripts but why bother?
SSIS has a steep learning curve, doesn't work well with source control (impossible to tell what changed between versions), and is very finicky about type conversions from Excel. There's also an issue with how many rows it reads ahead to determine the data type -- you're in deep trouble if your first x rows contain numbers stored as text.
1) I prefer to use scripts for several reasons.
• Scripts are easy to modify, and when I get ready to deploy my application to a production environment, I already have the scripts written, so I'm all set.
• If I need to deploy my database to a different platform (like Oracle or MySQL) then it's easy to make minor modifications to the scripts to work on the target database.
• With scripts, I'm not dependent on a tool like Visual Studio to build and maintain the database.
2) I like good old-fashioned insert statements using a script. Again, at deployment time scripts are your best friend. At our shop, when we deploy our applications we have to have scripts ready for the DBAs to run, as that's what they expect.
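For hundreds or thousands of rows, the insert script does not have to be typed by hand. Here is a minimal sketch, assuming the startup data has been exported to CSV with a header row. The table name `dbo.Country`, the sample rows, and the helper names `sql_literal` and `generate_inserts` are all made up for illustration. Every value is written as a Unicode string literal and empty cells become NULL, which is a simplification: SQL Server will convert string literals to numeric columns implicitly, but you may prefer to emit typed literals.

```python
import csv
import io


def sql_literal(value):
    """Render a CSV field as a T-SQL literal: NULL for empty, N'...' otherwise."""
    if value is None or value == "":
        return "NULL"
    # Double any single quotes so the value cannot terminate the string literal.
    return "N'" + value.replace("'", "''") + "'"


def generate_inserts(csv_text, table):
    """Yield one INSERT statement per data row; the header row supplies column names."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Column and table names are assumed to be trusted (they come from your own schema).
    columns = ", ".join("[" + name.strip() + "]" for name in header)
    for row in reader:
        if not row:  # skip blank lines in the export
            continue
        values = ", ".join(sql_literal(field.strip()) for field in row)
        yield f"INSERT INTO {table} ({columns}) VALUES ({values});"


# Hypothetical sample data standing in for an exported worksheet.
SAMPLE_CSV = """CountryCode,CountryName,CapitalCity
FR,France,Paris
CI,Cote d'Ivoire,Yamoussoukro
AQ,Antarctica,
"""

if __name__ == "__main__":
    for statement in generate_inserts(SAMPLE_CSV, "dbo.Country"):
        print(statement)
```

Redirect the output to a `.sql` file and check that file into source control next to the schema scripts. When the source data changes, regenerate the file and the diff shows exactly which rows changed.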
I just find that scripts are simple, easy to maintain, and the "least common denominator" when it comes to creating a database and loading data into it. By least common denominator, I mean that the majority of people (e.g. DBAs, other people in your shop who might not have Visual Studio) will be able to use them without any trouble.
The other thing that's important with scripts is that they force you to learn SQL and more specifically DDL (data definition language). While the hand-holding GUI tools are nice, there's no substitute for taking the time to learn SQL and DDL inside out. I've found that those skills are invaluable to have in almost any shop.
Frankly, I find the concept of using Excel here a bit scary. It obviously works, but it's creating a dependency on an ad-hoc data source that won't be resolved until much later. Last thing you want is to be in a mad rush to deploy a database and find out that the Excel file is mangled, or worse, missing entirely. I suppose the severity of this would vary from company to company as a function of risk tolerance, but I would be actively seeking to remove Excel from the equation, or at least remove it as a permanent fixture.
I always use scripts to create databases, because scripts are portable and repeatable - you can use (almost) the same script to create a development database, a QA database, a UAT database, and a production database. For this reason it's equally important to use scripts to modify existing databases.
I also always use a script to create bootstrap data (AKA startup data), and there's a very important reason for this: there's usually more scripting to be done afterward. Or at least there should be. Bootstrap data is almost invariably read-only, and as such, you should be placing it on a read-only filegroup to improve performance and prevent accidental changes. So you'll generally need to script the data first, then make the filegroup read-only.
On a more philosophical level, though, if this startup data is required for the database to work properly - and most of the time, it is - then you really ought to consider it part of the data definition itself, the metadata. For that reason, I don't think it's appropriate to have the data defined anywhere but in the same script or set of scripts that you use to create the database itself.
Test data is a little different, but in my experience you're usually trying to auto-generate that data in some fashion, which makes it even more important to use a script. You don't want to have to manually maintain an ad-hoc database of millions of rows for testing purposes.
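As a rough illustration of auto-generating test data from a script, here is a small sketch. The table `dbo.TestCustomer`, its columns, and the value lists are hypothetical. The one deliberate design choice is the fixed seed, so every run produces the same rows and a failing test can be reproduced.

```python
import random


def generate_test_customers(count, seed=42):
    """Return `count` reproducible fake rows; the same seed always yields the same data."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    first_names = ["Ada", "Grace", "Linus", "Edsger", "Barbara"]
    country_codes = ["FR", "DE", "JP", "BR", "US"]
    rows = []
    for customer_id in range(1, count + 1):
        rows.append((
            customer_id,
            rng.choice(first_names),
            rng.choice(country_codes),
            rng.randint(0, 5000),
        ))
    return rows


def to_insert(row, table="dbo.TestCustomer"):
    """Format one generated row as an INSERT statement.

    The sample values contain no quotes, so no escaping is done here.
    """
    customer_id, name, country, credit = row
    return (f"INSERT INTO {table} (CustomerId, FirstName, CountryCode, CreditLimit) "
            f"VALUES ({customer_id}, N'{name}', N'{country}', {credit});")


if __name__ == "__main__":
    for row in generate_test_customers(5):
        print(to_insert(row))
```

Only the generator needs to live in source control. The rows themselves can be regenerated at any size whenever a test database is rebuilt.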
If your problem is that the test or startup data comes from an external source - a web page, a CSV file, etc. - then I would handle this with an actual "configuration database." This way you don't have to validate references with VLOOKUPs as in Excel; you can actually enforce them.
It may sound like a lot of overhead, but in practice the effort is minimal. You set up your configuration database once, create an SSIS package once, and refresh the config data maybe once every few months or maybe never (this is the part you're already doing, and this part will become less work). Once that "setup" is out of the way, it's really just a few minutes to generate the script, which you can then use on all copies of the main database.
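To make the "enforce instead of VLOOKUP" idea concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for the configuration database. That substitution is purely for illustration; the setup described above uses SQL Server and SSIS. The `Country` and `City` tables and the function names are hypothetical. Rows with a bad reference are rejected by the foreign key, and the seed script is then generated from whatever passed validation.

```python
import sqlite3


def build_config_db():
    """Create the stand-in configuration database with a real foreign key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.execute(
        "CREATE TABLE Country (CountryCode TEXT PRIMARY KEY, CountryName TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE City ("
        " CityName TEXT PRIMARY KEY,"
        " CountryCode TEXT NOT NULL REFERENCES Country(CountryCode))"
    )
    return conn


def load(conn, countries, cities):
    """Insert staged rows; return the city rows rejected by the foreign key."""
    conn.executemany("INSERT INTO Country VALUES (?, ?)", countries)
    rejected = []
    for city in cities:
        try:
            conn.execute("INSERT INTO City VALUES (?, ?)", city)
        except sqlite3.IntegrityError:
            rejected.append(city)
    conn.commit()
    return rejected


def export_inserts(conn):
    """Generate the seed script from the validated data (INSERT lines only)."""
    return [line for line in conn.iterdump() if line.startswith("INSERT INTO")]


if __name__ == "__main__":
    conn = build_config_db()
    bad = load(
        conn,
        [("FR", "France"), ("JP", "Japan")],
        [("Paris", "FR"), ("Tokyo", "JP"), ("Atlantis", "ZZ")],
    )
    print("rejected:", bad)
    print("\n".join(export_inserts(conn)))
```

The exported statements are in SQLite's dialect, so treat them as a demonstration of the workflow rather than something to run against SQL Server unchanged.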
Since I use an object-relational mapper (Hibernate, there is also a .NET version), I prefer to generate such data in my programming language. The ORM then takes care of writing things into the database. I don't have to worry about changing column names in the data because I need to fix the mapping anyway. If refactoring is involved, it usually takes care of the startup/test data also.
Excel is an unnecessary component of this process.
Script the current version of the database components that you want to reuse, and add the script to your source control system. When you need to make changes in the future, either modify the entities in the database and regenerate the script, or modify the script and regenerate the database.
Avoid mixing Visual Studio's db designer and Excel, as they only add complexity. Scripts and SQL Server Management Studio are your friends.