Best practices for inserting/updating large amounts of data in SQL Server 2008

I'm building a system for updating large amounts of data through various CSV feeds. Normally I would just loop through each row in the feed, do a select query to check if the item already exists, and insert/update the item depending on whether it exists or not.

I feel this method isn't very scalable and could hammer the server on larger feeds. My solution is to loop through the items as normal, but store them in memory. Then, for every 100 or so items, do a select on those 100 items and get a list of the existing items in the database that match. Then concatenate the insert/update statements together and run them against the database. This would essentially cut down on the trips to the database.

Is this a scalable enough solution, and are there any example tutorials on importing large feeds into a production environment?

Thanks

我们只是彼此的过ke 2024-08-29 16:25:01

Seeing that you're using SQL Server 2008, I would recommend this approach:

  • first bulkcopy your CSV files into a staging table
  • update your target table from that staging table using the MERGE command

Check out the MSDN docs and a great blog post on how to use the MERGE command.
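
For the first bullet, a minimal sketch of what the staging-table load could look like (the table name, file path and column layout are all made up for illustration; the feed is assumed to be comma-delimited with a header row):

-- hypothetical staging table mirroring the CSV layout
CREATE TABLE dbo.StagingItems
(
    ItemCode    varchar(50)    NOT NULL,
    Description nvarchar(200)  NULL,
    Price       decimal(10, 2) NULL
);

-- load the feed; FIRSTROW = 2 skips the header line
BULK INSERT dbo.StagingItems
FROM 'C:\feeds\items.csv'
WITH
(
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2
);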

Basically, you create a link between your actual data table and the staging table on a common criteria (e.g. a common primary key), and then you can define what to do when

  • the rows match, e.g. the row exists in both the source and the target table --> typically you'd either update some fields, or just ignore it altogether
  • the row from the source doesn't exist in the target --> typically a case for an INSERT

You would have a MERGE statement something like this:

MERGE TargetTable AS t
USING SourceTable AS src
ON t.PrimaryKey = src.PrimaryKey

WHEN NOT MATCHED THEN
  INSERT (list of fields)
  VALUES (list of values)

WHEN MATCHED THEN
  UPDATE
    SET (list of SET statements)
;

Of course, the ON clause can be much more involved if needed. And of course, your WHEN statements can also be more complex, e.g.

WHEN MATCHED AND (some other condition) THEN ......

and so forth.
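
Putting those pieces together with the hypothetical staging table from the sketch above (the target table and its columns are equally made up), a filled-in version might look like:

MERGE dbo.Items AS t
USING dbo.StagingItems AS src
    ON t.ItemCode = src.ItemCode

-- only touch rows where something actually changed
WHEN MATCHED AND (t.Price <> src.Price OR t.Description <> src.Description) THEN
    UPDATE SET t.Price       = src.Price,
               t.Description = src.Description

WHEN NOT MATCHED BY TARGET THEN
    INSERT (ItemCode, Description, Price)
    VALUES (src.ItemCode, src.Description, src.Price)
;

The simple <> comparison above ignores NULL handling; with nullable feed columns you would probably wrap the comparison in ISNULL, or just use an unconditional WHEN MATCHED.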

MERGE is a very powerful and very useful new command in SQL Server 2008 - use it, if you can!

强辩 2024-08-29 16:25:01

Your way is the worst possible solution. In general, you should not think in terms of looping through records individually. We used to have a company-built import tool that looped through records; it would take 18-20 hours to load a file with over a million records (something that wasn't a frequent occurrence when the tool was built, but which happens many times a day now).

I see two options:
First, use bulk insert to load into a staging table and do whatever clean-up you need on that table. How are you determining whether the record already exists? You should be able to build a set-based update by joining to the staging table on the fields that determine the match. Often I have added a column to my staging table for the id of the record it matches, populated it through a query, and then done the update. Then you insert the records which don't have a corresponding id. If you have too many records to do it all at once, you may want to run in batches (which, yes, is a loop), but make the batches considerably larger than one record at a time (I usually start with 2000 and then, based on how long that takes, decide whether I can do more or less per batch).
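
A rough T-SQL sketch of that pattern, with made-up table and column names (dbo.Items as the target keyed on ItemCode, and a staging table already loaded by bulk insert):

-- extra column on the staging table to hold the id of the matching target row
ALTER TABLE dbo.StagingItems ADD MatchId int NULL;

-- step 1: populate the match column with a query
UPDATE s
SET    s.MatchId = t.ItemId
FROM   dbo.StagingItems AS s
JOIN   dbo.Items        AS t ON t.ItemCode = s.ItemCode;

-- step 2: set-based update of the rows that already exist
UPDATE t
SET    t.Description = s.Description,
       t.Price       = s.Price
FROM   dbo.Items        AS t
JOIN   dbo.StagingItems AS s ON s.MatchId = t.ItemId;

-- step 3: insert the rows that have no corresponding id
INSERT INTO dbo.Items (ItemCode, Description, Price)
SELECT s.ItemCode, s.Description, s.Price
FROM   dbo.StagingItems AS s
WHERE  s.MatchId IS NULL;

If a single statement is too much at once, each step can be wrapped in a loop that works through the staging table in chunks (for example UPDATE TOP (2000) with a flag or key range marking which rows are done), keeping the batches far larger than one row at a time.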

I think 2008 also has a merge statement, but I have not yet had a chance to use it. Look it up in Books Online.

The alternative is to use SSIS which is optimized for speed. SSIS is a complex thing though and the learning curve is steep.

孤独难免 2024-08-29 16:25:01

One way is to load your CSV into a DataTable (or more likely a DataReader) and then batch-slam the results in using SqlBulkCopy -

http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx

It's pretty efficient and you can do some column mapping. Tip - when you map columns using SqlBulkCopy, they are case sensitive.

眼睛会笑 2024-08-29 16:25:01

Another approach would be to write a .NET stored procedure on the server to operate on the entire file...

Only if you need more control than Kris Krause's solution though - I'm a big fan of keeping it simple (and reusable) where we can...

永言不败 2024-08-29 16:25:01

Do you need to be rolling your own here at all? Would it be possible to provide the data in such a way that SQL Server can use Bulk Import to load it in, and then deal with duplicates in the database once the import is complete?

When it comes to heavy lifting with a lot of data my experience tends to be that working in the database as much as possible is much quicker and less resource intensive.
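
One possible reading of "deal with duplicates once the import is complete", assuming the feed was bulk-imported into a staging table and repeated keys within the feed should collapse to the most recent row (the table and column names, including the LoadedAt ordering column, are hypothetical):

-- keep only the latest staged row per ItemCode, then update/insert from what remains
;WITH Ranked AS
(
    SELECT ItemCode,
           ROW_NUMBER() OVER (PARTITION BY ItemCode ORDER BY LoadedAt DESC) AS rn
    FROM   dbo.StagingItems
)
DELETE FROM Ranked
WHERE  rn > 1;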
