How do I manipulate 500k records in SQL Server 2005 efficiently?

I am getting a large text file of updated information from a customer that contains updates for 500,000 users. However, as I process this file, I often run into SQL Server timeout errors.

Here's the process I follow in my VB application that processes the data (in general):

  1. Delete all records from the temporary table (to remove last month's data), e.g. DELETE FROM tempTable
  2. Rip the text file into the temp table
  3. Fill in extra information into the temp table, such as their organization_id, their user_id, group_code, etc.
  4. Update the data in the real tables based on the data computed in the temp table

The problem is that I often run commands like UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id) and these commands frequently time out. I have tried bumping the timeouts up to as far as 10 minutes, but they still fail. Now, I realize that 500k rows is no small number of rows to manipulate, but I would think that a database purported to be able to handle millions and millions of rows should be able to cope with 500k pretty easily. Am I doing something wrong with how I am going about processing this data?
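
For reference, here is a minimal T-SQL sketch of steps 1, 2, and 4 above. The file-load details and the target table/column names (users, email) are assumptions used only for illustration:

-- Step 1: clear out last month's staging data
-- (TRUNCATE is usually much faster than DELETE when every row is removed,
--  but it requires that no foreign keys reference tempTable)
TRUNCATE TABLE tempTable;

-- Step 2: bulk copy the text file into the staging table
-- (e.g. BULK INSERT or the bcp utility)

-- Step 4: push the prepared values into the real table, matching on the
-- user_id resolved in step 3 ("users" and "email" are assumed names)
UPDATE u
    SET u.email = t.email
    FROM users u
        INNER JOIN tempTable t ON t.user_id = u.user_id;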

Please help. Any and all suggestions welcome.

Comments (6)

暮凉 2024-09-02 22:44:30

Subqueries like the one you give us in the question:

UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id) 

are only good for one row at a time, so you must be looping. Think set-based:

UPDATE t
    SET user_id = u.user_id
    FROM tempTable          t
        inner join myUsers  u ON t.external_id=u.external_id

and remove your loops; this will update all rows in one statement and be significantly faster!

我不在是我 2024-09-02 22:44:30

Needs more information. I am manipulating 3-4 million rows in a 150 million row table regularly, and I do NOT consider this a lot of data. I have a "products" table that contains about 8 million entries - including full text search. No problems either.

Can you elaborate on your hardware? I assume "normal desktop PC" or "low-end server", both with an absolutely non-optimal disk layout, and thus tons of IO problems - on updates.

陪我终i 2024-09-02 22:44:30

Make sure you have indexes on the tables you are selecting from. In your example UPDATE command, you select user_id from the myUsers table. Do you have an index that includes the user_id column on the myUsers table? The downside of indexes is that they increase the time for inserts/updates, so make sure you don't have indexes on the tables you are trying to update. If the tables you are trying to update do have indexes, consider dropping them and then rebuilding them after your import.

Finally, run your queries in SQL Server Management Studio and have a look at the execution plan to see how the query is being executed. Look for things like table scans to see where you might be able to optimize.
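
A sketch of the kind of index this suggests, assuming the lookup in the question's UPDATE is by external_id (the index name is made up for illustration):

-- Lets the lookup WHERE external_id = tempTable.external_id seek instead of
-- scanning myUsers; INCLUDE makes the index cover the selected user_id column.
CREATE NONCLUSTERED INDEX IX_myUsers_external_id
    ON myUsers (external_id)
    INCLUDE (user_id);

In Management Studio, "Include Actual Execution Plan" (Ctrl+M) will show whether the UPDATE still falls back to a table scan.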

爱格式化 2024-09-02 22:44:30

Look at KM's answer (the set-based UPDATE above) and don't forget about indexes and primary keys.

漆黑的白昼 2024-09-02 22:44:30

Are you indexing your temp table after importing the data?

temp_table.external_id should definitely have an index since it is in the where clause.
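
For example, run after the bulk load rather than before it, so the load itself is not slowed down (the index name is made up):

-- Index the lookup column once the staging data is in place, then refresh statistics
CREATE NONCLUSTERED INDEX IX_tempTable_external_id
    ON tempTable (external_id);
UPDATE STATISTICS tempTable;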

一曲爱恨情仇 2024-09-02 22:44:30

There are more efficient ways of importing large blocks of data. Look in SQL Server Books Online under BCP (bulk copy program).
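
A rough sketch of such a load using BULK INSERT (the bcp command-line utility is the other standard option); the file path, delimiters, and batch size are assumptions:

-- TABLOCK allows a bulk-optimized, minimally logged load under the SIMPLE or
-- BULK_LOGGED recovery model; BATCHSIZE commits the import in chunks so a
-- failure does not roll back the entire file.
BULK INSERT tempTable
    FROM 'C:\data\user_updates.txt'
    WITH (FIELDTERMINATOR = '\t',
          ROWTERMINATOR   = '\n',
          TABLOCK,
          BATCHSIZE = 50000);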
