Efficient way to alter a 100GB table
We have a number of databases which store tens to hundreds of gigabytes of data in one of their tables. It contains image data. The problem is that a lot of these databases were created improperly. Basically, the primary key isn't actually a primary key: they were created with a unique index on a nullable column. And some of them have an int as a primary key instead of a bigint.
So we've been slowly going through and fixing these databases. They run on SQL Server 2000 through SQL Server 2008, although most of the ones with primary key issues are on SQL Server 2000. The problem is, we don't want to lock down the database for a whole day while it converts the table. We've gone through several strategies:
1. Tell SQL Server to directly change the column type. This locks the table until it's complete, and in many cases, after leaving it overnight, it still wasn't finished. (See the sketch after this list.)
2. Insert all the images into a new table in one go. This was more easily interrupted, but the whole table basically gets written to the log file in the process.
3. Insert 100 rows at a time where the rows don't exist in the target table. The upside is that they can continue using the database while this is going on (with a big performance hit), that it can be stopped and restarted arbitrarily at any point, and that it prevents the 100GB+ log files. This is what we're currently doing, but finding the top 100 rows that don't exist gets really, really slow as the target table gets bigger and bigger. UPDATE STATISTICS and DBCC INDEXDEFRAG help considerably, but in the most recent attempt we got to the point where even 100 images at a time would sit there, not responding. The query we use is:
INSERT INTO %s
SELECT TOP 100 Source.*
FROM %s AS Source WITH (NOLOCK)
LEFT OUTER JOIN %s AS Target WITH (NOLOCK)
    ON Source.DocumentID = Target.DocumentID
WHERE Target.DocumentID IS NULL
ORDER BY Source.DocumentID
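For reference, the "directly change the column type" in strategy 1 is a single ALTER. A minimal sketch, assuming the table is named Documents, the key column is DocumentID, and no existing index depends on the column (both names are placeholders):

-- Strategy 1: widen the key column in place. A change like
-- int -> bigint rewrites every row and holds a table lock until
-- it completes, which is why it never finished overnight.
ALTER TABLE Documents
    ALTER COLUMN DocumentID bigint NOT NULL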
So the question is: is there an option which can copy bulk data in an efficient and resumable manner? It doesn't have to be 100% accurate; as long as it does 99% of the work, we can always go back and fix any discrepancies at the end.
Comments (1)
The join is the problem. Don't do that. Just loop through your current table using some reasonable interval, using the current clustered index. Something like:
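A minimal sketch of that loop, assuming DocumentID is the clustered key and the corrected destination table already exists; Documents and NewDocuments are placeholder names:

DECLARE @minID int, @maxID int, @batchSize int
SET @batchSize = 1000
SELECT @minID = MIN(DocumentID), @maxID = MAX(DocumentID)
FROM Documents

-- To resume after an interruption, seed @minID from the destination
-- instead: SELECT @minID = ISNULL(MAX(DocumentID), 0) + 1 FROM NewDocuments

WHILE @minID <= @maxID
BEGIN
    -- Copy one contiguous range of the clustered key. Each INSERT is
    -- its own transaction, so a failure only loses the current batch.
    -- Assumes both tables have the same column layout.
    INSERT INTO NewDocuments
    SELECT *
    FROM Documents WITH (NOLOCK)
    WHERE DocumentID BETWEEN @minID AND @minID + @batchSize - 1

    -- BETWEEN is inclusive at both ends, so start the next range one
    -- past the previous endpoint to avoid copying any row twice.
    SET @minID = @minID + @batchSize
END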
Note that for best speed, remove all indexes (including the clustered index) from the destination table, then add the indexes once all rows have been inserted.
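For example, once the loop finishes you would put the real key back in one pass, which is far cheaper than maintaining the index during 100GB of inserts (the constraint name here is a placeholder):

-- Add the proper bigint primary key after the copy completes.
ALTER TABLE NewDocuments
    ADD CONSTRAINT PK_NewDocuments PRIMARY KEY CLUSTERED (DocumentID)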
EDIT: changed the range interval to prevent an overlap (since BETWEEN includes the endpoints)
One final clarification: the overall point of my sample script is that you simply want to walk through your current records in some reasonable order, and put them all into the new table in batches. There's no reason to continue checking the destination table each time, since you should already know what you've put there, and what is still left. Most of the time, it makes sense to use the clustered index (if there is one), since this means it can walk through the physical order of the table without doing bookmark lookups. If the table has no cluster, then just use whatever makes the most sense (your PK, probably).