更新大型表的主表和子表的主键
我有一个相当大的数据库,其中有一个主表,该主表以单列 GUID(自定义 GUID 算法)作为主键,还有 8 个与该 GUID 列具有外键关系的子表。 所有表大约有 3-800 万条记录。 这些表都没有任何 BLOB/CLOB/TEXT 或任何其他奇特的数据类型,只有普通数字、varchar、日期和时间戳(每个表中大约有 15-45 列)。 除主键和外键外,没有分区或其他索引。
现在,自定义 GUID 算法已更改,尽管没有冲突,但我想迁移所有旧数据以使用通过新算法生成的 GUID。 无需更改其他列。 第一要务是数据完整性,性能是次要的。
我能想到的一些可能的解决方案是(你可能会注意到它们都只围绕一个想法)
- 添加新列 ngu_id 并填充新的 gu_id; 禁用约束; 将 ngu_id 更新为 gu_id 的子表; 重命名 ngu_id->gu_id; 重新启用约束
- 从子表中读取一条主记录及其依赖的子记录; 使用新的 gu_id 插入到同一个表中; 删除所有具有旧 gu_ids
- 删除约束的记录; 向主表添加触发器,以便更新所有子表; 开始用新的 gu_id 更新旧的 gu_id; 重新启用约束
- 向主表添加触发器,以便更新所有子表; 开始用新的 gu_id 更新旧的 gu_id
- 在所有主表和子表上创建新列 ngu_ids; 在 ngu_id 列上创建外键约束; 主表添加更新触发器,将值级联到子表; 将新的 gu_id 值插入 ngu_id 列; 根据 gu_id 删除旧的外键约束; 删除 gu_id 列并将 ngu_id 重命名为 gu_id; 如有必要,重新创建约束;
- 如果可用,请使用
更新级联
?
我的问题是:
- 有更好的方法吗? (不能把头埋在沙子里,必须这样做)
- 最合适的方法是什么? (我必须在 Oracle、SQL Server 和 mysql4 中执行此操作,因此欢迎特定于供应商的 hack)
- 此类练习的典型失败点是什么以及如何最大限度地减少它们?
如果到目前为止您和我在一起,谢谢您并希望您能提供帮助:)
I have a fairly huge database with a master table with a single column GUID (custom GUID like algorithm) as primary key and 8 child tables that have foreign key relationships with this GUID column. All the tables have approximately 3-8 million records. None of these tables have any BLOB/CLOB/TEXT or any other fancy data types just normal numbers, varchars, dates, and timestamps (about 15-45 columns in each table). No partitions or other indexes other than the primary and foreign keys.
Now, the custom GUID algorithm has changed and though there are no collisions I would like to migrate all the old data to use GUIDs generated using the new algorithm. No other columns need to be changed. Number one priority is data integrity and performance is secondary.
Some of the possible solutions that I could think of were (as you will probably notice they all revolve around one idea only)
- add new column ngu_id and populate with new gu_id; disable constraints; update child tables with ngu_id as gu_id; renaname ngu_id->gu_id; re-enable constraints
- read one master record and its dependent child records from child tables; insert into the same table with new gu_id; remove all records with old gu_ids
- drop constraints; add a trigger to the master table such that all the child tables are updated; start updating old gu_id's with new new gu_ids; re-enable constraints
- add a trigger to the master table such that all the child tables are updated; start updating old gu_id's with new new gu_ids
- create new column ngu_ids on all master and child tables; create foreign key constraints on ngu_id columns; add update trigger to the master table to cascade values to child tables; insert new gu_id values into ngu_id column; remove old foreign key constraints based on gu_id; remove gu_id column and rename ngu_id to gu_id; recreate constraints if necessary;
- use
on update cascade
if available?
My questions are:
- Is there a better way? (Can't burrow my head in the sand, gotta do this)
- What is the most suitable way to do this? (I've to do this in Oracle, SQL server and mysql4 so, vendor-specific hacks are welcome)
- What are the typical points of failure for such an exercise and how to minimize them?
If you are with me so far, thank you and hope you can help :)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
你的想法应该可行。 第一种可能是我会使用的方式。 执行此操作时需要考虑的一些注意事项和事项:
除非您有当前备份,否则请勿执行此操作。
我会将这两个值保留在主表中。 这样,如果您需要从一些旧的文书工作中找出您需要访问哪些记录,您就可以做到。
执行此操作时,请关闭数据库进行维护并将其置于单用户模式。 在执行此类操作时,您最不需要的就是用户在您处于中途时尝试进行更改。 当然,进入单用户模式后的第一个操作就是上述备份。 您可能应该将停机时间安排在使用量最轻的时候。
首先在开发上进行测试! 这还应该让您了解需要关闭生产多长时间。 另外,您可以尝试多种方法,看看哪种方法最快。
请务必提前与用户沟通,数据库将在计划的维护时间关闭,以及何时可以再次可用。 确保时机没问题。 当他们计划熬夜运行季度报告而数据库不可用并且他们不知道时,这确实让人生气。
有相当多的记录,您可能希望批量运行子表的更新(不使用级联更新的原因之一)。 这比尝试通过一次更新更新 500 万条记录要快。 但是,不要尝试一次更新一条记录,否则明年您仍将在这里执行此任务。
删除所有表中 GUID 字段上的索引,并在完成后重新创建。 这应该会提高变更的性能。
Your ideas should work. the first is probably the way I would use. Some cautions and things to think about when doing this:
Do not do this unless you have a current backup.
I would leave both values in the main table. That way if you ever have to figure out from some old paperwork which record you need to access, you can do it.
Take the database down for maintenance while you do this and put it in single user mode. The very last thing you need while doing something like this is a user attempting to make changes while you are in midstream. Of course, the first action once in single user mode is the above-mentioned backup. You probably should schedule the downtime for some time when the usage is lightest.
Test on dev first! This should also give you an idea as to how long you will need to close production for. Also, you can try several methods to see which is the fastest.
Be sure to communicate in advance to users that the database will be going down at the scheduled time for maintenance and when they can expect to have it be available again. Make sure the timing is ok. It really makes people mad when they plan to stay late to run the quarterly reports and the database is not available and they didn't know it.
There are a fairly large number of records, you might want to run the updates of the child tables in batches (one reason not to use cascading updates). This can be faster than trying to update 5 million records with one update. However, don't try to update one record at a time or you will still be here next year doing this task.
Drop indexes on the GUID field in all the tables and recreate after you are done. This should improve the performance of the change.
创建一个新表,其中包含旧的和新的 pk 值。 对两列设置独特的约束,以确保到目前为止您还没有破坏任何内容。
禁用约束。
对所有表运行更新以将旧值修改为新值。
启用 PK,然后启用 FK。
Create a new table with the old and the new pk values in it. Place unique constraints on both columns to ensure you haven't broken anything so far.
Disable constraints.
Run an updates against all the tables to modify the old value to the new value.
Enable the PK, then enable the FK's.
很难说“最好”或“最合适”的方法是什么,因为您还没有描述您在解决方案中寻找什么。 例如,在迁移到新 ID 时,表是否需要可用于查询? 它们是否需要可用于并发修改? 尽快完成迁移很重要吗? 最小化用于迁移的空间很重要吗?
话虽如此,我更喜欢#1而不是你的其他想法,假设它们都满足你的要求。
任何涉及更新子表的触发器的操作似乎都容易出错并且过于复杂,并且可能不会像#1 那样执行。
可以安全地假设新 ID 永远不会与旧 ID 发生冲突吗? 如果不是,基于一次更新一个 ID 的解决方案将不得不担心冲突——这很快就会变得混乱。
您是否考虑过使用
CREATE TABLE AS SELECT
(CTAS) 用新ID 填充新表? 您将复制现有表,这将需要额外的空间,但它可能比更新现有表更快。 这个想法是:(i) 使用 CTAS 创建具有新 ID 的新表来代替旧表,(ii) 在新表上创建适当的索引和约束,(iii) 删除旧表,(iv) 重命名新表表更改为旧名称。It's difficult to say what the "best" or "most suitable" approach is as you have not described what you are looking for in a solution. For example, do the tables need to be available for query while you are migrating to new IDs? Do they need to be available for concurrent modification? Is it important to complete the migration as fast as possible? Is it important to minimize the space used for migration?
Having said that, I would prefer #1 over your other ideas, assuming they all met your requirements.
Anything that involves a trigger to update the child tables seems error-prone and over complicated and likely will not perform as well as #1.
Is it safe to assume that new IDs will never collide with old IDs? If not, solutions based on updating the IDs one at a time will have to worry about collisions -- this will get messy in a hurry.
Have you considered using
CREATE TABLE AS SELECT
(CTAS) to populate new tables with the new IDs? You'll be making a copy of your existing tables and this will require additional space, however it is likely to be faster than updating the existing tables in place. The idea is: (i) use CTAS to create new tables with new IDs in place of the old, (ii) create indexes and constraints as appropriate on the new tables, (iii) drop the old tables, (iv) rename the new tables to the old names.事实上,这取决于您的 RDBMS。
使用 Oracle,最简单的选择是“推迟”所有外键约束(检查提交),在单个事务中执行更新,然后提交。
In fact, it depend on your RDBMS.
Using Oracle, the simpliest choice is to make all of the foreign key constraints "deferred" (check on commit), perform updates in a single transaction, then commit.