Cross-database migration with inconsistent database backends - input?
I am migrating some data from one database to another; it is production data that accidentally ended up in a testing database.
It is a typical relational database centered around a single User table.
Things to consider
- Duplicate rows between production and testing may exist in almost any table.
- Any column in any User-related table in testing may be missing content from production, or may contain information updated due to User re-registration.
- All tables contain `created` and `updated` columns.
I have been connecting to the database via a SOAP layer because it was the "easy" way. I do, however, have administration access to the machines running these databases.
Do you have any methods, advice, or pointers to aid me in this goal? Perhaps something along the lines of Content Migration - Best Practices (PDF), anything, really.
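Since every table has `created` and `updated` columns, one common approach is a newest-wins merge. Below is a minimal sketch in Python with SQLite, assuming both databases can be dumped locally and that `email` serves as a natural key; the table and column names (`prod`, `test`, `email`, `name`) are illustrative, not from the actual schema:

```python
import sqlite3

# Toy stand-ins for the two databases; real table/column names will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prod (email TEXT PRIMARY KEY, name TEXT, updated TEXT);
CREATE TABLE test (email TEXT PRIMARY KEY, name TEXT, updated TEXT);
INSERT INTO prod VALUES ('a@example.com', 'Alice',  '2020-01-05');
INSERT INTO test VALUES ('a@example.com', 'Alice2', '2020-02-01'),
                        ('b@example.com', 'Bob',    '2020-01-10');
""")

# Newest-wins merge: pull every row from both tables and, per key,
# keep the one with the larger `updated` timestamp.
merged = {}
rows = conn.execute("""
    SELECT email, name, updated FROM prod
    UNION ALL
    SELECT email, name, updated FROM test
""")
for email, name, updated in rows:
    # ISO-8601 strings compare correctly as plain strings; with real
    # data you would compare parsed datetimes instead.
    if email not in merged or updated > merged[email][1]:
        merged[email] = (name, updated)
```

Note that this only resolves conflicts where `updated` is trustworthy; rows touched by test fixtures after the accident would still need the manual review described in the answer below.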
1. Back up all the data first. It never hurts to say this!
2. Establish a reasonable sample size, i.e. how many records you are willing to look at in detail, based partly on your time/money and the value of corrected, accurate data.
3. Create a list, say in a spreadsheet, of those records.
4. If you can, identify (externally) which ones are real, maybe using email address or other fields to compare with other data.
5. Look for patterns. Is there any individual field - id, date, user_id, etc. - that looks as if it will help you know which records are good? Look for value patterns, low/high ranges, duplicated 'sample' data (the same value for a column in many records), dates without times, records with orphaned foreign IDs; there are a surprising number of things you can check!
6. Determine your final tolerance - are you looking for 100%? Or would 99.94% fixed be OK (well, acceptable then!) to the users?
7. Look at those duplicates you mentioned. For those records, can you apply any rule such as 'older record', 'newer record', or lowest ID number to at least eliminate them?
I hope this helps!