Repairing data integrity

Posted 2024-09-16 10:27:21 · 289 characters · 6 views · 0 comments


I think this is a long-shot, but here it goes:

The basic question is: how does a development team begin to repair data integrity on a large, damaged dataset?

The company I'm helping out has a huge MySQL/PHP5 system with a few years of cruft, invalid data, broken references, etc. To top it all off, this data references data on a few online services, such as Google AdWords.

So the local db has problems, and the relationships between the local and the remote (e.g. AdWords) also have problems, compounding the issue.

Does anyone have tips, tricks, or best-practices they can share for beginning to repair the data integrity? And to maintain data integrity in a system that is rapidly and continuously being added to and updated?


Comments (2)

意犹 2024-09-23 10:27:21


The big problem is identifying what you intend to do about the problem data:

  • nothing
  • reconstruct from data held elsewhere and accessible via code
  • reconstruct the data manually
  • delete it (or preferably archive it)

And in order to do that you need to establish how the problem data affects the system/organization and how the resolution will affect the system/organization.

This is your first level of classification. Once you've got this, you need to start identifying specific issues and from this derive a set of semantic rules defining errant patterns.
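In practice, those semantic rules can be expressed as queries that each return the rows violating one errant pattern. A minimal sketch in Python with SQLite standing in for MySQL; the `campaigns` / `adwords_accounts` schema and the rule name are illustrative, not from the original system:

```python
import sqlite3

# Illustrative schema only: campaigns reference adwords_accounts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE adwords_accounts (id INTEGER PRIMARY KEY);
    CREATE TABLE campaigns (id INTEGER PRIMARY KEY, account_id INTEGER);
    INSERT INTO adwords_accounts VALUES (1), (2);
    INSERT INTO campaigns VALUES (10, 1), (11, 2), (12, 99);  -- 99 is orphaned
""")

# Each rule is a name plus a query that returns the violating rows.
RULES = {
    "campaign_missing_account": """
        SELECT c.id FROM campaigns c
        LEFT JOIN adwords_accounts a ON a.id = c.account_id
        WHERE a.id IS NULL
    """,
}

def find_violations(conn):
    """Run every rule and collect the ids of the rows that violate it."""
    return {name: [row[0] for row in conn.execute(sql)]
            for name, sql in RULES.items()}

print(find_violations(conn))  # {'campaign_missing_account': [12]}
```

Each rule doubles as documentation of an errant pattern, and the counts it produces feed directly into the prioritization step described above.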

This should then allow you to define the fixes required, prioritize the work effectively and plan your resource utilization. It should also allow you to prioritize, plan and partially identify root-cause removal.

I'm not sure what your definition of 'huge' is - but I would infer that it means that there are lots of programmers contributing to it - in which case you certainly need to establish standards and procedures for managing the data integrity going forward, just as you should do with performance and security.
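One concrete standard worth establishing going forward is letting the database itself enforce referential integrity rather than relying on every contributing programmer's application code. A small sketch (hypothetical schema; SQLite stands in for MySQL, where the same effect comes from InnoDB foreign keys):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite needs this per connection
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY);
    CREATE TABLE campaigns (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES accounts(id)
    );
    INSERT INTO accounts VALUES (1);
""")

conn.execute("INSERT INTO campaigns VALUES (10, 1)")      # valid reference
try:
    conn.execute("INSERT INTO campaigns VALUES (11, 99)") # broken reference
except sqlite3.IntegrityError as e:
    print("rejected:", e)   # the database refuses to create a new orphan
```

With constraints in place, new cruft of this particular kind simply cannot be written, which shrinks the ongoing checking problem to the legacy rows.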

The rules you have defined are a starting point for ongoing data management, but you should think about how you are going to apply them going forward. Adding a timestamp field to every table, or maintaining tables that reference rows violating specific rules, means you won't need to process all the data every time you want to check it: just the rows that have changed since the last check. It's a good idea to keep track of the cases being removed from the violation list as well as the ones being added.
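The incremental-checking idea can be sketched as follows, again against a hypothetical schema: `updated_at` stands in for the suggested timestamp field, and `violations` is the table of rows known to violate a rule:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL,
                         updated_at INTEGER);          -- touched on every write
    CREATE TABLE violations (rule TEXT, row_id INTEGER,
                             PRIMARY KEY (rule, row_id));
""")

def check_changed_rows(conn, rule, sql, since):
    """Re-run one rule against rows changed since the last sweep,
    adding new violations and clearing ones that have been fixed."""
    changed = {r[0] for r in conn.execute(
        "SELECT id FROM orders WHERE updated_at > ?", (since,))}
    violating = {r[0] for r in conn.execute(sql, (since,))}
    for row_id in violating:
        conn.execute("INSERT OR IGNORE INTO violations VALUES (?, ?)",
                     (rule, row_id))
    for row_id in changed - violating:   # changed and now clean -> fixed
        conn.execute("DELETE FROM violations WHERE rule=? AND row_id=?",
                     (rule, row_id))

# Illustrative rule: order totals must be non-negative.
RULE_SQL = "SELECT id FROM orders WHERE updated_at > ? AND total < 0"
conn.execute("INSERT INTO orders VALUES (1, -5.0, 100), (2, 20.0, 100)")
check_changed_rows(conn, "negative_total", RULE_SQL, 0)
print(conn.execute("SELECT * FROM violations").fetchall())  # [('negative_total', 1)]

# Order 1 is repaired; only it needs re-checking on the next sweep.
conn.execute("UPDATE orders SET total = 5.0, updated_at = 200 WHERE id = 1")
check_changed_rows(conn, "negative_total", RULE_SQL, 100)
print(conn.execute("SELECT * FROM violations").fetchall())  # []
```

Each sweep touches only rows changed since the previous one, so the cost tracks the write rate rather than the size of the dataset.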

Do keep records of the fixes applied and the corresponding rule violations, and analyse the data to identify hotspots where refactoring may result in more maintainable code.

我乃一代侩神 2024-09-23 10:27:21


Depending on the requirements and how much "damage" exists, it might be prudent to create a new database and modify the application to update both in parallel.

Data which are valid could be imported into the new d/b, and then a progressive series of extractions could identify and import additional valid data, until the effort grows to the point where it no longer makes sense to try to recover seriously damaged data. Surely an undamaged, incomplete database is better and more useful than a corrupt one; as long as it is corrupt, it cannot be called "complete" anyway.
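The dual-write approach might look like this in outline (hypothetical `customers` table and validation rule; SQLite stands in for MySQL): the application keeps writing to the old database unchanged, while the new one only accepts rows that pass validation, so it stays clean by construction.

```python
import sqlite3

old_db = sqlite3.connect(":memory:")
new_db = sqlite3.connect(":memory:")
for db in (old_db, new_db):
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")

def is_valid(row):
    # Illustrative rule: an email address must be present and plausible.
    return row["email"] is not None and "@" in row["email"]

def save_customer(row):
    old_db.execute("INSERT INTO customers VALUES (:id, :email)", row)  # as before
    if is_valid(row):                                                  # gatekeeper
        new_db.execute("INSERT INTO customers VALUES (:id, :email)", row)

save_customer({"id": 1, "email": "a@example.com"})
save_customer({"id": 2, "email": None})            # legacy-style bad data

print(old_db.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # 2
print(new_db.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # 1
```

Once the progressive back-imports catch up, the application can be cut over to the new database and the old one archived.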
