MySQL 遗留数据库的终极噩梦

发布于 2024-07-05 05:22:08 字数 1715 浏览 9 评论 0原文

表格1：一切包括厨房水槽。日期格式错误（去年，因此无法对该列进行排序）、存储为 VARCHAR 的数字、“街道”列中的完整地址、名字列中的名字和姓氏、姓氏列中的城市、不完整的地址、通过根据多年来更改的一组规则将数据从一个字段移动到另一个字段来更新前面的行，重复记录，不完整记录，垃圾记录......凡是你能想到的......哦，当然不是 TIMESTAMP 或 PRIMARY关键列就在眼前。

表2：当这个婴儿被打开时，任何正常化的希望都破灭了。我们为表一中的每个条目和行更新设置一行。因此，像没有明天这样的重复项（价值 800MB）和诸如 Phone1 Phone2 Phone3 Phone4 ... Phone15 之类的列（它们不称为电话。我用它来说明）外键是..我们猜一下。根据表 1 表 3 中的行中的数据类型，有 3 个候选者

：还能变得更糟吗。哦是的。 “外键是短划线、点、数字和字母的 VARCHAR 列组合！如果它不能提供匹配项（通常不会），则类似产品代码的第二列应该提供。具有以下名称的列与其中的数据没有关联，并且强制的 Phone1 Phone2 Phone3 Phone4... Phone15 有从 Table1 复制的列，并且看不到

Table4 的 TIMESTAMP 或 PRIMARY KEY 列：被描述为正在进行中的工作，可能会发生变化。在任何时候，它本质上都与其他人相似，

必须为每个“客户”提取一个合成记录。

幸运的是，这不是我的大混乱，不幸的是，我最初设计了 Table1 的四步转换，添加了主键并将所有日期转换为可排序格式，然后执行了几个返回过滤数据的查询步骤，直到我有了 Table1 到可以使用它从其他表中提取数据以形成的位置。经过几周的工作，我使用一些技巧将其简化为一个步骤。所以现在我可以将我的应用程序指向混乱的地方并拉出一个漂亮干净的合成数据表。幸运的是，我只需要其中一个电话号码来实现我的目的，因此标准化我的桌子不是问题。

然而，这才是真正的任务开始的地方，因为每天都有数百名员工以您无法想象的方式添加/更新/删除该数据库，并且每天晚上我都必须检索新行。

由于任何表中的现有行都可以更改，并且由于没有 TIMESTAMP ON UPDATE 列，因此我将不得不求助于日志来了解发生了什么。当然，这假设有二进制日志，但实际上不存在！

这个概念的推出就像铅气球一样坠落。我还不如告诉他们，他们的孩子将不得不接受实验性手术。他们并不完全是高科技……如果你没有收集到……

情况有点微妙，因为他们有一些我公司急需的有价值的信息。一家大公司的高级管理层（你知道他们是怎样的）派我去“实现这一目标”。

我想不出任何其他方法来处理夜间更新，除了使用另一个应用程序解析 bin 日志文件，找出它们白天对该数据库做了什么，然后相应地组合我的表。我真的只需要查看他们的 table1 就可以知道如何处理我的桌子。其他表仅提供用于刷新记录的字段。（使用 MASTER SLAVE 不会有帮助，因为我会得到一个混乱的副本。）

另一种方法是为其 table1 的每一行创建一个唯一的哈希并构建一个哈希表。然后我每天晚上都会检查整个数据库，检查哈希值是否匹配。如果它们不存在，那么我会读取该记录并检查它是否存在于我的数据库中，如果存在，那么我会在我的数据库中更新它，如果不存在，那么它是一个新记录，我会插入它。这很丑陋而且速度不快，但是解析二进制日志文件也不是很好。

我写这篇文章是为了帮助弄清楚这个问题。经常告诉别人有助于澄清问题，使解决方案更加明显。这样的话我就更头疼了！

我们将不胜感激您的想法。

原文

Table1:
Everything including the kitchen sink. Dates in the wrong format (year last so you cannot sort on that column), Numbers stored as VARCHAR, complete addresses in the 'street' column, firstname and lastname in the firstname column, city in the lastname column, incomplete addresses, Rows that update preceeding rows by moving data from one field to another based on some set of rules that has changed over the years, duplicate records, incomplete records, garbage records... you name it... oh and of course not a TIMESTAMP or PRIMARY KEY column in sight.

Table2:
Any hope of normalization went out the window upon cracking this baby open.
We have a row for each entry AND update of rows in table one. So duplicates like there is no tomorrow (800MB worth) and columns like Phone1 Phone2 Phone3 Phone4 ... Phone15 (they are not called phone. I use this for illustration) The foriegn key is.. well take guess. There are three candidates depending on what kind of data was in the row in table1

Table3:
Can it get any worse. Oh yes.
The "foreign key is a VARCHAR column combination of dashes, dots, numbers and letters! if that doesn't provide the match (which it often doesn't) then a second column of similar product code should. Columns that have names that bear NO correlation to the data within them, and the obligatory Phone1 Phone2 Phone3 Phone4... Phone15. There are columns Duplicated from Table1 and not a TIMESTAMP or PRIMARY KEY column in sight.

Table4: was described as a work in progess and subject to change at any moment. It is essentailly simlar to the others.

At close to 1m rows this is a BIG mess. Luckily it is not my big mess. Unluckily I have to pull out of it a composit record for each "customer".

Initially I devised a four step translation of Table1 adding a PRIMARY KEY and converting all the dates into sortable format. Then a couple more steps of queries that returned filtered data until I had Table1 to where I could use it to pull from the other tables to form the composit. After weeks of work I got this down to one step using some tricks. So now I can point my app at the mess and pull out a nice clean table of composited data. Luckily I only need one of the phone numbers for my purposes so normalizing my table is not an issue.

However this is where the real task begins, because every day hundreds of employees add/update/delete this database in ways you don't want to imagine and every night I must retrieve the new rows.

Since existing rows in any of the tables can be changed, and since there are no TIMESTAMP ON UPDATE columns, I will have to resort to the logs to know what has happened. Of course this assumes that there is a binary log, which there is not!

Introducing the concept went down like lead balloon. I might as well have told them that their children are going to have to undergo experimental surgery. They are not exactly hi tech... in case you hadn't gathered...

The situation is a little delicate as they have some valuable information that my company wants badly. I have been sent down by senior management of a large corporation (you know how they are) to "make it happen".

I can't think of any other way to handle the nightly updates, than parsing the bin log file with yet another application, to figure out what they have done to that database during the day and then composite my table accordingly. I really only need to look at their table1 to figure out what to do to my table. The other tables just provide fields to flush out the record. (Using MASTER SLAVE won't help because I will have a duplicate of the mess.)

The alternative is to create a unique hash for every row of their table1 and build a hash table. Then I would go through the ENTIRE database every night checking to see if the hashs match. If they do not then I would read that record and check if it exists in my database, if it does then I would update it in my database, if it doesn't then its a new record and I would INSERT it. This is ugly and not fast, but parsing a binary log file is not pretty either.

I have written this to help get clear about the problem. often telling it to someone else helps clarify the problem making a solution more obvious. In this case I just have a bigger headache!

Your thoughts would be greatly appreciated.

分享到QQ

分享到微博