The ultimate MySQL legacy database nightmare
Table1:
Everything including the kitchen sink. Dates in the wrong format (year last, so you cannot sort on that column), numbers stored as VARCHAR, complete addresses in the 'street' column, first name and last name in the firstname column, city in the lastname column, incomplete addresses, rows that update preceding rows by moving data from one field to another based on a set of rules that has changed over the years, duplicate records, incomplete records, garbage records... you name it... oh, and of course not a TIMESTAMP or PRIMARY KEY column in sight.
Table2:
Any hope of normalization went out the window upon cracking this baby open.
We have a row for each entry AND update of rows in table one. So duplicates like there is no tomorrow (800MB worth) and columns like Phone1 Phone2 Phone3 Phone4 ... Phone15 (they are not called phone; I use this for illustration). The foreign key is... well, take a guess. There are three candidates, depending on what kind of data was in the row in table1.
Table3:
Can it get any worse? Oh yes.
The foreign key is a VARCHAR column combination of dashes, dots, numbers and letters! If that doesn't provide the match (which it often doesn't), then a second column of similar product code should. There are columns whose names bear NO correlation to the data within them, the obligatory Phone1 Phone2 Phone3 Phone4... Phone15, columns duplicated from Table1, and not a TIMESTAMP or PRIMARY KEY column in sight.
Table4: was described as a work in progress and subject to change at any moment. It is essentially similar to the others.
At close to 1m rows this is a BIG mess. Luckily it is not my big mess. Unluckily, I have to pull out of it a composite record for each "customer".
Initially I devised a four-step translation of Table1, adding a PRIMARY KEY and converting all the dates into a sortable format. Then a couple more steps of queries that returned filtered data, until I had Table1 to where I could use it to pull from the other tables to form the composite. After weeks of work I got this down to one step using some tricks. So now I can point my app at the mess and pull out a nice clean table of composited data. Luckily I only need one of the phone numbers for my purposes, so normalizing my table is not an issue.
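Purely for illustration, a minimal sketch of what that one-step staging pass might look like, assuming a cleaned-up copy called staging_table1; the column names, the source date format and the mysql-connector-python driver are all assumptions, not details from the tables described above.

    # Hypothetical one-pass staging transform: surrogate PRIMARY KEY plus
    # sortable dates, built straight from the legacy table1.
    import mysql.connector

    conn = mysql.connector.connect(host="legacy-host", user="etl",
                                   password="secret", database="legacy")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE staging_table1 (
            id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
        )
        SELECT STR_TO_DATE(entry_date, '%d/%m/%Y') AS entry_date_sortable,
               CAST(amount AS DECIMAL(12,2))       AS amount,
               firstname, lastname, street
        FROM table1
        WHERE firstname IS NOT NULL   -- drop the worst of the garbage rows
    """)
    conn.commit()
    cur.close()
    conn.close()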
However, this is where the real task begins, because every day hundreds of employees add/update/delete rows in this database in ways you don't want to imagine, and every night I must retrieve the new rows.
Since existing rows in any of the tables can be changed, and since there are no TIMESTAMP ON UPDATE columns, I will have to resort to the logs to know what has happened. Of course this assumes that there is a binary log, which there is not!
Introducing the concept went down like a lead balloon. I might as well have told them that their children are going to have to undergo experimental surgery. They are not exactly hi-tech... in case you hadn't gathered...
The situation is a little delicate as they have some valuable information that my company wants badly. I have been sent down by senior management of a large corporation (you know how they are) to "make it happen".
I can't think of any other way to handle the nightly updates than parsing the bin log file with yet another application, to figure out what they have done to that database during the day and then composite my table accordingly. I really only need to look at their table1 to figure out what to do to my table. The other tables just provide fields to flesh out the record. (Using MASTER/SLAVE replication won't help, because I would just have a duplicate of the mess.)
The alternative is to create a unique hash for every row of their table1 and build a hash table. Then I would go through the ENTIRE database every night, checking to see if the hashes match. If they do not, then I would read that record and check whether it exists in my database; if it does, I would update it in my database; if it doesn't, it's a new record and I would INSERT it. This is ugly and not fast, but parsing a binary log file is not pretty either.
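As a rough illustration of that hashing pass, here is a Python sketch under some big assumptions: a business_key column good enough to identify a row, a local row_hashes table left over from the previous run, and the mysql-connector-python driver. Every name here is hypothetical.

    import hashlib
    import mysql.connector

    src = mysql.connector.connect(host="legacy-host", user="etl",
                                  password="secret", database="legacy")
    dst = mysql.connector.connect(host="localhost", user="etl",
                                  password="secret", database="composite")
    src_cur, dst_cur = src.cursor(), dst.cursor()

    # Hashes recorded by the previous night's run, keyed by the business key.
    dst_cur.execute("SELECT business_key, row_hash FROM row_hashes")
    known = dict(dst_cur.fetchall())

    src_cur.execute("SELECT business_key, firstname, lastname, street, phone1 FROM table1")
    for row in src_cur:
        key = row[0]
        digest = hashlib.md5("|".join(str(c) for c in row).encode()).hexdigest()
        if known.get(key) == digest:
            continue                      # unchanged since last night
        # Changed or brand new: upsert the composite row and remember the hash.
        dst_cur.execute(
            """INSERT INTO composite (business_key, firstname, lastname, street, phone)
               VALUES (%s, %s, %s, %s, %s)
               ON DUPLICATE KEY UPDATE firstname=VALUES(firstname),
                                       lastname=VALUES(lastname),
                                       street=VALUES(street),
                                       phone=VALUES(phone)""",
            row)
        dst_cur.execute(
            "REPLACE INTO row_hashes (business_key, row_hash) VALUES (%s, %s)",
            (key, digest))
    dst.commit()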
I have written this to help get clear about the problem. Often, telling it to someone else helps clarify the problem, making a solution more obvious. In this case I just have a bigger headache!
Your thoughts would be greatly appreciated.
Comments (4)
Can't you use the existing code which accesses this database and adapt it to your needs? Of course, the code must be horrible, but it might handle the database structure for you, no? You could hopefully concentrate on getting your work done instead of playing archaeologist then.
You might be able to use maatkit's mk-table-sync tool to synchronise a staging database (your database is only very small, after all). This will "duplicate the mess".
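Purely as a sketch of how a nightly job might drive that sync, with placeholder hosts and database names; check the exact flags and DSN syntax against the maatkit documentation for whatever version is installed.

    import subprocess

    # Sync their table1 into a local staging copy (placeholder DSNs).
    subprocess.run(
        ["mk-table-sync", "--execute",
         "h=legacy-host,D=legacy,t=table1",     # source: their server
         "h=localhost,D=staging,t=table1"],     # destination: local staging
        check=True,
    )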
You could then write something that, after the sync, does various queries to generate a set of more sane tables that you can then report off.
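For instance, something along these lines could collapse the repeated PhoneN columns and deduplicate per customer after each sync; the table and column names are invented for illustration, and mysql-connector-python is an assumption.

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="report",
                                   password="secret", database="staging")
    cur = conn.cursor()

    # Rebuild the "sane" reporting table from scratch after each sync:
    # one row per customer key, first non-empty phone wins.
    cur.execute("DROP TABLE IF EXISTS sane_customers")
    cur.execute("""
        CREATE TABLE sane_customers (
            id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
        )
        SELECT business_key,
               MAX(firstname) AS firstname,
               MAX(lastname)  AS lastname,
               MAX(COALESCE(NULLIF(Phone1, ''),
                            NULLIF(Phone2, ''),
                            NULLIF(Phone3, ''))) AS phone
        FROM table2
        GROUP BY business_key
    """)
    conn.commit()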
I imagine that this could be done on a daily basis without a performance problem.
Doing it all off a different server will avoid impacting the original database.
The only problem I can see is if some of the tables don't have primary keys.
I am not a MySQL person, so this is coming out of left field.
But I think the log files might be the answer.
Thankfully, you really only need to know 2 things from the log.
You need the record/rowid, and you need the operation.
In most DBs, and I assume in MySQL, there's an implicit column on each row, like a rowid or recordid, or whatever. It's the internal row number used by the database. This is your "free" primary key.
Next, you need the operation. Notably whether it's an insert, update, or delete operation on the row.
You consolidate all of this information, in time order, and then run through it.
For each insert/update, you select the row from your original DB, and insert/update that row in your destination DB. If it's a delete, then you delete the row.
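A bare-bones Python sketch of that replay loop, assuming the change events have already been extracted from the logs as (row_id, operation) pairs in time order; row_id, the table names and the mysql-connector-python driver are all assumptions rather than anything the original tables actually provide.

    import mysql.connector

    src = mysql.connector.connect(host="legacy-host", user="etl",
                                  password="secret", database="legacy")
    dst = mysql.connector.connect(host="localhost", user="etl",
                                  password="secret", database="composite")
    src_cur, dst_cur = src.cursor(), dst.cursor()

    def replay(events):
        """events: iterable of (row_id, op), op in {'insert', 'update', 'delete'}."""
        for row_id, op in events:
            if op == "delete":
                dst_cur.execute("DELETE FROM composite WHERE row_id = %s", (row_id,))
                continue
            # Insert or update: re-read the whole current row from the source
            # and write it wholesale into the destination, as suggested above.
            src_cur.execute(
                "SELECT firstname, lastname, street, phone1 "
                "FROM table1 WHERE row_id = %s", (row_id,))
            row = src_cur.fetchone()
            if row is None:
                continue              # the row is already gone again
            dst_cur.execute(
                """INSERT INTO composite (row_id, firstname, lastname, street, phone)
                   VALUES (%s, %s, %s, %s, %s)
                   ON DUPLICATE KEY UPDATE firstname=VALUES(firstname),
                                           lastname=VALUES(lastname),
                                           street=VALUES(street),
                                           phone=VALUES(phone)""",
                (row_id,) + row)
        dst.commit()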
You don't care about field values, they're just not important. Do the whole row.
You hopefully shouldn't have to "parse" binary log files, MySQL already must have routines to do that, you just need to find and figure out how to use them (there may even be some handy "dump log" utility you could use).
This lets you keep the system pretty simple, and it should only depend on your actual activity during the day, rather than the total DB size. Finally, you could later optimize it by making it "smarter". For example, perhaps they insert a row, then update it, then delete it. You would know you can just ignore that row completely in your replay.
Obviously this takes a bit of arcane knowledge in order to actually read the log files, but the rest should be straightforward. I would like to think that the log files are timestamped as well, so you can know to work on rows "from today", or whatever date range you want.
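For what it's worth, MySQL does ship a "dump log" utility along these lines, mysqlbinlog, which prints the binary log as readable statements. A hedged sketch of pulling one day's worth of activity from a nightly job, with the log path and dates as placeholders:

    import subprocess

    out = subprocess.run(
        ["mysqlbinlog",
         "--start-datetime", "2009-06-01 00:00:00",
         "--stop-datetime",  "2009-06-02 00:00:00",
         "/var/log/mysql/mysql-bin.000123"],      # placeholder log file
        capture_output=True, text=True, check=True,
    ).stdout
    # `out` now holds that day's statements, ready to be scanned for table1 changes.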
The log files (binary logs) were my first thought too. If you knew how they did things you would shudder. For every row there are many, many entries in the log as pieces are added and changed. It's just HUGE!
For now I have settled on the hash approach. With some clever file memory paging this is quite fast.