Automatic normalization of a MySQL database - how to do it?
I have a MySQL database that consists of one huge table with 80 columns and 10 million rows. The data may contain inconsistencies.
I would like to normalize the database in an automated and efficient way.
I could do it using Java/C++/..., but I would like to do as much as possible inside the database. I suspect that any work done outside the database will slow things down considerably.
Any suggestions on how to do it? What are good resources/tutorials to start with?
I am not looking for explanations of what normalization is (I found plenty of that using Google)!
Comments (3)
You need to study the columns to identify 'like' entities and break them out into separate tables. At best, an automated tool might identify groups of rows with identical values in some of the columns, but a person who understands the data would have to decide whether those columns truly belong together as a separate entity.
Here's a contrived example: suppose your columns were first name, last name, address, city, state, zip. An automated tool might find rows for members of the same family, who share the same last name, address, city, state, and zip, and incorrectly conclude that those five columns represent an entity. It might then split the table up:
First Name, ReferenceID
and another table
ID, Last Name, Address, City, State, Zip
See what I mean?
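Queries can still help a person make that call by showing how often a candidate column group repeats and where it contradicts itself. A rough sketch, assuming a hypothetical table `people` with columns named after the example above (none of these names come from the original question):

```sql
-- Hypothetical table and column names, for illustration only.

-- 1. Which (last_name, address, city, state, zip) combinations occur
--    more than once, and how many rows share each one?
SELECT last_name, address, city, state, zip, COUNT(*) AS row_count
FROM people
GROUP BY last_name, address, city, state, zip
HAVING COUNT(*) > 1
ORDER BY row_count DESC
LIMIT 50;

-- 2. Does zip appear to determine city and state? Any zip listed here
--    maps to more than one (city, state) pair, which points to either
--    inconsistent data or a dependency that does not really hold.
--    Note: COUNT(DISTINCT ...) skips rows where city or state is NULL.
SELECT zip, COUNT(DISTINCT city, state) AS variants
FROM people
GROUP BY zip
HAVING variants > 1;
```

The output only narrows down where to look; deciding whether a repeating group is a real entity is still a judgment about the data.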
I can't think of any way you can automate it. You would have to create the tables that you want, and then go through and replace each piece of data with manual queries.
e.g.,
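The kind of manual queries meant here might look like the sketch below. All names in it are assumptions made for illustration (`big_table`, `contacts`, and the `contact_name`, `contact_email`, `contact_phone` columns); adapt them to the real schema.

```sql
-- Hypothetical names throughout; not taken from the original post.

-- New table for the extracted entity.
CREATE TABLE contacts (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(255),
    email VARCHAR(255),
    phone VARCHAR(50)
);

-- One row per distinct combination found in the wide table.
INSERT INTO contacts (name, email, phone)
SELECT DISTINCT contact_name, contact_email, contact_phone
FROM big_table;

-- Add the reference column to the wide table and fill it in.
ALTER TABLE big_table ADD COLUMN contact_id INT UNSIGNED;

-- <=> is MySQL's NULL-safe equality, so rows with NULLs still match.
UPDATE big_table b
JOIN contacts c
  ON  c.name  <=> b.contact_name
  AND c.email <=> b.contact_email
  AND c.phone <=> b.contact_phone
SET b.contact_id = c.id;
```

On 10 million rows the `UPDATE ... JOIN` will probably need an index on the matching `contacts` columns to finish in reasonable time, and it is worth checking that no `contact_id` is left NULL before removing anything.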
Then you could drop those columns from the massive table and replace them with a contact_id column.
You would have a similar process when pulling out rows that go into a one-to-many table.
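As a sketch of the one-to-many case, suppose the wide table holds repeated columns such as `phone1`, `phone2`, `phone3` (hypothetical names), and that a `contact_id` column as described above already exists in `big_table`. The repeated columns could then be unpivoted into a child table:

```sql
-- Hypothetical names; assumes big_table.contact_id is already populated.
CREATE TABLE contact_phones (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    contact_id INT UNSIGNED NOT NULL,
    phone      VARCHAR(50)  NOT NULL,
    KEY idx_contact (contact_id)
);

-- UNION (without ALL) drops duplicate (contact_id, phone) pairs.
INSERT INTO contact_phones (contact_id, phone)
SELECT contact_id, phone1 FROM big_table
WHERE phone1 IS NOT NULL AND contact_id IS NOT NULL
UNION
SELECT contact_id, phone2 FROM big_table
WHERE phone2 IS NOT NULL AND contact_id IS NOT NULL
UNION
SELECT contact_id, phone3 FROM big_table
WHERE phone3 IS NOT NULL AND contact_id IS NOT NULL;

-- Only after verifying the copied data:
ALTER TABLE big_table
    DROP COLUMN phone1,
    DROP COLUMN phone2,
    DROP COLUMN phone3;
```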
When cleaning up messy data, I like to create user-defined MySQL functions to do typical data-scrubbing work; that way you can reuse them later. This approach also lets you check whether existing UDFs have already been written that you can use (with or without modification), for example at mysqludf.org.
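Strictly speaking, a MySQL UDF is a compiled function loaded into the server, which is what mysqludf.org hosts. For simple scrubbing, a stored function written in plain SQL gives the same kind of reuse without compiling anything. A minimal sketch, where the function name `clean_text` and the `big_table.city` column are made up for illustration:

```sql
-- Hypothetical helper: trims surrounding spaces and turns
-- empty strings into NULL so that "missing" has one representation.
CREATE FUNCTION clean_text(s VARCHAR(255))
RETURNS VARCHAR(255)
DETERMINISTIC
NO SQL
RETURN NULLIF(TRIM(s), '');

-- Quick check on a literal value.
SELECT clean_text('  Boston  ');

-- Apply it to a (hypothetical) column of the wide table.
UPDATE big_table SET city = clean_text(city);
```

Because the body is a single `RETURN` statement, no `DELIMITER` change is needed in the mysql client. Running the cleanup before extracting entities should reduce the number of near-duplicate rows that end up in the new tables.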