帮助标准化数据库的建议和工具
我有 7 个 MySQL 表,其中包含大约 17000 行的部分重叠和冗余数据。所有表都包含学校的名称和地址。有时同一所学校会在一个表中重复,但名称略有不同,有时同一所学校会出现在多个表中,但其名称或地址略有不同。
我的任务是创建一个包含 id、名称和城镇/城市 id 字段的表,其中包含 7 个表中的数据。城镇将有一个单独的表,其中包含 ID 和名称字段。
由于原始表没有针对城市的单独字段,因此必须从包含格式截然不同的数据的地址字段中提取该字段,这一事实使情况变得复杂。
我意识到大部分工作都必须手动完成,我已经硬化了我的灵魂,我准备好应对这将给我带来的生活地狱。我的问题是:
- 你将如何开始这样的任务?尽可能实现自动化的明智策略是什么?
- 有没有可用的工具可以加快速度?就像可以比较字符串、确定它们的“相似性”并建议可能的重复项的东西?
谢谢!
I have 7 MySQL tables that contain partly overlapping and redundant data in approximately 17000 rows. All tables contain names and addresses of schools. Sometimes the same school is duplicated in a table with a slightly different name, and sometimes the same school appears in multiple tables, again, with small differences in its name or address.
My task is to create a single table with an id, name and town/city id field which would contain the data from the 7 tables. There will be a separate table for towns with an id and name field.
This is complicated by the fact that the original tables have no separate field for the city, it has to be extracted from the address field which has data in wildly different formats.
I realize that most of this has to be done manually, I've hardened my soul and I am ready to deal with the living hell this will bring upon me. My questions are:
- how would you start such a task? what would be an intelligent strategy to automate as much of it as possible?
- are there any tools available that could make this faster? like something that can compare strings, determine their 'likeness' and suggest possible duplications?
Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我曾经做过类似的事情,尽管使用的是 Oracle 而不是 MySQL。 17000 条记录是可管理的记录数量,但足以使其值得构建工具。
维基百科有几乎世界各地的学校列表,但遗憾的是匈牙利没有。这很遗憾,因为最好列出系统中应有的数据,而不是尝试从破碎的数据库中重建它。如果您能获得这样一个列表,这样您就可以将您的记录与其进行比较,而不是相互比较,这将很有用。即使您只能获取邮政编码和城市名称列表,这仍然会有帮助。
当涉及模糊匹配字符串时,有许多可用的统计技巧。一种常见的距离是编辑距离。这给出了一个分数,指示两个字符串之间的相似性,表示为从字符串 A 到字符串 B 所需的更改数量(也称为编辑距离)。 Code Janitor 有一个 MySQL 的实现,但这只是 Google 的热门产品,我没有提供任何保证。 在这里找到它 。 Jaro Winkler 是另一种匹配算法,但互联网上的实现似乎较少。
一般处理提示
I did something like this once, although using Oracle rather than MySQL. 17000 records is a manageable number of records but enough to make it worthwhile building tools.
Wikipedia has lists of schools pretty much all over the world, but alas not for Hungary. This is a pity, because it would be better to have list of the data you should have in your system rather than trying to reconstruct it from your fractured databases. It would be useful if you can get hold of such a list so you can compare your records against it rather than against each other. Even if you can just get hold of lists of postcodes and city names that would still help.
When it comes to fuzzy matching strings there are a number of statistical tricks available. A common one is the Levenshtein Distance. This gives a score indicating the similarity between two strings, expressed as the number of changes needed to get from string A to string B (AKA edit distance). Code Janitor has an implementation for MySQL but this is just the top hit in Google, and comes with no warranty from me. Find it here. Jaro Winkler is another matching algorithm, but there seem to be fewer implementations kicking around the Internet.
General processing hints