帮助标准化数据库的建议和工具

发布于 2024-09-24 00:16:58 字数 442 浏览 6 评论 0原文

我有 7 个 MySQL 表,其中包含大约 17000 行的部分重叠和冗余数据。所有表都包含学校的名称和地址。有时同一所学校会在一个表中重复,但名称略有不同,有时同一所学校会出现在多个表中,但其名称或地址略有不同。

我的任务是创建一个包含 id、名称和城镇/城市 id 字段的表,其中包含 7 个表中的数据。城镇将有一个单独的表,其中包含 ID 和名称字段。

由于原始表没有针对城市的单独字段,因此必须从包含格式截然不同的数据的地址字段中提取该字段,这一事实使情况变得复杂。

我意识到大部分工作都必须手动完成,我已经硬化了我的灵魂,我准备好应对这将给我带来的生活地狱。我的问题是:

  • 你将如何开始这样的任务?尽可能实现自动化的明智策略是什么?
  • 有没有可用的工具可以加快速度?就像可以比较字符串、确定它们的“相似性”并建议可能的重复项的东西?

谢谢!

I have 7 MySQL tables that contain partly overlapping and redundant data in approximately 17000 rows. All tables contain names and addresses of schools. Sometimes the same school is duplicated in a table with a slightly different name, and sometimes the same school appears in multiple tables, again, with small differences in its name or address.

My task is to create a single table with an id, name and town/city id field which would contain the data from the 7 tables. There will be a separate table for towns with an id and name field.

This is complicated by the fact that the original tables have no separate field for the city, it has to be extracted from the address field which has data in wildly different formats.

I realize that most of this has to be done manually, I've hardened my soul and I am ready to deal with the living hell this will bring upon me. My questions are:

  • how would you start such a task? what would be an intelligent strategy to automate as much of it as possible?
  • are there any tools available that could make this faster? like something that can compare strings, determine their 'likeness' and suggest possible duplications?

Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

舟遥客 2024-10-01 00:16:58

我曾经做过类似的事情,尽管使用的是 Oracle 而不是 MySQL。 17000 条记录是可管理的记录数量,但足以使其值得构建工具。

维基百科有几乎世界各地的学校列表,但遗憾的是匈牙利没有。这很遗憾,因为最好列出系统中应有的数据,而不是尝试从破碎的数据库中重建它。如果您能获得这样一个列表,这样您就可以将您的记录与其进行比较,而不是相互比较,这将很有用。即使您只能获取邮政编码和城市名称列表,这仍然会有帮助。

当涉及模糊匹配字符串时,有许多可用的统计技巧。一种常见的距离是编辑距离。这给出了一个分数,指示两个字符串之间的相似性,表示为从字符串 A 到字符串 B 所需的更改数量(也称为编辑距离)。 Code Janitor 有一个 MySQL 的实现,但这只是 Google 的热门产品,我没有提供任何保证。 在这里找到它 。 Jaro Winkler 是另一种匹配算法,但互联网上的实现似乎较少。

一般处理提示

  • 将所有学校名称和地址提取到一个表中,并使用元数据指示每行的出处(数据库、表、主键)。
  • 添加列来保存学校名称和地址的搜索字符串。例如,有一列去掉标点符号和数字(在某些匹配算法中,THMAS 比 TH0MAS 更接近 THOMAS)。
  • 在另一列中,允许常见的拼写错误(在英国地址中,字符串 SW!^ 可能代表 SW16,即 Streatham 的邮政编码)。
  • 在这些搜索列上构建全文索引,以帮助找出城市名称等常见单词的出现。
  • 如果您有大量数据并且有一定的耐心,那么构建一个同义词库来识别 Rd、St、Blvd 等常见缩写可能是一个有用的练习,但对于 17000 行来说可能不值得。
  • 使用正则表达式来匹配模式,例如邮政编码。

I did something like this once, although using Oracle rather than MySQL. 17000 records is a manageable number of records but enough to make it worthwhile building tools.

Wikipedia has lists of schools pretty much all over the world, but alas not for Hungary. This is a pity, because it would be better to have list of the data you should have in your system rather than trying to reconstruct it from your fractured databases. It would be useful if you can get hold of such a list so you can compare your records against it rather than against each other. Even if you can just get hold of lists of postcodes and city names that would still help.

When it comes to fuzzy matching strings there are a number of statistical tricks available. A common one is the Levenshtein Distance. This gives a score indicating the similarity between two strings, expressed as the number of changes needed to get from string A to string B (AKA edit distance). Code Janitor has an implementation for MySQL but this is just the top hit in Google, and comes with no warranty from me. Find it here. Jaro Winkler is another matching algorithm, but there seem to be fewer implementations kicking around the Internet.

General processing hints

  • Extract all the school names and addresses into a single table, with metadata indicating each row's provenance (database, table, primary key).
  • Add columns to hold search strings, for the school name and address. For instance, have one column which strips out punctation and numbers (in some matching algorithms THMAS is closer to THOMAS than is TH0MAS).
  • In another column allow for common typos (in a British address a string SW!^ is likely to represent SW16, the postcode for Streatham).
  • Build FullText indexes on those search columns to help pick out occurrences of common words like city names.
  • If you had a huge amount of data and some patience, building a thesaurus to identify common contractions like Rd, St, Blvd might be a useful exercise, but it probably isn't worth it for 17000 rows.
  • Use regex to match patterns, such as postcodes.
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文