How to determine whether records in each source represent the same person

Posted on 2024-07-06 03:43:28

I have several source tables containing personal data, like this:

SOURCE 1
ID, FIRST_NAME, LAST_NAME, FIELD1, ...
1, jhon, gates ...

SOURCE 2
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
1, jon, gate ...

SOURCE 3
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
2, jhon, ballmer ...

So, assuming that the records with ID 1 in sources 1 and 2 are the same person, my problem is how to determine whether records in different sources represent the same person. Also, not every record exists in every source. The names are written mainly in Spanish.

In this case, exact matching needs to be relaxed, because we assume the data sources have not been rigorously checked against the country's official bureau of identification. We also need to assume that typos are common, given the nature of the processes used to collect the data. What is more, each source holds around 2 or 3 million records...

Our team had thought of something like this: first, force exact matching on selected fields such as ID NUMBER and NAMES, to gauge how hard the problem is. Second, relax the matching criteria and count how many more records can be matched. But this is where the problem arises: how do we relax the matching criteria without either generating too much noise or restricting too much?
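
To make the two-step idea concrete, here is a minimal Python sketch. It assumes each source is a list of (id, first_name, last_name) tuples. The function names and the 0.8 threshold are illustrative choices, not part of any existing tool. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever similarity measure ends up being chosen.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_records(source_a, source_b, threshold=0.8):
    """Return (exact, relaxed) lists of (id_a, id_b) pairs.

    Each source is a list of (id, first_name, last_name) tuples.
    The threshold is an assumed starting value that needs tuning.
    """
    exact, relaxed = [], []

    # Pass 1: exact match on the normalized (first_name, last_name) pair.
    index_b = {}
    for rec in source_b:
        index_b.setdefault((normalize(rec[1]), normalize(rec[2])), []).append(rec)
    unmatched_a = []
    for rec in source_a:
        hits = index_b.get((normalize(rec[1]), normalize(rec[2])))
        if hits:
            exact.extend((rec[0], hit[0]) for hit in hits)
        else:
            unmatched_a.append(rec)

    # Pass 2: relaxed match; both name fields must clear the threshold.
    # This is a quadratic scan, so it is only suitable for small samples.
    for rec in unmatched_a:
        for other in source_b:
            if (similarity(rec[1], other[1]) >= threshold
                    and similarity(rec[2], other[2]) >= threshold):
                relaxed.append((rec[0], other[0]))
    return exact, relaxed
```

With the sample rows above, `match_records([(1, "jhon", "gates")], [(1, "jon", "gate")])` finds no exact pair and one relaxed pair, while the "jhon ballmer" row from source 3 is not matched to "jhon gates". Raising or lowering the threshold is the knob that trades noise against missed matches. With millions of rows, the second pass would need some way to restrict the candidate pairs first.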

What tools would be most effective for this? For example, do you know of a specific extension for some database engine that supports this kind of matching?
Do you know of clever algorithms like Soundex that handle this kind of approximate matching, but for Spanish text?
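
To illustrate the kind of Spanish-aware key I have in mind, here is a rough Python sketch. It is not Soundex and not any standard algorithm. The function `spanish_key` and its substitution rules (silent h, b/v, s/z, ll/y) are my own assumptions about common spelling confusions. Accent stripping uses the standard `unicodedata` module and also folds ñ into n, which may or may not be desirable.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks, so 'José' becomes 'Jose' and 'Muñoz' becomes 'Munoz'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def spanish_key(name: str) -> str:
    """Hypothetical phonetic-style key for Spanish names (illustrative rules only)."""
    key = strip_accents(name).lower()
    key = "".join(ch for ch in key if ch.isalpha())
    # Drop silent 'h', but keep it inside the digraph 'ch'.
    key = re.sub(r"(?<!c)h", "", key)
    # Assumed equivalences for letters that are often confused in writing.
    for old, new in (("v", "b"), ("z", "s"), ("ll", "y")):
        key = key.replace(old, new)
    # Collapse runs of the same letter, e.g. 'rr' becomes 'r'.
    collapsed = []
    for ch in key:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)
```

Under these rules "jhon" and "jon" get the same key, as do "Vázquez" and "Vasquez". "gates" and "gate" still differ, so a key like this would have to be combined with some edit-distance style comparison.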

Any help would be appreciated!

Thanks.
