如何确定每个来源中的记录是否代表同一个人

发布于 2024-07-06 03:43:28 字数 869 浏览 11 评论 0原文

我有多个包含个人数据的表源,如下所示:

SOURCE 1
ID, FIRST_NAME, LAST_NAME, FIELD1, ...
1, jhon, gates ...

SOURCE 2
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
1, jon, gate ...

SOURCE 3
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
2, jhon, ballmer ...

因此,假设来自源 1 和 2 的 ID 为 1 的记录是同一个人,我的问题是如何确定每个源中的记录是否代表同一个人。 此外,确保并非所有记录都存在于所有来源中。 所有的名字,主要是用西班牙语写的。

在这种情况下,需要放宽精确匹配的要求,因为我们假设数据源没有经过国家官方鉴定局的严格检查。 此外,我们需要假设打字错误很常见,因为收集数据的过程的性质。 更重要的是,每个来源的记录量约为 2 或 3 百万条...

我们的团队想到了这样的事情:首先,强制在 ID NUMBER 和 NAMES 等选定字段中进行精确匹配,以了解问题的难度是。 其次,放宽匹配标准,统计还能匹配多少条记录,但问题就出在这里:如何放宽匹配标准,既不产生太多噪音,又不限制太多?

什么工具可以更有效地处理这个问题吗?例如,您是否知道某些数据库引擎中的某些特定扩展可以支持这种匹配? 您是否知道像 soundex 这样的聪明算法来处理这种近似匹配,但适用于西班牙语文本?

任何帮助,将不胜感激!

谢谢。

I have several sources of tables with personal data, like this:

SOURCE 1
ID, FIRST_NAME, LAST_NAME, FIELD1, ...
1, jhon, gates ...

SOURCE 2
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
1, jon, gate ...

SOURCE 3
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
2, jhon, ballmer ...

So, assuming that records with ID 1, from sources 1 and 2, are the same person, my problem is how to determine if a record in every source, represents the same person. Additionally, sure not every records exists in all sources. All the names, are written in spanish, mainly.

In this case, the exact matching needs to be relaxed because we assume the data sources has not been rigurously checked against the official bureau of identification of the country. Also we need to assume typos are common, because the nature of the processes to collect the data. What is more, the amount of records is around 2 or 3 millions in every source...

Our team had thought in something like this: first, force exact matching in selected fields like ID NUMBER, and NAMES to know how hard the problem can be. Second, relaxing the matching criteria, and count how much records more can be matched, but is here where the problem arises: how to do to relax the matching criteria without generating too noise neither restricting too much?

What tool can be more effective to handle this?, for example, do you know about some especific extension in some database engine to support this matching?
Do you know about clever algorithms like soundex to handle this approximate matching, but for spanish texts?

Any help would be appreciated!

Thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

川水往事 2024-07-13 03:43:28

问题的关键是计算每对条目之间的一个或多个距离度量,然后当其中一个距离小于某个可接受的阈值时认为它们是相同的。 关键是设置分析,然后改变可接受的距离,直到达到您认为的假阳性和假阴性之间的最佳权衡。

一种距离测量可以是语音的。 您可能会考虑的另一个是整体之间的编辑或编辑距离,它会尝试测量拼写错误。

如果您对应该拥有多少人有合理的想法,那么您的目标就是找到获得适当人数的最佳点。 如果你的匹配太模糊,那么你的匹配就会太少。 对其进行限制,你就会拥有太多。

如果您大致知道一个人应该有多少条目,那么您可以使用它作为衡量标准来查看您何时接近。 或者,您可以将记录数除以每个人的平均记录数,并获得您要拍摄的大致人数。

如果您没有任何数字可供使用,那么您只能从分析中挑选出一组记录,并手动检查它们是否看起来像同一个人。 所以这是猜测和检查。

我希望这有帮助。

The crux of the problem is to compute one or more measures of distance between each pair of entries and then consider them to be the same when one of the distances is less than a certain acceptable threshold. The key is to setup the analysis and then vary the acceptable distance until you reach what you consider to be the best trade-off between false-positives and false-negatives.

One distance measurement could be phonetic. Another you might consider is the Levenshtein or edit distance between the entires, which would attempt to measure typos.

If you have a reasonable idea of how many persons you should have, then your goal is to find the sweet spot where you are getting about the right number of persons. Make your matching too fuzzy and you'll have too few. Make it to restrictive and you'll have too many.

If you know roughly how many entries a person should have, then you can use that as the metric to see when you are getting close. Or you can divide the number of records into the average number of records for each person and get a rough number of persons that you're shooting for.

If you don't have any numbers to use, then you're left picking out groups of records from your analysis and checking by hand whether they look like the same person or not. So it's guess and check.

I hope that helps.

烟火散人牵绊 2024-07-13 03:43:28

这听起来像是一个客户数据集成问题。 搜索该术语,您可能会找到更多信息。 另外,看看数据仓库研究所,您也可能会在那里找到一些答案。

编辑:此外,这里一篇文章您可能对西班牙语语音匹配感兴趣。

This sounds like a Customer Data Integration problem. Search on that term and you might find some more information. Also, have a poke around inside The Data Warehousing Institude, and you might find some answers there as well.

Edit: In addition, here's an article that might interest you on spanish phonetic matching.

昵称有卵用 2024-07-13 03:43:28

SSIS ,尝试使用模糊查找转换

SSIS , try using the Fuzzy Lookup transformation

软糯酥胸 2024-07-13 03:43:28

只是为了添加一些细节来解决这个问题,我发现了 Postgresql 8.3 的这个模块

Just to add some details to solve this issue, I'd found this modules for Postgresql 8.3

我一直都在从未离去 2024-07-13 03:43:28

我以前必须做类似的事情,我所做的是使用 双变音位 语音搜索关于名字。

不过,在比较名称之前,我尝试通过在我创建的昵称表中查找名称来规范化任何名称/昵称差异。 (我用网上找到的人口普查数据填充了表格)所以叫 Bob 的人变成了 Robert,Alex 变成了 Alexander,Bill 变成了 William,等等。

编辑:Double Metaphone 专门设计为比 Soundex 更好并且工作英语以外的语言。

I've had to do something similar before and what I did was use a double metaphone phonetic search on the names.

Before I compared the names though, I tried to normalize away any name/nickname differences by looking up the name in a nick name table I created. (I populated the table with census data I found online) So people called Bob became Robert, Alex became Alexander, Bill became William, etc.

Edit: Double Metaphone was specifically designed to be better than Soundex and work in languages other than English.

无声无音无过去 2024-07-13 03:43:28

您可以尝试通过将名称与词典进行比较来规范化名称。
这将使您能够发现一些常见的拼写错误并纠正它们。

You might try to cannonicalise the names by comparing them with a dicionary.
This would allow you to spot some common typos and correct them.

缺⑴份安定 2024-07-13 03:43:28

在我看来,您遇到了记录链接问题。 您可以使用链接中的参考资料。

Sounds to me you have a record linkage problem. You can use the references in the link.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文