Fuzzy record matching with multiple columns of information

Posted 2024-10-20 16:21:00 · 752 words · 10 views · 0 comments

I have a question that is somewhat high level, so I'll try to be as specific as possible.

I'm doing a lot of research that involves combining disparate data sets with header information that refers to the same entity, usually a company or a financial security. This record linking usually involves header information in which the name is the only common primary identifier, but where some secondary information is often available (such as city and state, dates of operation, relative size, etc). These matches are usually one-to-many, but may be one-to-one or even many-to-many. I have usually done this matching by hand or with very basic text comparison of cleaned substrings. I have occasionally used a simple matching algorithm like a Levenshtein distance measure, but I never got much out of it, in part because I didn't have a good formal way of applying it.

My guess is that this is a fairly common question and that there must be some formalized processes that have been developed to do this type of thing. I've read a few academic papers on the subject that deal with theoretical appropriateness of given approaches, but I haven't found any good source that walks through a recipe or at least a practical framework.

My question is the following:

  • Does anyone know of a good source for implementing multi-dimensional fuzzy record matching, like a book or a website or a published article or working paper?

  • I'd prefer something that had practical examples and a well defined approach.

  • The approach could be iterative, with human checks for improvement at intermediate stages.

  • (edit) The linked data is used for statistical analysis. As such, a little bit of noise is OK, but there is a strong preference for fewer "incorrect matches" over fewer "incorrect non-matches".

  • If they were in Python that would be fantastic, but not necessary.

One last thing, if it matters, is that I don't care much about computational efficiency. I'm not implementing this dynamically and I'm usually dealing with a few thousand records.

Comments (2)

淡墨 2024-10-27 16:21:00

One common method that shouldn't be terribly expensive for "a few thousand records" would be cosine similarity. Although most often used for comparing text documents, you can easily modify it to work with any kind of data.

The linked Wikipedia article is pretty sparse on details, but following its links and doing a few searches will get you some good info, and potentially an implementation that you can modify to fit your purposes. In fact, take a look at "Simple implementation of N-Gram, tf-idf and Cosine similarity in Python".

A simpler calculation, and one that might be "good enough" for your purposes, would be the Jaccard index. The primary difference is that cosine similarity typically takes into account the number of times a word is used in a document and in the entire set of documents, whereas the Jaccard index only cares whether a particular word is in the document. There are other differences, but that one strikes me as the most important.
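To make the difference concrete, here is a minimal sketch of both measures applied to character trigrams of a name field. The function names (`ngrams`, `jaccard`, `cosine`) are my own and hypothetical, and the cosine here uses raw n-gram counts only; it leaves out the tf-idf weighting across the whole set of records mentioned above, so treat it as a starting point rather than a full implementation.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n=3):
    """Character n-grams of a lowercased, whitespace-normalized string."""
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        return [cleaned] if cleaned else []
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def jaccard(a, b, n=3):
    """Shared distinct n-grams divided by all distinct n-grams (presence only)."""
    set_a, set_b = set(ngrams(a, n)), set(ngrams(b, n))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b, n=3):
    """Cosine of the angle between n-gram count vectors (counts matter)."""
    counts_a, counts_b = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[g] * counts_b[g] for g in shared)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Illustrative company names, not real data.
    for left, right in [("Acme Corp", "ACME  corp"), ("Acme Corp", "Acme Corporation")]:
        print(left, "|", right, "|", jaccard(left, right), cosine(left, right))
```

Both functions return a value between 0 and 1, so either one can serve as the per-column similarity for the name field, with a cutoff chosen by inspecting a sample of pairs by hand.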

穿透光 2024-10-27 16:21:00

The problem is that you have an array of distances, at least one for each column, and you want to combine those distances in an optimal way to indicate whether a pair of records are the same thing or not.

This is a classification problem. There are many ways to do it, but logistic regression is one of the simpler methods. To train a classifier, you will need to label some pairs of records as either matches or non-matches.
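As a rough sketch of that idea, the code below turns each pair of records into one similarity per column and fits a tiny hand-rolled logistic regression on a few labeled pairs. Everything here is an assumption made for illustration: the field names (`name`, `city`, `state`), the toy records, the use of `difflib.SequenceMatcher` for the name similarity, and the plain stochastic gradient descent loop. In practice you would label far more pairs and probably use an existing library.

```python
from difflib import SequenceMatcher
from math import exp


def pair_features(rec_a, rec_b):
    """One similarity per column: fuzzy name, exact city, exact state."""
    name_sim = SequenceMatcher(None, rec_a["name"].lower(), rec_b["name"].lower()).ratio()
    city_same = 1.0 if rec_a["city"].lower() == rec_b["city"].lower() else 0.0
    state_same = 1.0 if rec_a["state"].lower() == rec_b["state"].lower() else 0.0
    return [name_sim, city_same, state_same]


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def match_probability(weights, bias, x):
    """Combine the per-column similarities into a single match probability."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = match_probability(weights, bias, x) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def rec(name, city, state):
    return {"name": name, "city": city, "state": state}


# Hypothetical hand-labeled pairs: 1 = same entity, 0 = different entities.
labeled_pairs = [
    (rec("Acme Corp", "Boston", "MA"), rec("ACME Corporation", "Boston", "MA"), 1),
    (rec("Globex Inc", "Chicago", "IL"), rec("Globex Incorporated", "Chicago", "IL"), 1),
    (rec("Initech LLC", "Dallas", "TX"), rec("Initech", "Dallas", "TX"), 1),
    (rec("Acme Corp", "Boston", "MA"), rec("Apex Corp", "Austin", "TX"), 0),
    (rec("Globex Inc", "Chicago", "IL"), rec("Initech LLC", "Dallas", "TX"), 0),
]
X = [pair_features(a, b) for a, b, _ in labeled_pairs]
y = [label for _, _, label in labeled_pairs]
weights, bias = train_logistic(X, y)

if __name__ == "__main__":
    for (a, b, label), x in zip(labeled_pairs, X):
        print(a["name"], "|", b["name"], "|", label, round(match_probability(weights, bias, x), 3))
```

Since the question prefers fewer incorrect matches over fewer incorrect non-matches, one option is to accept a pair only when its probability clears a high cutoff (say 0.9, an arbitrary value here) and send the borderline pairs to a human for review.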

The dedupe Python library helps you do this and other parts of the difficult task of record linkage. The documentation has a pretty good overview of how to approach the problem of record linkage comprehensively.
