多列信息的模糊记录匹配

发布于 2024-10-20 16:21:00 字数 752 浏览 14 评论 0原文

我的问题有点高层次，所以我会尽量具体。

我正在进行大量研究，涉及将不同的数据集与引用同一实体（通常是公司或金融证券）的标头信息相结合。此记录链接通常涉及标头信息，其中名称是唯一常见的主要标识符，但通常可以使用一些辅助信息（例如城市和州、运营日期、相对大小等）。这些匹配通常是一对多，但也可能是一对一甚至多对多。我通常通过手动或通过清理子字符串的非常基本的文本比较来完成此匹配。我偶尔会使用简单的匹配算法，例如 Levenshtein 距离度量，但我从未从中得到太多好处，部分原因是我没有良好的正式方法来应用它。

我的猜测是，这是一个相当常见的问题，并且必须开发一些正式的流程来完成此类事情。我读过一些关于该主题的学术论文，这些论文涉及给定方法的理论适用性，但我还没有找到任何好的资料来介绍秘诀或至少是实用的框架。

我的问题如下：

有谁知道实现多维模糊记录匹配的好来源，例如书籍或网站或已发表的文章或工作论文？
我更喜欢有实际例子和明确定义方法的东西。
该方法可以是迭代的，在中间阶段进行人工检查以进行改进。
（编辑）链接的数据用于统计分析。因此，一点点噪音是可以的，但与更少的“不正确的非匹配”相比，人们更倾向于减少“不正确的匹配”。
如果它们是用 Python 编写的，那就太好了，但不是必需的。

最后一件事（如果重要的话）是我不太关心计算效率。我不是动态实现这个，我通常处理几千条记录。

原文

I have a question that is somewhat high level, so I'll try to be as specific as possible.

I'm doing a lot of research that involves combining disparate data sets with header information that refers to the same entity, usually a company or a financial security. This record linking usually involves header information in which the name is the only common primary identifier, but where some secondary information is often available (such as city and state, dates of operation, relative size, etc). These matches are usually one-to-many, but may be one-to-one or even many-to-many. I have usually done this matching by hand or with very basic text comparison of cleaned substrings. I have occasionally used a simple matching algorithm like a Levenshtein distance measure, but I never got much out of it, in part because I didn't have a good formal way of applying it.

My guess is that this is a fairly common question and that there must be some formalized processes that have been developed to do this type of thing. I've read a few academic papers on the subject that deal with theoretical appropriateness of given approaches, but I haven't found any good source that walks through a recipe or at least a practical framework.

My question is the following:

Does anyone know of a good source for implementing multi-dimensional fuzzy record matching, like a book or a website or a published article or working paper?
I'd prefer something that had practical examples and a well defined approach.
The approach could be iterative, with human checks for improvement at intermediate stages.
(edit) The linked data is used for statistical analysis. As such, a little bit of noise is OK, but there is a strong preference for fewer "incorrect matches" over fewer "incorrect non-matches".
If they were in Python that would be fantastic, but not necessary.

One last thing, if it matters, is that I don't care much about computational efficiency. I'm not implementing this dynamically and I'm usually dealing with a few thousand records.

分享到QQ

分享到微博