Order-insensitive fuzzy matching of "Firstname Lastname" / "Lastname Firstname" in R?

Posted on 2025-01-02 07:25:19

I have two lists of names for the same set of students, collected separately. There are numerous typographical errors, and I have been using fuzzy matching to link the two lists. I am 99+% there with agrep and similar, but am stuck on the following basic problem: how can I match (for example) the names "Adrian Bruce" and "Bruce Adrian"? The Levenshtein edit distance is no good for this particular case, because it counts character-level insertions, deletions and substitutions, so two words in swapped order look like a very distant pair.

This must be a very common problem, but I cannot find any standard R package or routine for addressing it. I presume I am missing something obvious...???

Comments (2)

东京女 2025-01-09 07:25:19

Well, one fairly easy way is to swap the words and match again...

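# Candidate names; the second one is in "Lastname, Firstname" order
y <- c("Bruce Almighty", "Lee, Bruce", "Leroy Brown")
# Swap the text before and after the last space ("Lee, Bruce" becomes "Bruce Lee,")
y2 <- sub("(.*) (.*)", "\\2 \\1", y)

agrep("Bruce Lee", y)  # No match
agrep("Bruce Lee", y2) # Match!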
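
To avoid running the match twice by hand, the swap can be wrapped in a small helper. This is only a sketch: `agrep_either_order` is a hypothetical name, not a function from any package, and it assumes two-part names, with commas stripped before swapping. It returns the indices that match in either word order.

```r
# Hypothetical helper (not from any package): fuzzy-match `pattern`
# against `x` in the original and the swapped word order.
agrep_either_order <- function(pattern, x, ...) {
  # Assumption: commas only separate "Lastname, Firstname", so drop them
  cleaned <- gsub(",", "", x, fixed = TRUE)
  # Swap the text before and after the last space
  swapped <- sub("^(.*) (.*)$", "\\2 \\1", cleaned)
  # Indices that match in either order; extra arguments go to agrep()
  sort(union(agrep(pattern, cleaned, ...), agrep(pattern, swapped, ...)))
}

y <- c("Bruce Almighty", "Lee, Bruce", "Leroy Brown", "Bruce Lee")
agrep_either_order("Bruce Lee", y)  # indices 2 and 4
```

Names with more than two words would need a different rule for what gets swapped.
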
旧夏天 2025-01-09 07:25:19

The technique I usually use is pretty robust and relatively insensitive to ordering, punctuation, etc. It is based on objects called "n-grams": the overlapping runs of n consecutive characters in a string. With n=2 they are called "bigrams". For instance:

"Adrian Bruce" --> ("Ad","dr","ri","ia","an","n "," B","Br","ru","uc","ce")
"Bruce Adrian" --> ("Br","ru","uc","ce","e "," A","Ad","dr","ri","ia","an")

Each string has 11 bigrams, and 9 of them are common to both; only the two bigrams that span the space differ. Thus, the similarity score is very high: 9/11 or 0.818, where 1.000 is a perfect match.

I am not very familiar with R, but if a package does not exist, this technique is very easy to code. You can write code that loops through the bigrams of string 1 and tallies how many are contained in string 2.
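
A minimal base-R sketch of that tally might look like the following. The function names `char_ngrams` and `bigram_similarity` are made up for illustration, and dividing by the larger bigram count is an assumption (the answer above only gives 9/11 for two strings of equal length). Repeated bigrams in string 1 are each counted, so this is a simple tally rather than a strict set measure.

```r
# Hypothetical helper: split a string into overlapping character n-grams
# (n = 2 gives bigrams).
char_ngrams <- function(s, n = 2) {
  len <- nchar(s)
  if (len < n) return(character(0))
  starts <- seq_len(len - n + 1)
  substring(s, starts, starts + n - 1)
}

# Hypothetical helper: tally how many n-grams of `a` occur in `b`.
# Assumption: normalise by the larger n-gram count, so identical
# strings score 1 and a short string cannot score 1 against a long one.
bigram_similarity <- function(a, b, n = 2) {
  ga <- char_ngrams(a, n)
  gb <- char_ngrams(b, n)
  if (length(ga) == 0 || length(gb) == 0) return(0)
  sum(ga %in% gb) / max(length(ga), length(gb))
}

round(bigram_similarity("Adrian Bruce", "Bruce Adrian"), 3)  # 0.818

# Pick the best-scoring candidate for one name
candidates <- c("Bruce Adrian", "Brian Adams", "Leroy Brown")
scores <- vapply(candidates,
                 function(x) bigram_similarity("Adrian Bruce", x),
                 numeric(1))
candidates[which.max(scores)]
```

Lower-casing both strings and stripping punctuation before scoring would probably make this more forgiving of the typographical errors mentioned in the question.
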
